Zoom-in: Quantization

A large language model like Llama-3 with 70 billion parameters typically requires dedicated server infrastructure with multiple high-end GPUs to run at full precision. However, a quantized version of the same model can run on a single well-equipped personal computer.

The magic behind this feat is quantization – the art of "slimming down" a model by reducing the numerical precision of its weights.

Let's zoom in on the storage structure of large language models.

Layer 1 — The Problem: The Massive Weight of Parameters

Inside the model, the weights that determine the system's knowledge are stored as floating-point numbers. Models are typically trained and distributed at 16-bit (Float16 or BFloat16) or 32-bit (Float32) precision.

At 16-bit precision, each weight consumes 2 bytes of memory.
For a 70-billion-parameter model, the minimum VRAM required just to hold the weights is: $$70 \times 10^9 \times 2 \text{ bytes} \approx 140 \text{ GB}$$

This exceeds the hardware capacity of almost all standard personal computers. To run the model at home, we must find a way to shrink the size of these numbers.
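
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `weight_memory_gb` is hypothetical, it uses decimal gigabytes (1 GB = 10^9 bytes), and it counts only the weights, not activations or any other runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 10**9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9


params_70b = 70e9  # 70 billion parameters

print(weight_memory_gb(params_70b, 16))  # 140.0
print(weight_memory_gb(params_70b, 32))  # 280.0
```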

Layer 2 — The Solution: Mapping from Continuous to Discrete

Quantization is the process of converting weight representations from high-precision floating-point types (like 16-bit Float16) to lower-precision integer types (like 8-bit INT8 or 4-bit INT4).

Imagine you have a 1-meter ruler that can measure down to a micrometer (equivalent to floating-point numbers). Now, you are asked to round all measurements and only use 256 fixed tick marks (equivalent to an 8-bit integer with $2^8 = 256$ states).

Float16 sequence:    [-3.1415,  -1.2045,  0.0000,  1.8902,  2.7182]
                           │         │       │        │        │
                           ▼         ▼       ▼        ▼        ▼
INT4 sequence:       [-8,       -3,      0,       5,       7]

To perform this, we use a scale factor and a zero-point to map the continuous range of real numbers to discrete integer steps: $$\text{Quantized Weight} = \text{round}\left( \frac{\text{Raw Weight}}{\text{Scale}} \right) + \text{Zero-point}$$
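
As a rough illustration, the formula can be applied to the five example weights shown earlier. The function names `quantize` and `dequantize` are hypothetical. The scale is assumed to be the largest absolute weight divided by 8 with a zero-point of 0, which is one choice that reproduces the INT4 sequence in the diagram. Real quantizers pick scales differently, often per block of weights.

```python
def quantize(weights, scale, zero_point=0, bits=4):
    """Apply round(w / scale) + zero_point, then clamp to the signed integer range."""
    qmin = -(2 ** (bits - 1))     # -8 for 4-bit
    qmax = 2 ** (bits - 1) - 1    # +7 for 4-bit
    return [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]


def dequantize(q_weights, scale, zero_point=0):
    """Approximate inverse: recover real values from the stored integers."""
    return [(q - zero_point) * scale for q in q_weights]


weights = [-3.1415, -1.2045, 0.0000, 1.8902, 2.7182]

# Assumption: symmetric scale chosen so the most negative weight lands on -8.
scale = max(abs(w) for w in weights) / 8

q = quantize(weights, scale)
print(q)  # [-8, -3, 0, 5, 7]

# The recovered values are close to, but not exactly, the originals.
restored = dequantize(q, scale)
print([round(r, 4) for r in restored])
```

The gap between `weights` and `restored` is the rounding error that quantization introduces.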

Through this conversion, each weight goes from consuming 2 bytes (16-bit) to just 1 byte (8-bit) or 0.5 bytes (4-bit). The 70B model's footprint shrinks from 140GB of graphics memory to approximately 35GB in its 4-bit version, bringing it within reach of a high-memory consumer machine.
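
To see how a weight can occupy only half a byte, here is a minimal sketch that packs two signed 4-bit integers into each byte. The helper names `pack_int4` and `unpack_int4` and the low-nibble-first layout are assumptions for illustration. Real formats use their own block layouts and also store scale values, which adds some overhead on top of 0.5 bytes per weight.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Masking with 0x0F keeps the two's-complement 4-bit pattern.
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


q = [-8, -3, 0, 5, 7]
packed = pack_int4(q)

print(len(packed))                   # 3 bytes hold 5 weights (plus one padding nibble)
print(unpack_int4(packed, len(q)))   # [-8, -3, 0, 5, 7]
```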

→ Quantization reduces model size by 2 times (8-bit) to 4 times (4-bit) relative to 16-bit, without requiring expensive hardware upgrades.

Layer 3 — The Trade-Off: Perplexity and Popular Formats

Rounding numbers inevitably introduces errors. For language models, this error is commonly measured by perplexity – a metric reflecting how uncertain the model is when predicting the next token. Higher perplexity after quantization indicates that the model's predictions have become less accurate.
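
Perplexity is conventionally computed as the exponential of the average negative log-probability the model assigns to the correct next tokens. The sketch below uses made-up probabilities rather than output from a real model, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities a model might assign to the correct next token.
confident = [0.5, 0.5, 0.5, 0.5]
less_confident = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident))       # approximately 2
print(perplexity(less_confident))  # approximately 4
```

A perplexity of about 4 can be read as the model being, on average, as unsure as if it were choosing uniformly among 4 tokens.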

However, empirical studies show that reducing a large model to 4-bit typically increases perplexity by only a small margin (often cited as less than 1%, though the exact figure varies with the model and the quantization method), while the memory savings are immense.

graph TD
    A[Original Float16 Model: 140GB] -->|Quantization| B(INT8: 70GB - Near-lossless accuracy)
    A -->|Quantization| C(INT4: 35GB - Sweet spot for local computers)
    A -->|Quantization| D(INT2: 17.5GB - Severe accuracy degradation starts)
    style A fill:#1e293b,stroke:#475569,color:#cbd5e1
    style B fill:#1e293b,stroke:#475569,color:#cbd5e1
    style C fill:#1e293b,stroke:#475569,color:#cbd5e1
    style D fill:#1e293b,stroke:#475569,color:#cbd5e1
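
The trend in the diagram above can be illustrated without a real model by quantizing synthetic weights and measuring how far the round trip moves them. This is only a proxy: it measures weight error, not perplexity. The Gaussian weights, the single per-tensor scale, and the name `roundtrip_error` are all assumptions made for the sketch.

```python
import random


def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric quantization to `bits` and back."""
    levels = 2 ** (bits - 1)
    scale = max(abs(w) for w in weights) / levels
    qmin, qmax = -levels, levels - 1
    total = 0.0
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # quantize and clamp
        total += abs(w - q * scale)                 # compare with dequantized value
    return total / len(weights)


random.seed(0)  # fixed seed so repeated runs give the same numbers
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: roundtrip_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error grows as bits are removed, with the largest jump at 2-bit.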

Currently, there are three popular quantization formats and methods you will encounter when working with local models:

GGUF: The file format used by llama.cpp, well optimized for running models on CPUs or Apple Silicon (Macs). It supports hybrid loading (offloading some layers to the GPU while the rest run on the CPU).
AWQ (Activation-aware Weight Quantization): Uses activation statistics to identify the small fraction of weights that matter most and protects them by rescaling before quantization, so that all weights can still be stored at low precision. AWQ models run efficiently on dedicated GPUs.
GPTQ: A Post-Training Quantization (PTQ) method that quantizes weights layer by layer, using approximate second-order information to minimize the resulting output error. It delivers fast inference on Nvidia graphics cards.

→ Recommendation: If running on a personal computer without a high-end GPU, prioritize a 4-bit GGUF variant such as Q4_K_M. It is widely regarded as a good balance between resource consumption and model accuracy.

Full picture

graph TD
    FloatWeight[Floating-point Weights: Float16 / Float32] -->|Quantization Process| Mapping[Linear Mapping: Divide by Scale & Add Zero-point]
    Mapping -->|Round to nearest integer| IntWeight[Integer Weights: INT8 / INT4]
    
    subgraph Hardware Deployment Targets
        IntWeight -->|GGUF Format| CPU[CPU & Apple Silicon Macs]
        IntWeight -->|AWQ Format| GPU_AWQ[Dedicated GPUs - Activation-aware quantization]
        IntWeight -->|GPTQ Format| GPU_GPTQ[Nvidia GPUs - Layer-wise error minimization]
    end

Takeaway

Quantization is a key optimization technique that makes it possible to run massive Large Language Models (LLMs) on resource-constrained consumer hardware. Rounding floating-point weights (like 16-bit Float16) and storing them as low-precision integers (like 8-bit INT8 or 4-bit INT4) compresses the model by 2 to 4 times. While rounding introduces a small increase in model perplexity, this loss is usually minor compared to the large reduction in VRAM consumption, which is what makes local AI execution practical via formats like GGUF, AWQ, and GPTQ.
