Zoom-in: Quantization

A large language model like Llama-3 with 70 billion parameters normally needs a server with multiple high-end GPUs to run. Yet today a compressed version of the same model can run on a well-equipped personal laptop.
The magic behind this feat is quantization: the art of "slimming down" a model by reducing the numerical precision of its weights.
Let's zoom in on the storage structure of large language models.
Layer 1 — The Problem: The Massive Weight of Parameters
Inside the model, the weights that encode what the system has learned are stored as floating-point numbers. Models are typically trained and distributed at 16-bit (Float16) or 32-bit (Float32) precision.
- At 16-bit precision, each weight consumes 2 bytes of memory.
- For a 70-billion-parameter model, the minimum VRAM required just to load the weights is: $$70 \times 10^9 \times 2 \text{ bytes} = 140\text{ GB}$$
This exceeds the memory capacity of almost all standard personal computers. To run the model at home, we must find a way to shrink the size of these numbers.
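The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `weight_memory_gb` is made up for illustration, and it counts only the weights, not the additional memory a model needs while generating text.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Return the memory needed for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9


PARAMS_70B = 70_000_000_000

for bits in (32, 16):
    print(f"{bits}-bit: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")
# 32-bit: 280 GB
# 16-bit: 140 GB
```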
Layer 2 — The Solution: Mapping from Continuous to Discrete
Quantization is the process of converting weight representations from high-precision floating-point types (like 16-bit Float16) to lower-precision integer types (like 8-bit INT8 or 4-bit INT4).
Imagine you have a 1-meter ruler that can measure down to a micrometer (equivalent to floating-point numbers). Now, you are asked to round all measurements and only use 256 fixed tick marks (equivalent to an 8-bit integer with $2^8 = 256$ states).
Float16 sequence: [-3.1415, -1.2045, 0.0000, 1.8902, 2.7182]
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
INT4 sequence: [-8, -3, 0, 5, 7]
To perform this, we use a scale factor and a zero-point to map the continuous range of real numbers to discrete integer steps: $$\text{Quantized Weight} = \text{round}\left( \frac{\text{Raw Weight}}{\text{Scale}} \right) + \text{Zero-point}$$
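As a rough sketch of the formula, the snippet below quantizes the five example weights to INT4 and then maps them back to floats. The helper names `quantize` and `dequantize` are illustrative, and the choice of scale (largest magnitude mapped to -8, zero-point of 0) is an assumption that happens to reproduce the integers shown in the example sequence; real quantizers pick scales in more elaborate ways.

```python
def quantize(weights, scale, zero_point=0, bits=4):
    """Map floats to signed integers: q = round(w / scale) + zero_point, clamped."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [
        max(q_min, min(q_max, round(w / scale) + zero_point))
        for w in weights
    ]


def dequantize(q_weights, scale, zero_point=0):
    """Approximate the original floats: w ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_weights]


weights = [-3.1415, -1.2045, 0.0000, 1.8902, 2.7182]

# Assumed choice: symmetric quantization, largest magnitude maps to -8.
scale = max(abs(w) for w in weights) / 8

q = quantize(weights, scale)
restored = dequantize(q, scale)

print(q)  # [-8, -3, 0, 5, 7]
print([round(w, 4) for w in restored])
```

The restored values differ from the originals by at most half a step (`scale / 2`), which is the rounding error that quantization trades for a smaller footprint.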
Through this conversion, each weight goes from consuming 2 bytes (16-bit) to just 1 byte (8-bit) or 0.5 bytes (4-bit). The 70B model's weights shrink from 140 GB of graphics memory to about 70 GB at 8-bit and about 35 GB at 4-bit (plus a small overhead for the stored scale factors), which brings the model within reach of high-end consumer hardware.
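A weight can occupy "0.5 bytes" because two 4-bit values share a single byte. The sketch below shows one way to do that packing; `pack_int4` and `unpack_int4` are hypothetical helpers written for illustration, and real formats use their own block layouts that also store scale factors.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) into bytes, two values per byte."""
    if any(not -8 <= v <= 7 for v in values):
        raise ValueError("INT4 values must be in the range -8..7")
    padded = list(values) + [0] * (len(values) % 2)  # pad to an even count
    packed = bytearray()
    for low, high in zip(padded[0::2], padded[1::2]):
        # Keep the low 4 bits of each value (two's complement nibble).
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


q = [-8, -3, 0, 5, 7]
packed = pack_int4(q)
print(len(packed))                  # 3 bytes for 5 weights
print(unpack_int4(packed, len(q)))  # [-8, -3, 0, 5, 7]
```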
Layer 3 — The Trade-Off: Perplexity and Popular Formats
Rounding numbers inevitably introduces errors. For language models, the effect of this error is commonly measured by perplexity, a metric that captures how uncertain the model is when predicting the next token (lower is better). An increase in perplexity means the model predicts text slightly less accurately.
However, empirical studies show that reducing a large model to 4-bit typically increases perplexity by only a small margin (often on the order of 1%, though the exact figure depends on the model and the quantization method), while the benefits of memory savings are immense.
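To make the metric concrete, perplexity is the exponential of the average negative log-probability the model assigns to the correct next tokens. The sketch below computes it for two sets of probabilities; the numbers are hypothetical and chosen only to illustrate a small relative increase, not measurements from any real model.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability of the correct tokens)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities each model assigned to the correct next token.
fp16_probs = [0.50, 0.25, 0.40, 0.80]
int4_probs = [0.49, 0.25, 0.39, 0.80]

ppl_fp16 = perplexity(fp16_probs)
ppl_int4 = perplexity(int4_probs)
increase = (ppl_int4 - ppl_fp16) / ppl_fp16 * 100

print(f"FP16: {ppl_fp16:.3f}  INT4: {ppl_int4:.3f}  increase: {increase:.2f}%")
```

A model that always assigned probability 0.25 to the correct token would have a perplexity of 4, as if it were choosing uniformly among four options at every step.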
graph TD
A[Original Float16 Model: 140GB] -->|Quantization| B(INT8: 70GB - Quality nearly unchanged)
A -->|Quantization| C(INT4: 35GB - Sweet spot for local computers)
A -->|Quantization| D(INT2: 17.5GB - Severe accuracy degradation starts)
style A fill:#1e293b,stroke:#475569,color:#cbd5e1
style B fill:#1e293b,stroke:#475569,color:#cbd5e1
style C fill:#1e293b,stroke:#475569,color:#cbd5e1
style D fill:#1e293b,stroke:#475569,color:#cbd5e1
Currently, there are three popular quantization formats you will encounter when working with local models:
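- GGUF: The file format used by llama.cpp and the tools built on it, well suited to running models on CPUs or Apple Silicon (Macs). This format supports hybrid loading (splitting the model's layers between CPU and GPU).
- AWQ (Activation-aware Weight Quantization): Uses activation statistics to find the small fraction of weights that matter most and protects them from quantization error by rescaling them before rounding, while the remaining weights are quantized aggressively. AWQ models are mainly deployed on dedicated GPUs, where they run fast.
- GPTQ: A Post-Training Quantization (PTQ) method that quantizes weights layer by layer while compensating for the rounding error each step introduces. It targets fast inference on GPUs, particularly Nvidia graphics cards.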
Full picture
graph TD
FloatWeight[Floating-point Weights: Float16 / Float32] -->|Quantization Process| Mapping[Linear Mapping: Divide by Scale & Add Zero-point]
Mapping -->|Round to nearest integer| IntWeight[Integer Weights: INT8 / INT4]
subgraph Hardware Deployment Targets
IntWeight -->|GGUF Format| CPU[CPU & Apple Silicon Macbook]
IntWeight -->|AWQ Format| GPU_AWQ[Dedicated GPUs - Activation-aware quantization]
IntWeight -->|GPTQ Format| GPU_GPTQ[Nvidia GPUs - Error matrix optimization]
end
Takeaway
Quantization is a key optimization technique that makes it possible to run massive Large Language Models (LLMs) on resource-constrained consumer hardware. By rounding floating-point weights (such as Float16) and storing them as low-precision integers (INT8 or INT4), the model size is compressed by a factor of 2 to 4. While rounding introduces a small increase in model perplexity, this loss is usually minor compared to the massive reduction in VRAM consumption, and formats like GGUF, AWQ, and GPTQ have made local AI execution widely accessible.