Zoom-in: KV Cache

When you chat with a Large Language Model, the characters appear on the screen continuously and smoothly. However, few know that to generate the next word, the model must perform billions of matrix multiplication calculations from scratch.
Without the help of the Key-Value Cache (KV Cache), the model's response speed would slow down to a crawl as the conversation gets longer.
Let's zoom in on hardware architecture and inference algorithms.
Layer 1 — The Problem: Auto-Regressive Property and Redundant Compute
Language models are auto-regressive. To generate the $N$-th token, the model needs to read the entire initial input plus the $N-1$ tokens it just generated.
For example, to write the sentence "Machine learning is fun":
- Step 1: Input
Machine$\to$ Outputlearning - Step 2: Input
Machine learning$\to$ Outputis - Step 3: Input
Machine learning is$\to$ Outputfun
In the Transformer network architecture, the heart of the model is the self-attention mechanism. It helps the model understand relationships between words in a sentence by multiplying three vectors: Query (Q), Key (K), and Value (V) for each token.
Every new token entering the model needs:
- A Query vector (Q) to ask.
- A Key vector (K) to match information against.
- A Value vector (V) to extract content.
At step 2, when processing the word learning, the model must recalculate Query, Key, and Value vectors for the word Machine, even though it was already calculated in step 1. As the conversation grows to thousands of words, this redundancy scales quadratically ($O(N^2)$), causing the model to waste compute recalculating known states.
Layer 2 — The Solution: Calculate Once, Store Always
The Key-Value Cache (KV Cache) was introduced to break this redundant cycle. Its concept is highly intuitive: What has been calculated once is cached, never calculated again.
sequenceDiagram
participant LLM as Language Model
participant Cache as KV Cache (VRAM)
Note over LLM, Cache: Processing token 1: 'Machine'
LLM->>Cache: Save Key and Value vectors of 'Machine'
Note over LLM, Cache: Generating token 2: 'learning'
LLM->>LLM: Only calculate Q, K, V for 'learning'
LLM->>Cache: Fetch Key/Value of 'Machine' + Save Key/Value of 'learning'
LLM->>LLM: Compute Attention and generate next token
With this mechanism:
- The model only calculates all three vectors for the newly appeared token at the current step.
- For all previous tokens, the model simply retrieves their Key and Value vectors directly from the cache.
This optimization eliminates the need to recalculate Key and Value vectors for past tokens, reducing the projection overhead for historical tokens to a constant $O(1)$ cost. However, the model still needs to compute attention weights between the new query and the entire cached sequence, meaning the per-step attention complexity itself still scales linearly ($O(N)$) with the sequence length. This is why AI can respond much faster than computing everything from scratch.
Layer 3 — The Trade-Off: Graphics Memory Crisis
While freeing up GPU compute capacity, the KV Cache creates another massive challenge: It consumes an enormous amount of Video RAM (VRAM).
The size of this cache grows linearly with:
- Batch size: The number of users chatting concurrently with the model.
- Context length: The total length of the conversation history.
- Model architecture: The number of network layers and attention heads.
[!WARNING] For a medium-sized model like Llama-3-8B running at a context length of 8,000 tokens with a batch size of 16, the KV Cache alone consumes about 8GB of VRAM – roughly equivalent to the memory required to load the entire weights of the model itself.
If the cache size exceeds physical hardware limits, the system hits an Out of Memory (OOM) error and crashes completely. This is why cache optimization techniques such as Grouped-Query Attention (GQA) or PagedAttention (managing the cache similarly to virtual memory in operating systems) are currently among the hottest topics for LLM performance engineers.
Full picture
sequenceDiagram
participant Tokenizer as Tokenizer
participant GPU as GPU Processing
participant VRAM as KV Cache (VRAM)
Note over Tokenizer, VRAM: First execution (Prefill Phase)
Tokenizer->>GPU: Inputs prompt (N tokens)
GPU->>GPU: Computes Q, K, V for all N tokens
GPU->>VRAM: Saves all Key & Value vectors of N tokens into cache
GPU-->>Tokenizer: Generates next token (N+1)
Note over Tokenizer, VRAM: Subsequent token generations (Decode Phase)
Tokenizer->>GPU: Inputs latest token (N+1)
GPU->>GPU: Computes Q, K, V only for token N+1
VRAM-->>GPU: Fetches old Key & Value vectors from 1 to N
GPU->>GPU: Fast attention computation
GPU->>VRAM: Saves Key & Value of token N+1 into cache
GPU-->>Tokenizer: Generates token N+2
Takeaway
KV Cache is a core latency optimization technique that prevents LLMs from redundant Key and Value vector calculations on historical tokens during auto-regressive decoding. By saving these matrices in VRAM, the projection overhead for past tokens is reduced to a flat $O(1)$ cost. Although the per-step attention computation still scales linearly ($O(N)$) with sequence length, this optimization eliminates a massive amount of matrix multiplication. This speedup comes at the cost of VRAM consumption, driving the industry to develop memory-efficient layouts like GQA and PagedAttention.
Related Posts
Zoom-in: Rate Limiter
You send too many API requests, and the system responds with '429 Too Many Requests'. How does the Rate Limiter gatekeeper protect system resources?
Zoom-in: WebSocket
Your chat app updates instantly without reloading the page. How WebSocket breaks free from the one-way limits of HTTP.
Zoom-in: Virtual Memory
Run multiple apps at once, and each one acts like it owns all of your RAM. How does the operating system isolate memory space so securely?