Skip to content

Zoom-in: KV Cache

Karify98·
Cover Image for Zoom-in: KV Cache

When you chat with a Large Language Model, the characters appear on the screen continuously and smoothly. However, few know that to generate the next word, the model must perform billions of matrix multiplication calculations from scratch.

Without the help of the Key-Value Cache (KV Cache), the model's response speed would slow down to a crawl as the conversation gets longer.

Let's zoom in on hardware architecture and inference algorithms.


Layer 1 — The Problem: Auto-Regressive Property and Redundant Compute

Language models are auto-regressive. To generate the $N$-th token, the model needs to read the entire initial input plus the $N-1$ tokens it just generated.

For example, to write the sentence "Machine learning is fun":

  • Step 1: Input Machine $\to$ Output learning
  • Step 2: Input Machine learning $\to$ Output is
  • Step 3: Input Machine learning is $\to$ Output fun

In the Transformer network architecture, the heart of the model is the self-attention mechanism. It helps the model understand relationships between words in a sentence by multiplying three vectors: Query (Q), Key (K), and Value (V) for each token.

Every new token entering the model needs:
- A Query vector (Q) to ask.
- A Key vector (K) to match information against.
- A Value vector (V) to extract content.

At step 2, when processing the word learning, the model must recalculate Query, Key, and Value vectors for the word Machine, even though it was already calculated in step 1. As the conversation grows to thousands of words, this redundancy scales quadratically ($O(N^2)$), causing the model to waste compute recalculating known states.


Layer 2 — The Solution: Calculate Once, Store Always

The Key-Value Cache (KV Cache) was introduced to break this redundant cycle. Its concept is highly intuitive: What has been calculated once is cached, never calculated again.

sequenceDiagram
    participant LLM as Language Model
    participant Cache as KV Cache (VRAM)
    
    Note over LLM, Cache: Processing token 1: 'Machine'
    LLM->>Cache: Save Key and Value vectors of 'Machine'
    Note over LLM, Cache: Generating token 2: 'learning'
    LLM->>LLM: Only calculate Q, K, V for 'learning'
    LLM->>Cache: Fetch Key/Value of 'Machine' + Save Key/Value of 'learning'
    LLM->>LLM: Compute Attention and generate next token

With this mechanism:

  • The model only calculates all three vectors for the newly appeared token at the current step.
  • For all previous tokens, the model simply retrieves their Key and Value vectors directly from the cache.

This optimization eliminates the need to recalculate Key and Value vectors for past tokens, reducing the projection overhead for historical tokens to a constant $O(1)$ cost. However, the model still needs to compute attention weights between the new query and the entire cached sequence, meaning the per-step attention complexity itself still scales linearly ($O(N)$) with the sequence length. This is why AI can respond much faster than computing everything from scratch.

KV Cache trades memory space to save the computational load on the GPU.

Layer 3 — The Trade-Off: Graphics Memory Crisis

While freeing up GPU compute capacity, the KV Cache creates another massive challenge: It consumes an enormous amount of Video RAM (VRAM).

The size of this cache grows linearly with:

  1. Batch size: The number of users chatting concurrently with the model.
  2. Context length: The total length of the conversation history.
  3. Model architecture: The number of network layers and attention heads.

[!WARNING] For a medium-sized model like Llama-3-8B running at a context length of 8,000 tokens with a batch size of 16, the KV Cache alone consumes about 8GB of VRAM – roughly equivalent to the memory required to load the entire weights of the model itself.

If the cache size exceeds physical hardware limits, the system hits an Out of Memory (OOM) error and crashes completely. This is why cache optimization techniques such as Grouped-Query Attention (GQA) or PagedAttention (managing the cache similarly to virtual memory in operating systems) are currently among the hottest topics for LLM performance engineers.


Full picture

sequenceDiagram
    participant Tokenizer as Tokenizer
    participant GPU as GPU Processing
    participant VRAM as KV Cache (VRAM)
    
    Note over Tokenizer, VRAM: First execution (Prefill Phase)
    Tokenizer->>GPU: Inputs prompt (N tokens)
    GPU->>GPU: Computes Q, K, V for all N tokens
    GPU->>VRAM: Saves all Key & Value vectors of N tokens into cache
    GPU-->>Tokenizer: Generates next token (N+1)
    
    Note over Tokenizer, VRAM: Subsequent token generations (Decode Phase)
    Tokenizer->>GPU: Inputs latest token (N+1)
    GPU->>GPU: Computes Q, K, V only for token N+1
    VRAM-->>GPU: Fetches old Key & Value vectors from 1 to N
    GPU->>GPU: Fast attention computation
    GPU->>VRAM: Saves Key & Value of token N+1 into cache
    GPU-->>Tokenizer: Generates token N+2

Takeaway

KV Cache is a core latency optimization technique that prevents LLMs from redundant Key and Value vector calculations on historical tokens during auto-regressive decoding. By saving these matrices in VRAM, the projection overhead for past tokens is reduced to a flat $O(1)$ cost. Although the per-step attention computation still scales linearly ($O(N)$) with sequence length, this optimization eliminates a massive amount of matrix multiplication. This speedup comes at the cost of VRAM consumption, driving the industry to develop memory-efficient layouts like GQA and PagedAttention.

Related Posts