Zoom-in: Decoding Parameters

When you send the same prompt to a Large Language Model multiple times, you get different answers – sometimes highly creative, other times strictly factual. How can a dry mathematical algorithm be "creative" and shift its persona so dynamically?

It all happens behind the scenes of word selection, controlled by three decoding parameters: Temperature, Top-P, and Top-K.

Let's zoom in on the final step before the model outputs a word.

Layer 1 — Logits and Softmax: The Raw Probability Table

Once the model has finished its forward pass for the next position, its output is not a specific word but a list of raw, unnormalized scores (logits), one for every word (more precisely, every token) in its vocabulary.

This list of scores is then passed through the softmax function to convert it into a probability distribution that sums up to 1 (or 100%).

For example, after the sequence: "I want to drink a cup of...", the output probability distribution might look like:

coffee: 45%
water: 30%
tea: 15%
gasoline: 0.001%
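
As a rough illustration of how logits turn into a table like the one above, here is a minimal softmax sketch in plain Python. The logit values are hypothetical and chosen only for demonstration; they are not taken from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow in exp() and does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate words (illustrative values only).
words = ["coffee", "water", "tea", "gasoline"]
logits = [2.0, 1.6, 0.9, -8.0]

probs = softmax(logits)
for word, p in zip(words, probs):
    print(f"{word}: {p:.6f}")
```

Higher logits always map to higher probabilities, and a very negative logit (gasoline here) ends up with a tiny but nonzero share.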

If the model always selects the word with the highest probability (coffee), this is called greedy decoding (also known as greedy search). It tends to make responses repetitive, monotonous, and unnatural. To introduce natural variety, we inject randomness into the selection process by sampling from the distribution instead.
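
The difference between the two strategies can be sketched in a few lines. The helper names `greedy_pick` and `sample_pick` are made up for this example, and the probabilities are the ones from the table above (the rest of the vocabulary is ignored for simplicity).

```python
import random
from collections import Counter

words = ["coffee", "water", "tea", "gasoline"]
probs = [0.45, 0.30, 0.15, 0.00001]

def greedy_pick(words, probs):
    """Always return the single most probable word."""
    return words[probs.index(max(probs))]

def sample_pick(words, probs, rng):
    """Draw one word at random, weighted by probability."""
    # random.choices accepts weights that do not sum to exactly 1.
    return rng.choices(words, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_runs = Counter(greedy_pick(words, probs) for _ in range(1000))
sampled_runs = Counter(sample_pick(words, probs, rng) for _ in range(1000))

print("greedy: ", dict(greedy_runs))
print("sampled:", dict(sampled_runs))
```

Greedy selection returns coffee on every run, while sampling spreads its picks across coffee, water, and tea roughly in proportion to their probabilities.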

Layer 2 — Temperature: Flattening or Steepening the Probabilities

Temperature ($T$) is a mathematical scaling factor used to adjust the raw scores before they are passed to the softmax function.

The formula for modifying logits: $$\text{Scaled Logit} = \frac{\text{Raw Logit}}{T}$$
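
Combining this scaling with the softmax from Layer 1 gives the probability of each word in a single expression. The symbols $z_i$ (raw logit of word $i$) and $p_i$ (its resulting probability) are notation introduced here for illustration:

$$p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

With $T = 1$ this reduces to the ordinary softmax. As $T \to 0$ the largest logit takes nearly all of the probability, and as $T \to \infty$ the distribution approaches uniform.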

graph TD
    A[Logits: Raw Scores] -->|Divided by Temperature T| B[New Logits]
    B -->|Passed to Softmax| C[Probability Distribution]
    
    subgraph Temperature Impact
    D[Low T: Steep Distribution - Peak words dominate]
    E[High T: Flat Distribution - Chances spread more evenly]
    end

- Low Temperature (approaching 0): Dividing by a small $T$ widens the gaps between logits, so high scores dominate low ones even more strongly. The probability distribution becomes very steep, and the model almost always chooses the top word. The output becomes nearly deterministic and consistent (well suited to writing code or solving math).
- High Temperature (above 1.0): The gaps between logits shrink and the probability distribution flattens. Words with a low initial probability (like gasoline in the example above) stand a much higher chance of being selected. The output becomes more varied and surprising, but is more prone to hallucinations or complete nonsense.

→ The higher the temperature, the more willing the model is to sample less common words.
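
To see the effect numerically, the sketch below applies three temperatures to the same set of hypothetical logits. The function name `softmax_with_temperature` and the logit values are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # T = 0 would divide by zero; it is commonly special-cased as greedy selection.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for: coffee, water, tea, gasoline.
logits = [2.0, 1.6, 0.9, -8.0]

results = {}
for t in (0.2, 1.0, 2.0):
    results[t] = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 4) for p in results[t]]}")
```

At T = 0.2 the top word takes most of the probability; at T = 2.0 the top word's share drops and the least likely word gains probability compared with T = 1.0.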

Layer 3 — Top-K and Top-P: The Intelligent Filters

If you only increase the temperature, the model is likely to select completely nonsensical words. Therefore, we combine it with two intelligent filters, Top-K and Top-P, to restrict the selection pool.

1. Top-K (Limiting Quantity)

This filter instructs the model to only select from the top $K$ words with the highest probability.

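For example: If $K = 50$, the model completely discards words ranked 51st and below, no matter how high the temperature is set. This removes the long tail of unlikely words from consideration. Note that $K$ is fixed: when the distribution is sharply peaked, the top 50 can still include poor candidates, which is the weakness Top-P addresses.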
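
A minimal sketch of this filter, assuming the probabilities are held in a plain Python list. The function name `top_k_filter` and the five-word distribution are hypothetical. Renormalizing the surviving probabilities so they sum to 1 again is a common choice, shown here as one reasonable option.

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Indices sorted from most to least probable; keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical distribution over five words.
probs = [0.40, 0.30, 0.15, 0.10, 0.05]
filtered = top_k_filter(probs, k=3)
print([round(p, 4) for p in filtered])
```

With k = 3, the last two words receive a probability of exactly zero, so no amount of random sampling can select them.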

2. Top-P (Limiting Cumulative Probability)

Also known as nucleus sampling. Instead of picking a fixed number of words like Top-K, Top-P selects a dynamic set of top words whose cumulative probability reaches the threshold $P$ (e.g., $P = 0.9$ or 90%).

Assume we have the following probabilities:
- Word A: 60%
- Word B: 25%
- Word C: 10%
- Word D: 5%

With P = 0.9 (90%):
The model accumulates: A (60%) + B (25%) = 85% (not enough) -> adds C (10%) = 95% (exceeded 90%).
The model will only sample from [A, B, C]. Word D is discarded completely.

The beauty of Top-P lies in its adaptability: if the top word has an extremely high probability (e.g., the first word already occupies 95%), the model will restrict its choice to just that word. Conversely, if probabilities are evenly spread out, the selection pool expands automatically to encourage creative outputs.
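
The accumulation described above can be sketched as follows. The function name `top_p_filter` is made up for this example, and renormalizing the kept probabilities is an assumption about how the survivors are then sampled.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top entries whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the range (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # threshold reached: stop adding words
            break
    keep_set = set(keep)
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep_set else 0.0 for i in range(len(probs))]

# Words A, B, C, D from the example above.
spread = top_p_filter([0.60, 0.25, 0.10, 0.05], p=0.9)
# A sharply peaked distribution: the first word alone exceeds the threshold.
peaked = top_p_filter([0.95, 0.03, 0.02], p=0.9)

print([round(x, 4) for x in spread])
print(peaked)
```

In the first call, Word D is dropped and A, B, and C are rescaled to roughly 63%, 26%, and 11%. In the second call, only the first word survives, which shows how the pool shrinks on its own when the model is confident.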

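→ Recommendation: For coding and logic, drop Temperature to 0.0 or close to it. Since dividing by zero is undefined, many APIs and libraries treat a value of exactly 0 as greedy selection. For brainstorming and creative writing, try a Temperature around 0.8 with Top-P around 0.95 as a starting point.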

Full picture

graph TD
    Logits[Logits: Vocabulary raw scores] -->|1. Divide by Temperature| Scaled[Scaled Logits]
    Scaled -->|2. Softmax Function| Prob[Vocabulary Probability Distribution]
    Prob -->|3. Top-K Filter| FilterK[Limit to K highest probability options]
    FilterK -->|4. Top-P Filter| FilterP[Limit to cumulative probability P]
    FilterP -->|5. Random Sampling| Selection[Select 1 token and generate output]
    
    style Logits fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Scaled fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Prob fill:#1e293b,stroke:#475569,color:#cbd5e1
    style FilterK fill:#1e293b,stroke:#475569,color:#cbd5e1
    style FilterP fill:#1e293b,stroke:#475569,color:#cbd5e1
    style Selection fill:#1e293b,stroke:#475569,color:#cbd5e1
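
The five steps in the diagram can be combined into one function. This is a simplified sketch in plain Python, not the implementation of any particular library: the function name `sample_next_token`, its parameters, and the logit values are assumptions, and real implementations usually operate on tensors and may order or combine the filters differently.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Return the index of one sampled token, following the five steps above."""
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    if temperature == 0:
        # Division by zero is undefined, so treat T = 0 as greedy selection.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Steps 1 and 2: scale the logits, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3: keep only the top_k most probable indices (all of them if top_k is None).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    kept_mass = sum(probs[i] for i in order)

    # Step 4: keep the shortest prefix whose renormalized cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Step 5: sample one survivor; random.choices normalizes the weights itself.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Hypothetical vocabulary and logits for illustration.
vocab = ["coffee", "water", "tea", "gasoline"]
logits = [2.0, 1.6, 0.9, -8.0]
rng = random.Random(42)
picks = [vocab[sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng)]
         for _ in range(10)]
print(picks)
```

Because top_k is 3, gasoline can never appear in the output here, regardless of the temperature. Passing temperature=0 returns the index of coffee every time.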

Takeaway

The creativity and predictability of an LLM's output are controlled not by changing the neural network's weights, but largely by the probability filters applied during the final decoding stage. By tuning how the probability distribution is scaled (Temperature) and truncated (Top-K/Top-P), developers can steer a single static model between two operational extremes: precise and near-deterministic for strict logic and coding, or diverse and surprising for creative brainstorming.
