Zoom-in: Decoding Parameters

When you send the same prompt to a Large Language Model multiple times, you get different answers – sometimes highly creative, other times strictly factual. How can a dry mathematical algorithm be "creative" and shift its persona so dynamically?
It all happens behind the scenes of word selection, controlled by three decoding parameters: Temperature, Top-P, and Top-K.
Let's zoom in on the final step before the model outputs a word.
Layer 1 — Logits and Softmax: The Raw Probability Table
Once the model has finished its computation for the next position, its output is not a specific word but a list of raw, unnormalized scores (logits), one for every word in its vocabulary.
This list of scores is then passed through the softmax function to convert it into a probability distribution that sums to 1 (or 100%).
For example, after the sequence: "I want to drink a cup of...", the output probability distribution might look like:
- coffee: 45%
- water: 30%
- tea: 15%
- gasoline: 0.001%
If the model always selects the word with the highest probability (coffee), this is called greedy search, also known as greedy decoding. It tends to make the response repetitive, monotonous, and unnatural. To introduce natural variety, we must inject randomness into the selection process.
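The step from logits to probabilities, followed by a greedy pick, can be sketched in a few lines of plain Python. The logit values and the four-word vocabulary below are made up for illustration and are not taken from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow and does not change the result.
    largest = max(logits.values())
    exps = {tok: math.exp(score - largest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy_pick(probs):
    """Always return the single most probable token (no randomness)."""
    return max(probs, key=probs.get)

# Hypothetical logits for a toy vocabulary (illustrative values only).
logits = {"coffee": 5.0, "water": 4.6, "tea": 3.9, "gasoline": -5.7}
probs = softmax(logits)

print(greedy_pick(probs))  # prints "coffee"
```

Calling `greedy_pick` on the same distribution always returns the same token, which is why greedy output never varies between runs.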
Layer 2 — Temperature: Flattening or Steepening the Probabilities
Temperature ($T$) is a mathematical scaling factor used to adjust the raw scores before they are passed to the softmax function.
The formula for modifying logits: $$\text{Scaled Logit} = \frac{\text{Raw Logit}}{T}$$
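Combining this scaling with the softmax from Layer 1 gives the temperature-scaled softmax. The symbols $z_i$ (raw logit of word $i$) and $p_i$ (its resulting probability) are notation introduced here for illustration:

$$p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

With $T = 1$ this reduces to the plain softmax. As $T$ approaches 0, almost all of the probability mass moves to the word with the largest logit.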
graph TD
A[Logits: Raw Scores] -->|Divided by Temperature T| B[New Logits]
B -->|Passed to Softmax| C[Probability Distribution]
subgraph Temperature Impact
D[Low T: Steep Distribution - Peak words dominate]
E[High T: Flat Distribution - Words share more equal chances]
end
- Low Temperature (approaching 0): High scores are amplified dramatically compared to low scores. The probability distribution becomes very steep, so the model will almost always choose the top word. The output becomes nearly deterministic and consistent (well suited to writing code or solving math problems).
- High Temperature (above 1.0): The gap between logits shrinks. The probability distribution flattens. Words with low initial probability (like gasoline in the example above) suddenly stand a much higher chance of being selected. The output becomes highly creative and surprising, but is prone to hallucinations or complete nonsense.
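The effect of low and high temperature is easy to see on a small example. The sketch below uses three hypothetical logits and a helper name, `softmax_with_temperature`, chosen here for illustration. It treats $T = 0$ as invalid, since dividing by zero is undefined; how a real system handles that case is outside this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate words (illustrative values only).
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    # Low T concentrates probability on the first word; high T spreads it out.
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The ranking of the words never changes with temperature. Only the size of the gaps between their probabilities does.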
Layer 3 — Top-K and Top-P: The Intelligent Filters
If you only increase the temperature, the model becomes more likely to select completely nonsensical words. Therefore, we combine it with two intelligent filters, Top-K and Top-P, to restrict the selection pool.
1. Top-K (Limiting Quantity)
This filter instructs the model to only select from the top $K$ words with the highest probability.
- For example: If $K = 50$, the model completely discards words ranked 51st and below, no matter how high the temperature is set. This guarantees that words outside the top 50 are never selected, which in practice cuts off the long tail of highly out-of-context words.
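A minimal sketch of a Top-K filter follows. The function name `top_k_filter` and the toy distribution are illustrative assumptions, with "milk" added as a filler entry so the probabilities sum to 1. The sketch also renormalizes the surviving probabilities so they sum to 1 again, which is needed before sampling from them.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]  # everything ranked below k is discarded
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical distribution (illustrative values only).
probs = {"coffee": 0.45, "water": 0.30, "tea": 0.15, "milk": 0.09999, "gasoline": 0.00001}

kept = top_k_filter(probs, k=3)
print(sorted(kept))  # ['coffee', 'tea', 'water']
```

Because the cut is by rank, a fixed K keeps the same number of candidates whether the distribution is sharply peaked or nearly flat.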
2. Top-P (Limiting Cumulative Probability)
Also known as nucleus sampling. Instead of picking a fixed number of words like Top-K, Top-P selects the smallest set of top words whose cumulative probability reaches the threshold $P$ (e.g., $P = 0.9$ or 90%). The size of this set changes from one step to the next.
Assume we have the following probabilities:
- Word A: 60%
- Word B: 25%
- Word C: 10%
- Word D: 5%
With P = 0.9 (90%):
The model accumulates A (60%) + B (25%) = 85%, which is still below 90%, so it adds C (10%) to reach 95%, which crosses the 90% threshold.
The model will only sample from [A, B, C]. Word D is discarded completely.
The beauty of Top-P lies in its adaptability: if the top word has an extremely high probability (e.g., the first word already occupies 95% while $P = 0.9$), the model will restrict its choice to just that word. Conversely, if probabilities are evenly spread out, the selection pool expands automatically to encourage creative outputs.
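The worked example above, with words A to D, can be reproduced with a short sketch. The function name `top_p_filter` is a hypothetical helper, and renormalizing the surviving probabilities is an assumption of this sketch so that the kept words form a valid distribution to sample from.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the range (0, 1]")
    kept = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:  # threshold reached: stop adding words
            break
    # Renormalize so the kept probabilities sum to 1.
    return {tok: prob / cumulative for tok, prob in kept.items()}

probs = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
nucleus = top_p_filter(probs, p=0.9)
print(sorted(nucleus))  # ['A', 'B', 'C']

# A sharply peaked distribution shrinks the pool to a single word.
peaked = top_p_filter({"A": 0.95, "B": 0.05}, p=0.9)
print(sorted(peaked))  # ['A']
```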
Full picture
graph TD
Logits[Logits: Vocabulary raw scores] -->|1. Divide by Temperature| Scaled[Scaled Logits]
Scaled -->|2. Softmax Function| Prob[Vocabulary Probability Distribution]
Prob -->|3. Top-K Filter| FilterK[Limit to K highest probability options]
FilterK -->|4. Top-P Filter| FilterP[Limit to cumulative probability P]
FilterP -->|5. Random Sampling| Selection[Select 1 token and generate output]
style Logits fill:#1e293b,stroke:#475569,color:#cbd5e1
style Scaled fill:#1e293b,stroke:#475569,color:#cbd5e1
style Prob fill:#1e293b,stroke:#475569,color:#cbd5e1
style FilterK fill:#1e293b,stroke:#475569,color:#cbd5e1
style FilterP fill:#1e293b,stroke:#475569,color:#cbd5e1
style Selection fill:#1e293b,stroke:#475569,color:#cbd5e1
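The five steps in the diagram can be put together as one function. This is a minimal sketch under stated assumptions, not how any particular library implements sampling. The function name, default values, and toy logits are hypothetical. The Top-P threshold is compared against the probabilities from step 2, whereas some implementations renormalize after Top-K first.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Temperature -> softmax -> Top-K -> Top-P -> random sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    rng = rng or random.Random()

    # 1. Divide every logit by the temperature.
    scaled = {tok: z / temperature for tok, z in logits.items()}

    # 2. Softmax (the max is subtracted for numerical stability).
    largest = max(scaled.values())
    exps = {tok: math.exp(z - largest) for tok, z in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # 3. Top-K: keep only the K most probable tokens.
    ranked = ranked[:top_k]

    # 4. Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # 5. Sample one token; random.choices treats the weights as relative,
    #    so no explicit renormalization is needed here.
    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for a toy vocabulary (illustrative values only).
logits = {"coffee": 5.0, "water": 4.6, "tea": 3.9, "gasoline": -5.7}
rng = random.Random(0)
print(sample_next_token(logits, temperature=0.05, rng=rng))  # prints "coffee"
print(sample_next_token(logits, temperature=1.5, top_k=3, top_p=1.0, rng=rng))
```

At a temperature of 0.05 the top token holds well over 90% of the probability, so the Top-P step keeps only that token and the result is always the same. The second call can return any of the three highest-ranked tokens, but never the fourth.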
Takeaway
Controlling the creativity and predictability of an LLM does not require changing the neural network's weights. The weights still produce the logits, but how conservative or adventurous the output is depends largely on the scaling and filtering applied during the final decoding stage. By tuning how the probability distribution is scaled (Temperature) and truncated (Top-K/Top-P), developers can steer a single static model between two operational extremes: precise and near-deterministic for strict logic and coding, or diverse and surprising for creative brainstorming.