Understanding temperature and top_p in LLMs

TL;DR

  • temperature reshapes the probability distribution → Adjusts how random the selection is.
  • top_p truncates that distribution → It dynamically removes low-probability words before sampling.
  • Best Practice:
    • temperature = 0.3, top_p = 0.5 → Structured, factual responses.
    • temperature = 0.7, top_p = 0.9 → Balanced, informative output.
    • temperature = 1.2, top_p = 0.95 → Creative storytelling.
    • temperature = 1.5, top_p = 1.0 → Highly imaginative, fun outputs.
  • Analogy:
    • top_p = The shortlist (who gets to enter the competition).
    • temperature = How the winner is picked (strictly on merit, or with some randomness).

1. What is temperature?

temperature controls the randomness of text generation by adjusting the probability distribution of the next token.

  • Higher temperature (> 1.0) → More random, diverse, and creative outputs.
  • Lower temperature (< 0.5) → More deterministic and predictable responses.
  • Extreme cases:
    • T = 0.0 → Always picks the most probable token (greedy decoding; fully deterministic).
    • T > 1.5 → Highly random, may produce nonsensical output.

Mathematically, it divides the logits (the model's raw, unnormalized scores, not probabilities) by T before applying softmax, i.e. p_i = exp(z_i / T) / Σ_j exp(z_j / T), making the distribution sharper (T < 1) or flatter (T > 1).
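To see this concretely, here is a minimal NumPy sketch of temperature-scaled softmax; the logits are made-up values for four hypothetical candidate tokens:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide logits by T, then softmax them into probabilities."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([4.0, 2.5, 1.0, 0.5])  # hypothetical scores for four tokens

for t in (0.3, 1.0, 1.5):
    print(f"T={t}:", apply_temperature(logits, t).round(3))
# Low T concentrates almost all mass on the top token;
# high T spreads it across the alternatives.
```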

2. What is top_p?

top_p (nucleus sampling) dynamically selects a subset of tokens based on cumulative probability mass.

  • Lower top_p (e.g., 0.3-0.5) → Only considers the most probable words.
  • Higher top_p (e.g., 0.9-1.0) → Allows for more diverse token choices.
  • top_p = 1.0 → No filtering, includes all words.

Unlike top_k, which always keeps a fixed number of tokens, top_p keeps the smallest set of tokens whose cumulative probability exceeds p, so the number of candidates adapts to the shape of the distribution.
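Here is a minimal sketch of that filtering step in NumPy (the probability values are made up for illustration):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= p,
    zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]               # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # how many tokens survive
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical distribution over five candidate tokens
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_p_filter(probs, 0.8))  # keeps three tokens (0.5 + 0.2 + 0.15 >= 0.8)
```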

3. Key Difference: Distribution Shaping vs. Word Filtering

  • temperature reshapes the probability distribution → It controls how randomly we pick among the eligible words.
  • top_p decides which words are eligible at all → It dynamically cuts off the low-probability tail before sampling.

4. Example to Illustrate the Difference

Prompt: “The cat sat on the…”

Changing temperature

  • T = 0.3 → “mat.”
  • T = 1.0 → “soft cushion, gazing at the moon.”
  • T = 1.5 → “edge of a time portal, ready to explore a new universe.”

✅ Temperature affects the randomness of chosen words.

Changing top_p

  • top_p = 0.3 → “mat.”
  • top_p = 0.9 → “windowsill, basking in the warm sun.”
  • top_p = 1.0 → “flamingo’s back, dreaming of intergalactic adventures.”

✅ Top-p dynamically limits the set of words that can be sampled at all.

5. How temperature and top_p Work Together

When both are set, common implementations (e.g., Hugging Face transformers) apply them in this order (see the sketch below):

  1. First, temperature rescales the logits, reshaping the probability distribution.
  2. Then, top_p truncates that distribution to the smallest set of words whose cumulative probability exceeds p.
  3. Finally, the next word is sampled from the renormalized remainder.
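Putting the two steps together, here is a minimal sketch that reuses the apply_temperature and top_p_filter helpers from the earlier snippets; the temperature-then-top_p order mirrors common pipelines, which is an assumption about your particular library:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

def sample_next_token(logits: np.ndarray, temperature: float, top_p: float) -> int:
    """Reshape with temperature, truncate with top_p, then draw one token id."""
    probs = apply_temperature(logits, temperature)  # step 1: reshape the distribution
    probs = top_p_filter(probs, top_p)              # step 2: drop the low-probability tail
    return int(rng.choice(len(probs), p=probs))     # step 3: sample from what remains

logits = np.array([4.0, 2.5, 1.0, 0.5])  # hypothetical scores for four tokens
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```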

6. Best Practices for Use Cases

| Use Case                        | Recommended Settings            |
| ------------------------------- | ------------------------------- |
| Factual, structured responses   | temperature = 0.3, top_p = 0.5  |
| Balanced, informative output    | temperature = 0.7, top_p = 0.9  |
| Creative storytelling           | temperature = 1.2, top_p = 0.95 |
| Highly imaginative, fun outputs | temperature = 1.5, top_p = 1.0  |
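As a concrete usage example, here is how the factual preset from the table could be passed to one widely used API, the OpenAI Python SDK; the model name and prompt are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize photosynthesis in one paragraph."}],
    temperature=0.3,  # "factual, structured" setting from the table above
    top_p=0.5,
)
print(response.choices[0].message.content)
```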