
LLM Hyperparameters Guide: Temperature, Top-P, and Top-K Explained

Why does ChatGPT sometimes write poetry and sometimes write code? It's all in the settings. We explain how to tune LLMs for creativity vs. precision.

Abstract Algorithms · 16 min read

TLDR: Temperature, Top-p, and Top-k are three sampling controls that determine how "creative" or "deterministic" an LLM's output is. Temperature rescales the probability distribution; Top-k limits the candidate pool by count; Top-p limits it by cumulative probability. Understanding all three — and when to combine them — is essential for production LLM work.


📖 The Three Knobs Behind Every LLM Response

Think of sampling controls as three dials on a mixing board:

  • Temperature — the creativity dial. Turn it up for surprising outputs; turn it down for safe, predictable ones. It reshapes the entire probability distribution before any other control runs.
  • Top-k — the candidate hard-cap. Keeps only the k most likely tokens in play. Like saying "only the first five dishes on the menu" — everything else is off the table.
  • Top-p — the smart filter. Keeps the smallest set of tokens whose combined probability reaches p. Like saying "only dishes that together account for 90% of what you normally order" — the set grows or shrinks based on the model's confidence.

These controls do not change what the model learned. They only change how it samples its next word from what it already knows. Understanding the difference between a model's weights (fixed) and its sampling behavior (configurable) is the first thing to get right.


🔍 The Basics: What Sampling Parameters Control

After an LLM finishes processing your prompt, it outputs a probability distribution over its entire vocabulary — potentially 50,000+ tokens. Sampling parameters control how you draw a single token from that distribution.

The three parameters and their roles:

  • Temperature — scales the probability distribution. Low values concentrate probability on the top tokens; high values spread it out.
  • Top-k — restricts sampling to the k most probable tokens. Hard cutoff by count.
  • Top-p (nucleus sampling) — restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Dynamic cutoff by probability mass.

These controls are not model weights — they don't change what the model learned. They change how you sample from its output distribution at inference time. The same model, same prompt, same weights produces dramatically different outputs depending on these settings.
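As a concrete illustration of that last step, drawing one token from a probability distribution is just a weighted random draw. A minimal sketch in plain Python, using a made-up four-token vocabulary:

```python
import random

# Hypothetical next-token distribution over a tiny, made-up vocabulary
vocab = ["mat", "rug", "floor", "moon"]
probs = [0.52, 0.35, 0.09, 0.04]

random.seed(0)  # fix the seed so the draw is reproducible
token = random.choices(vocab, weights=probs, k=1)[0]
print(token)
```

Every decoding strategy in this post reduces to reshaping or truncating `probs` before this final draw.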

Default behavior without these controls:

| Configuration | Result | Typical symptom |
| --- | --- | --- |
| Temperature = 0 | Always pick the most probable token | Repetitive, deterministic output |
| Temperature = 1, no Top-k/Top-p | Unconstrained sampling | Occasionally incoherent |
| Temperature = 2 | Near-uniform sampling | Random word salad |

The art of LLM tuning is calibrating these parameters to the task — maximizing quality without sacrificing reliability.


🌡️ Temperature: Sharpening or Flattening the Probability Curve

After the model computes raw scores (logits) for each candidate token, it applies a softmax to get probabilities. Temperature $T$ scales the logits before softmax:

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

| Temperature | Effect | Typical use |
| --- | --- | --- |
| T = 0.0 | Always the most probable token (greedy) | Math, code, factual Q&A |
| T = 0.1–0.4 | Focused, near-deterministic | Structured extraction, SQL generation |
| T = 0.7–0.9 | Balanced creativity | General chat, summarization |
| T = 1.0 | Raw distribution, no scaling | Experimentation baseline |
| T > 1.0 | Flattened distribution — more random | Brainstorming, poetry |

Concrete example — logits for three tokens: Mat=2.0, Rug=1.5, Moon=0.2

| T | P(Mat) | P(Rug) | P(Moon) |
| --- | --- | --- | --- |
| 0.5 | ~0.72 | ~0.26 | ~0.02 |
| 1.0 | ~0.56 | ~0.34 | ~0.09 |
| 2.0 | ~0.46 | ~0.36 | ~0.19 |

At T=0.5, "Mat" wins ~72% of the time — focused. At T=2.0, "Moon" steals ~19% — surprising.
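These numbers can be recomputed in a few lines of plain Python (a minimal sketch of the softmax-with-temperature formula above; values rounded to two decimals):

```python
import math

def softmax_with_temperature(logits, T):
    """Scale logits by 1/T, then apply softmax."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.2]  # Mat, Rug, Moon
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    print(T, [round(x, 2) for x in p])
# 0.5 [0.72, 0.26, 0.02]
# 1.0 [0.56, 0.34, 0.09]
# 2.0 [0.46, 0.36, 0.19]
```

Note how the gap between "Mat" and "Moon" widens at T=0.5 and narrows at T=2.0, while the ranking never changes.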


🎯 Top-k: Hard Cap on the Candidate Pool

Top-k restricts sampling to only the k highest-probability next tokens. After Temperature scaling, only the top k tokens are kept; all others are set to zero probability (and renormalized).

All tokens: [Mat=0.52, Rug=0.35, Floor=0.09, Moon=0.04, ...]
Top-k = 2:  [Mat=0.52, Rug=0.35] → renormalized → [Mat=0.60, Rug=0.40]
Moon is impossible.
| Top-k value | Effect |
| --- | --- |
| k = 1 | Greedy decoding — always the most probable token |
| k = 10 | Small focused vocabulary, low surprise |
| k = 40–50 | Standard for chat models |
| k = vocabulary size | No restriction — same as not using Top-k |

Trade-off: Top-k is a static threshold. If the distribution is wide (many roughly equal options), k=10 may still allow incoherent tokens. If the distribution is narrow (one dominant token), k=50 may allow long-tail noise.
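A sketch of the Top-k step in plain Python, reusing the example distribution above (the helper name `top_k_filter` is ours for illustration, not a library API):

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, drop the rest, renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

dist = {"Mat": 0.52, "Rug": 0.35, "Floor": 0.09, "Moon": 0.04}
print(top_k_filter(dist, 2))  # roughly {'Mat': 0.60, 'Rug': 0.40}; Moon is impossible
```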


🔵 Top-p (Nucleus Sampling): Dynamic Candidate Pool by Mass

Top-p is more adaptive — it includes the smallest set of tokens whose cumulative probability reaches p. The key word is cumulative: the candidate set size changes dynamically.

Sorted: [Mat=0.50, Rug=0.30, Floor=0.10, Moon=0.05, Star=0.03, ...]
Top-p = 0.9:
  Cumulative after Mat:   0.50  (< 0.9, continue)
  Cumulative after Rug:   0.80  (< 0.9, continue)
  Cumulative after Floor: 0.90  (= 0.9, stop)
Nucleus: {Mat, Rug, Floor}

If the model is very confident, the nucleus may be just 1–2 tokens. If uncertain, it may include 50+.
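The same walk-through as runnable Python (the helper name `top_p_filter` is illustrative, not a library API):

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of probability-sorted tokens whose
    cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for tok, prob in ranked:
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

dist = {"Mat": 0.50, "Rug": 0.30, "Floor": 0.10, "Moon": 0.05, "Star": 0.03}
print(sorted(top_p_filter(dist, 0.9)))  # nucleus: Floor, Mat, Rug
```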


📊 How Token Sampling Flows: The Complete Pipeline

flowchart TD
    A[LLM produces raw logits for all tokens] --> B[Apply Temperature scaling]
    B --> C{Top-k enabled?}
    C -->|Yes| D[Keep top k tokens, zero rest]
    C -->|No| E[Keep all tokens]
    D --> F{Top-p enabled?}
    E --> F
    F -->|Yes| G[Keep smallest set with cumulative prob >= p]
    F -->|No| H[Keep current token set]
    G --> I[Renormalize probabilities to sum to 1]
    H --> I
    I --> J[Sample one token from renormalized distribution]
    J --> K[Append token to output]
    K --> L{End of sequence?}
    L -->|No| A
    L -->|Yes| M[Return complete response]

Key observation: Temperature, Top-k, and Top-p are applied in sequence, not simultaneously. Temperature reshapes the entire distribution first; Top-k then hard-caps the candidates; Top-p further filters by cumulative probability. Each step makes the sampling more constrained.
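The whole pipeline above fits in one small function. This is an illustrative sketch, not any particular library's implementation; following common API conventions, `top_k=0` and `top_p=1.0` mean "disabled":

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Temperature -> Top-k -> Top-p -> renormalize -> sample, in sequence."""
    # 1. Temperature scaling + softmax
    exps = {t: math.exp(z / temperature) for t, z in logits.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # 2. Top-k: hard-cap the candidate pool by count
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-p: truncate the tail once cumulative probability reaches p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 4. Renormalize what survives and draw one token
    total = sum(p for _, p in kept)
    rng = random.Random(seed)
    return rng.choices([t for t, _ in kept],
                       weights=[p / total for _, p in kept])[0]

logits = {"Mat": 2.0, "Rug": 1.5, "Floor": 0.6, "Moon": 0.2}
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9, seed=42))
```

Each step only ever removes or reweights candidates, which is why applying the three controls in sequence can only make sampling more constrained.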

When controls are most impactful:

| Model confidence | Without controls | With Top-p = 0.9 |
| --- | --- | --- |
| High (one clear best token) | Fine — little difference | Very few candidates |
| Medium (several good options) | Good variety | Controlled variety |
| Low (uniform distribution) | Random noise | Trimmed to high-mass tokens only |

⚙️ How Temperature, Top-k, and Top-p Interact

These three controls are typically applied in sequence: Temperature → Top-k → Top-p.

# OpenAI API — all three at once
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about databases."}],
    temperature=0.8,
    top_p=0.95,
    # Note: OpenAI API doesn't expose top_k directly; Anthropic and local models do
)

Interaction gotcha: Setting both Top-k and Top-p is redundant in most cases — whichever is more restrictive wins. Most practitioners use either temperature + top-p OR temperature alone (at T=0 for deterministic tasks).
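A tiny demonstration of that gotcha: with both filters on, the surviving candidate set is decided by whichever cutoff bites first (toy numbers below, treated as already temperature-scaled and sorted):

```python
# Toy distribution, already temperature-scaled and sorted by probability
probs = [("a", 0.40), ("b", 0.25), ("c", 0.15), ("d", 0.10), ("e", 0.10)]

def survivors(probs, top_k, top_p):
    """Apply the top-k count cap, then the top-p cumulative cutoff."""
    kept, cumulative = [], 0.0
    for tok, p in probs[:top_k]:      # top-k: hard cap by count
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:       # top-p: cumulative-mass cutoff
            break
    return kept

print(survivors(probs, top_k=4, top_p=0.95))  # top-k binds: ['a', 'b', 'c', 'd']
print(survivors(probs, top_k=5, top_p=0.6))   # top-p binds: ['a', 'b']
```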

📊 Sampling Strategy Comparison

flowchart LR
    LG[Logits] --> TS[Temperature Scale]
    TS --> TK[Top-K Filter]
    TK --> TP[Top-P Nucleus]
    TP --> SM[Softmax]
    SM --> SMP[Sample Token]

🧠 Deep Dive: How Temperature Reshapes the Logit Distribution

Temperature divides raw logit scores before Softmax. Dividing by T < 1 sharpens differences — the highest-logit token dominates further. Dividing by T > 1 flattens them — low-probability tokens gain share. The math: P_i = exp(z_i/T) / Σ exp(z_j/T). This is applied pre-normalization, giving maximum control over distribution shape. The model's weights never change — only the sampling distribution at inference time shifts.


🔬 Internals

Temperature T rescales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T). As T→0 the distribution collapses to argmax (greedy); as T→∞ it becomes uniform. Top-p (nucleus) sampling first sorts tokens by probability, then truncates the tail such that the cumulative probability reaches p — the effective vocabulary shrinks dynamically with each token.

⚡ Performance Analysis

Temperature and sampling add negligible latency (<1 ms per token) since they operate on logit vectors post-inference. However, high temperature with large top-k (e.g., k=200) increases output variance: in creative tasks this raises diversity by ~40% while increasing factual error rate by ~15–25%. For RAG and tool-calling, T=0 reduces hallucination rate by 30–50% compared to T=0.7.

🌍 Real-World Applications of Sampling Control

Sampling parameters are not set-and-forget. Different applications need different profiles:

Customer support chatbots: Use T = 0.3–0.5 for consistent, professional responses. Add Top-p = 0.9 to avoid the rare incoherent outputs that a T = 0.3 with wide vocabulary can produce. Consistency matters more than creativity.

Code generation assistants: Use T = 0.0 or T = 0.1. Code has right and wrong answers. Any randomness above 0.2 increases the chance of syntactically valid but semantically incorrect suggestions. Most production copilots use deterministic sampling for function bodies.

Creative writing tools: Use T = 0.9–1.1 with Top-p = 0.95 for maximum creative variance. The high temperature ensures surprising word choices; the Top-p cap prevents complete incoherence.

Summarization pipelines: Use T = 0.5 with Top-k = 40. Summarization needs coherent output but not creativity. The Top-k cap keeps the vocabulary tight while allowing some variation across runs.

RAG (retrieval-augmented generation) systems: Use T = 0.2–0.3. The retrieved context is doing the creative heavy lifting; the model's job is accurate extraction and synthesis, not generation. High temperature directly increases hallucination risk in this setting.

A/B testing tip: When testing prompt changes, fix your sampling settings. Variable temperature makes it impossible to isolate whether quality changes are from your prompt or from sampling variance.


🧪 Practical Sampling Settings by Task

This reference table consolidates the recommended Temperature, Top-p, and Top-k settings for seven common LLM task types in a single scannable view. It was chosen because real applications rarely serve just one task type — a production endpoint might handle code generation, factual Q&A, and creative writing through the same model, and each task demands a fundamentally different sampling regime. Read the Why column first: it explains the reasoning behind each combination, which is more durable knowledge than memorising specific numbers.

| Task | Temperature | Top-p | Top-k | Why |
| --- | --- | --- | --- | --- |
| Code generation | 0.1 | — | — | Correctness over creativity |
| SQL query | 0.0 | — | — | Deterministic, one valid answer |
| Factual Q&A | 0.2 | 0.1 | — | Focused but not totally greedy |
| General chat | 0.7 | 0.9 | — | Natural, fluent, slightly varied |
| Summarization | 0.5 | — | 40 | Coherent, limited word choice |
| Creative writing | 0.9 | 0.95 | — | High creative variance |
| Brainstorming | 1.1 | 0.99 | — | Maximum divergence |

The hallucination risk rule: raising temperature increases creativity and hallucination together. For grounded tasks (factual, code), always prefer T < 0.3.


⚖️ Trade-offs: Failure Modes and Mitigation

| Failure | Root setting | Symptom | Fix |
| --- | --- | --- | --- |
| Repetition loop | T too low (< 0.1) | "I am, I am, I am..." | Use repetition_penalty or raise T slightly |
| Hallucination | T too high | Confident false facts | Lower T; add RAG grounding |
| Incoherence | T > 1.5 | Disconnected word salad | Lower T |
| Boring output | T too low + Top-k too small | Formulaic, repetitive text | Raise T or Top-p |
| Wrong schema | T any value | Structured output fails parse | Use T=0 + structured output mode |

🧭 Decision Guide: Choosing Sampling Settings by Task

Use the table below as a quick reference when you first approach a new task and don't yet know your ideal settings. The seven-row task table in the previous section gives finer-grained per-task detail; this guide focuses on the broader decision rule per scenario.

| Task type | Temperature | Top-p | Rule of thumb |
| --- | --- | --- | --- |
| Factual Q&A / SQL | 0.0–0.2 | — | Determinism beats creativity |
| Code generation | 0.1 | — | One correct answer exists |
| General chat | 0.7 | 0.9 | Balance fluency and variety |
| Creative writing | 0.9–1.1 | 0.95 | Maximize diversity |
| RAG / grounded tasks | ≤ 0.3 | — | Low T reduces hallucination risk |

When unsure, start at T = 0.7 and adjust: raise if outputs feel repetitive, lower if they become erratic or hallucinate.
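One way to make this guide operational is a small preset map. The names below (`SAMPLING_PRESETS`, `settings_for`) are hypothetical, and the values are the table's starting points, not universal truths; tune per task after measuring on your own data:

```python
# Illustrative starting points only, taken from the decision guide above
SAMPLING_PRESETS = {
    "factual_qa": {"temperature": 0.1, "top_p": 1.0},
    "code":       {"temperature": 0.1, "top_p": 1.0},
    "chat":       {"temperature": 0.7, "top_p": 0.9},
    "creative":   {"temperature": 1.0, "top_p": 0.95},
    "rag":        {"temperature": 0.3, "top_p": 1.0},
}

def settings_for(task: str) -> dict:
    """Fall back to the balanced-chat preset when the task is unknown."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["chat"])

print(settings_for("code"))     # {'temperature': 0.1, 'top_p': 1.0}
print(settings_for("unknown"))  # {'temperature': 0.7, 'top_p': 0.9}
```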

📊 Hyperparameter Selection Guide

flowchart TD
    TK[Task Type] --> CR{Creative task?}
    CR -- Yes --> HT[High temp 0.8-1.2]
    CR -- No --> FC{Factual task?}
    FC -- Yes --> LT[Low temp 0.1-0.3]
    FC -- No --> CD{Code task?}
    CD -- Yes --> CT[Near-zero temp 0.0-0.1]
    CD -- No --> MT[Medium temp 0.5]

🎯 What to Learn Next


🛠️ HuggingFace Transformers: Setting Temperature, Top-p, and Top-k in GenerationConfig

HuggingFace Transformers is the open-source Python library that provides GenerationConfig — the canonical object for setting temperature, top-p, top-k, and every other sampling parameter discussed in this post. It is used by researchers, product engineers, and inference servers (vLLM, TGI) alike, making it the universal reference for understanding how sampling parameters translate into actual model behavior.

The GenerationConfig API solves the Lesson 5 problem from this post (log your sampling parameters with every production request): it is a serializable, version-controllable config object that can be saved to disk, loaded from a checkpoint, and attached to any request log for exact reproducibility.

# pip install transformers accelerate torch

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

model_name = "gpt2"  # swap for "mistralai/Mistral-7B-v0.1" etc. with sufficient VRAM
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The most reliable database for financial transactions is"
inputs = tokenizer(prompt, return_tensors="pt")

# ── Preset 1: Deterministic (code / SQL / factual Q&A) ────────────────────────
cfg_deterministic = GenerationConfig(
    max_new_tokens=40,
    temperature=0.1,   # near-greedy: almost always picks the highest-prob token
    top_p=1.0,         # no nucleus filtering — temperature alone drives focus
    top_k=0,           # 0 = disabled
    do_sample=True,
)

# ── Preset 2: Balanced chat (general-purpose assistant) ───────────────────────
cfg_chat = GenerationConfig(
    max_new_tokens=40,
    temperature=0.7,   # fluent and varied without being erratic
    top_p=0.9,         # nucleus: keep tokens covering 90% cumulative probability
    top_k=0,           # top_p alone is sufficient — using both is redundant here
    do_sample=True,
)

# ── Preset 3: Creative writing / brainstorming ────────────────────────────────
cfg_creative = GenerationConfig(
    max_new_tokens=40,
    temperature=1.1,   # flatten distribution → more surprising word choices
    top_p=0.95,        # still discard the lowest-probability long tail
    top_k=0,
    do_sample=True,
)

# ── Run all three presets on the same prompt ──────────────────────────────────
with torch.no_grad():
    for name, cfg in [("deterministic", cfg_deterministic),
                      ("chat",          cfg_chat),
                      ("creative",      cfg_creative)]:
        out = model.generate(**inputs, generation_config=cfg)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        print(f"\n[{name}] T={cfg.temperature}, top_p={cfg.top_p}:")
        print(text)

# ── Save and reload GenerationConfig (for audit / reproducibility) ────────────
cfg_chat.save_pretrained("./my_chat_config")   # writes generation_config.json
cfg_loaded = GenerationConfig.from_pretrained("./my_chat_config")
print("\nReloaded temperature:", cfg_loaded.temperature)  # → 0.7

# ── Measuring hallucination risk proxy: entropy of the next-token distribution ─
# High entropy ≈ model is uncertain ≈ higher hallucination risk at high temperature
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]   # logits for the NEXT token
    probs  = torch.softmax(logits / 0.7, dim=-1)  # apply T=0.7
    entropy = -(probs * probs.log()).sum().item()
    print(f"\nNext-token entropy at T=0.7: {entropy:.2f} nats")
    # Higher entropy → more uncertain → more creative but higher hallucination risk

GenerationConfig.save_pretrained() serializes all sampling parameters to generation_config.json — attach this file to your deployment artifact so every model version ships with its documented, tested sampling settings. This directly addresses Lesson 5 from this post.

For a full deep-dive on HuggingFace GenerationConfig and inference optimization, a dedicated follow-up post is planned.


📚 Lessons from Production LLM Systems

Lesson 1: Temperature is the single largest driver of hallucination risk. Every point you raise temperature above 0.3 increases the probability of confidently stated but factually incorrect output. For customer-facing, grounded, or safety-critical applications, keep temperature low and use RAG or tool use for factual grounding.

Lesson 2: Default API settings are tuned for demo purposes, not production. Most LLM APIs default to T = 0.7 or T = 1.0 because that feels responsive in a playground. Production applications almost always need different settings. Profile your specific task before using defaults.

Lesson 3: Top-p is almost always better than Top-k for text generation. Top-k uses a static count; on a narrow distribution it still allows long-tail noise, and on a flat distribution it may be too restrictive. Top-p adapts dynamically to the model's confidence. Use Top-k only when you want strict vocabulary control (e.g., constrained slot filling).

Lesson 4: Structured output mode makes sampling less relevant. Modern APIs (GPT-4o, Claude 3.5) support JSON schema enforcement at the decoding level. When you use structured output mode, the model is forced to emit valid JSON matching your schema regardless of temperature. For structured tasks, enable this mode and let temperature handle only the content variation, not the format.

Lesson 5: Log your sampling parameters with every production request. When debugging unexpected outputs, you need to know the exact temperature and Top-p used. Include these in your request metadata logs. A subtle shift in deployed configuration — even from T = 0.3 to T = 0.5 — can meaningfully change output quality at scale.


📌 TLDR: Summary & Key Takeaways

  • Temperature scales logit differences — lower = more deterministic, higher = more random.
  • Top-k hard-caps the candidate set at k tokens; Top-p soft-caps it by cumulative probability mass.
  • Top-p is more adaptive than Top-k — it adjusts dynamically to the model's confidence level.
  • For production grounded tasks (code, SQL, factual), use T=0 or T≤0.2.
  • High temperature is the single largest driver of hallucination — guard it closely in customer-facing systems.

📝 Practice Quiz

  1. What happens to the probability distribution when Temperature is set below 1.0?

    • A) Probabilities become uniform — all tokens equally likely
    • B) Differences between token probabilities are amplified — the most likely token becomes even more dominant
    • C) The model rejects low-probability tokens entirely
    • D) The model generates shorter responses

    Correct Answer: B — dividing logits by T < 1 sharpens the distribution, making high-probability tokens even more likely and compressing the tail.

  2. Why is Top-p generally preferred over Top-k for dynamic distributions?

    • A) Top-p is always faster to compute
    • B) Top-p adapts the candidate pool size to the model's confidence — narrow when certain, wider when uncertain
    • C) Top-p eliminates hallucination completely
    • D) Top-p is required by all major APIs

    Correct Answer: B — Top-k always keeps exactly k tokens regardless of distribution shape; Top-p adjusts automatically, keeping 2 tokens when the model is confident and 50+ when it's uncertain.

  3. A customer-facing chatbot is returning confidently incorrect medical information. Which parameter change is most likely to help?

    • A) Increase temperature for more diverse outputs
    • B) Decrease temperature and add RAG grounding to keep outputs factually anchored
    • C) Set Top-k to 1
    • D) Set Top-p to 0.99

    Correct Answer: B — high temperature increases hallucination risk; lowering it reduces creative variance, and RAG grounds responses in retrieved facts.

  4. Open-ended challenge: You are building a customer support chatbot that must be helpful and creative in tone but never hallucinate policy details. Design a decoding strategy using temperature, top-p, and any other parameters — justify each choice and explain how you would evaluate whether the balance is right.


Written by Abstract Algorithms (@abstractalgorithms)