LLM Hyperparameters Guide: Temperature, Top-P, and Top-K Explained
Why does ChatGPT sometimes write poetry and sometimes write code? It's all in the settings. We explain how to tune LLMs for creativity vs. precision.
Abstract Algorithms
TLDR: Temperature, Top-p, and Top-k are three sampling controls that determine how "creative" or "deterministic" an LLM's output is. Temperature rescales the probability distribution; Top-k limits the candidate pool by count; Top-p limits it by cumulative probability. Understanding all three – and when to combine them – is essential for production LLM work.
The Three Knobs Behind Every LLM Response
Think of sampling controls as three dials on a mixing board:
- Temperature – the creativity dial. Turn it up for surprising outputs; turn it down for safe, predictable ones. It reshapes the entire probability distribution before any other control runs.
- Top-k – the candidate hard-cap. Keeps only the k most likely tokens in play. Like saying "only the first five dishes on the menu" – everything else is off the table.
- Top-p – the smart filter. Keeps the smallest set of tokens whose combined probability reaches p. Like saying "only dishes that together account for 90% of what you normally order" – the set grows or shrinks based on the model's confidence.
These controls do not change what the model learned. They only change how it samples its next word from what it already knows. Understanding the difference between a model's weights (fixed) and its sampling behavior (configurable) is the first thing to get right.
The Basics: What Sampling Parameters Control
After an LLM finishes processing your prompt, it outputs a probability distribution over its entire vocabulary – potentially 50,000+ tokens. Sampling parameters control how you draw a single token from that distribution.
The three parameters and their roles:
- Temperature – scales the probability distribution. Low values concentrate probability on the top tokens; high values spread it out.
- Top-k – restricts sampling to the k most probable tokens. Hard cutoff by count.
- Top-p (nucleus sampling) – restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Dynamic cutoff by probability mass.
These controls are not model weights – they don't change what the model learned. They change how you sample from its output distribution at inference time. The same model, same prompt, same weights produces dramatically different outputs depending on these settings.
Default behavior without these controls:
| Configuration | Result | Typical symptom |
|---|---|---|
| Temperature = 0 | Always pick the most probable token | Repetitive, deterministic output |
| Temperature = 1 + no Top-k/Top-p | Unconstrained sampling | Occasionally incoherent |
| Temperature = 2 | Near-uniform sampling | Random word salad |
The art of LLM tuning is calibrating these parameters to the task – maximizing quality without sacrificing reliability.
Temperature: Sharpening or Flattening the Probability Curve
After the model computes raw scores (logits) for each candidate token, it applies a softmax to get probabilities. Temperature $T$ scales the logits before softmax:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
| Temperature | Effect | Typical use |
|---|---|---|
| T = 0.0 | Always the most probable token (greedy) | Math, code, factual Q&A |
| T = 0.1–0.4 | Focused, near-deterministic | Structured extraction, SQL generation |
| T = 0.7–0.9 | Balanced creativity | General chat, summarization |
| T = 1.0 | Raw distribution, no scaling | Experimentation baseline |
| T > 1.0 | Flattened distribution – more random | Brainstorming, poetry |
Concrete example – logits for three tokens: Mat=2.0, Rug=1.5, Moon=0.2
| T | P(Mat) | P(Rug) | P(Moon) |
|---|---|---|---|
| 0.5 | ~0.72 | ~0.26 | ~0.02 |
| 1.0 | ~0.56 | ~0.34 | ~0.09 |
| 2.0 | ~0.46 | ~0.36 | ~0.19 |
At T=0.5, "Mat" wins ~72% of the time – focused. At T=2.0, "Moon" steals ~19% – surprising.
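You can reproduce these numbers in a few lines of Python – a minimal sketch of temperature-scaled softmax (the function name is ours):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the example: Mat=2.0, Rug=1.5, Moon=0.2
logits = [2.0, 1.5, 0.2]
for T in (0.5, 1.0, 2.0):
    mat, rug, moon = softmax_with_temperature(logits, T)
    print(f"T={T}: P(Mat)={mat:.2f}  P(Rug)={rug:.2f}  P(Moon)={moon:.2f}")
# T=0.5: P(Mat)=0.72  P(Rug)=0.26  P(Moon)=0.02
# T=1.0: P(Mat)=0.56  P(Rug)=0.34  P(Moon)=0.09
# T=2.0: P(Mat)=0.46  P(Rug)=0.36  P(Moon)=0.19
```

Subtracting the max logit before exponentiating doesn't change the result but prevents overflow at very low temperatures.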
Top-k: Hard Cap on the Candidate Pool
Top-k restricts sampling to only the k highest-probability next tokens. After Temperature scaling, only the top k tokens are kept; all others are set to zero probability (and renormalized).
All tokens: [Mat=0.52, Rug=0.35, Floor=0.09, Moon=0.04, ...]
Top-k = 2: [Mat=0.52, Rug=0.35] → renormalized → [Mat=0.60, Rug=0.40]
Moon is impossible.
| Top-k value | Effect |
|---|---|
| k = 1 | Greedy decoding – always the most probable token |
| k = 10 | Small focused vocabulary, low surprise |
| k = 40–50 | Standard for chat models |
| k = vocabulary size | No restriction – same as not using Top-k |
Trade-off: Top-k is a static threshold. If the distribution is wide (many roughly equal options), k=10 may still allow incoherent tokens. If the distribution is narrow (one dominant token), k=50 may allow long-tail noise.
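The keep-and-renormalize step above can be sketched directly (helper name and token dict are ours, matching the earlier example):

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, drop the rest, renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"Mat": 0.52, "Rug": 0.35, "Floor": 0.09, "Moon": 0.04}
filtered = top_k_filter(probs, 2)
print(filtered)  # Mat ≈ 0.60, Rug ≈ 0.40 – Moon and Floor are impossible
```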
Top-p (Nucleus Sampling): Dynamic Candidate Pool by Mass
Top-p is more adaptive – it includes the smallest set of tokens whose cumulative probability reaches p. The key word is cumulative: the candidate set size changes dynamically.
Sorted: [Mat=0.50, Rug=0.30, Floor=0.10, Moon=0.05, Star=0.03, ...]
Top-p = 0.9:
Cumulative after Mat: 0.50 (< 0.9, continue)
Cumulative after Rug: 0.80 (< 0.9, continue)
Cumulative after Floor: 0.90 (= 0.9, stop)
Nucleus: {Mat, Rug, Floor}
If the model is very confident, nucleus may be just 1โ2 tokens. If uncertain, it may include 50+.
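The cumulative walk above translates to a short loop – a sketch using the same example distribution (helper name is ours):

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of the sorted distribution whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for tok, prob in ranked:
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(nucleus.values())
    return {tok: q / total for tok, q in nucleus.items()}

probs = {"Mat": 0.50, "Rug": 0.30, "Floor": 0.10, "Moon": 0.05, "Star": 0.03}
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # nucleus is {Mat, Rug, Floor}; Moon and Star are cut
```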
```mermaid
graph LR
    A[Logits] --> B[Apply Temperature]
    B --> C[Sort by probability]
    C --> D{Top-k cutoff}
    D --> E{Top-p cumulative check}
    E --> F[Renormalize remaining tokens]
    F --> G[Sample final token]
```
How Token Sampling Flows: The Complete Pipeline
```mermaid
flowchart TD
    A[LLM produces raw logits for all tokens] --> B[Apply Temperature scaling]
    B --> C{Top-k enabled?}
    C -->|Yes| D[Keep top k tokens, zero rest]
    C -->|No| E[Keep all tokens]
    D --> F{Top-p enabled?}
    E --> F
    F -->|Yes| G[Keep smallest set with cumulative prob >= p]
    F -->|No| H[Keep current token set]
    G --> I[Renormalize probabilities to sum to 1]
    H --> I
    I --> J[Sample one token from renormalized distribution]
    J --> K[Append token to output]
    K --> L{End of sequence?}
    L -->|No| A
    L -->|Yes| M[Return complete response]
```
Key observation: Temperature, Top-k, and Top-p are applied in sequence, not simultaneously. Temperature reshapes the entire distribution first; Top-k then hard-caps the candidates; Top-p further filters by cumulative probability. Each step makes the sampling more constrained.
When controls are most impactful:
| Model confidence | Without controls | With Top-p = 0.9 |
|---|---|---|
| High (one clear best token) | Fine – little difference | Very few candidates |
| Medium (several good options) | Good variety | Controlled variety |
| Low (uniform distribution) | Random noise | Trimmed to high-mass tokens only |
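The whole pipeline can be condensed into one self-contained function – a sketch, not a production sampler; function and parameter names are ours:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """One pass of the flowchart: Temperature, Top-k, Top-p, renormalize, sample."""
    # 1. Temperature-scaled softmax (numerically stable)
    scaled = {tok: z / temperature for tok, z in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(z - m) for tok, z in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # 2. Top-k hard cap (0 = disabled)
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-p nucleus: smallest prefix reaching cumulative mass p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 4. Renormalize and draw one token
    mass = sum(p for _, p in kept)
    r, acc = rng.random() * mass, 0.0
    for tok, p in kept:
        acc += p
        if acc >= r:
            return tok
    return kept[-1][0]  # guard against float rounding

random.seed(42)
logits = {"Mat": 2.0, "Rug": 1.5, "Moon": 0.2}
print(sample_token(logits, temperature=0.7, top_k=2, top_p=0.9))
```

Note how the steps run in exactly the order the flowchart shows: temperature reshapes first, then the hard cap, then the nucleus cut.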
How Temperature, Top-k, and Top-p Interact
These three controls are typically applied in sequence: Temperature → Top-k → Top-p.
```python
# OpenAI API – temperature and top_p together
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about databases."}],
    temperature=0.8,
    top_p=0.95,
    # Note: OpenAI API doesn't expose top_k directly; Anthropic and local models do
)
print(response.choices[0].message.content)
```
Interaction gotcha: Setting both Top-k and Top-p is redundant in most cases – whichever is more restrictive wins. Most practitioners use either temperature + top-p OR temperature alone (at T=0 for deterministic tasks).
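The "whichever is more restrictive wins" point can be seen on a toy distribution (helper and variable names are ours):

```python
def restrict(probs, top_k=None, top_p=None):
    """Apply Top-k and/or Top-p to a {token: prob} dict; return surviving tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]                 # hard cap by count
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:                    # nucleus cut by mass
                break
        ranked = kept
    return [tok for tok, _ in ranked]

probs = {"a": 0.40, "b": 0.30, "c": 0.15, "d": 0.10, "e": 0.05}
print(restrict(probs, top_k=4))             # ['a', 'b', 'c', 'd']
print(restrict(probs, top_p=0.7))           # ['a', 'b'] – cumulative mass hits 0.7
print(restrict(probs, top_k=4, top_p=0.7))  # ['a', 'b'] – Top-p was the stricter filter
```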
Sampling Strategy Comparison
```mermaid
flowchart LR
    LG[Logits] --> TS[Temperature Scale]
    TS --> TK[Top-K Filter]
    TK --> TP[Top-P Nucleus]
    TP --> SM[Softmax]
    SM --> SMP[Sample Token]
```
Deep Dive: How Temperature Reshapes the Logit Distribution
Temperature divides raw logit scores before Softmax. Dividing by T < 1 sharpens differences – the highest-logit token dominates further. Dividing by T > 1 flattens them – low-probability tokens gain share. The math: P_i = exp(z_i/T) / Σ exp(z_j/T). This is applied pre-normalization, giving maximum control over distribution shape. The model's weights never change – only the sampling distribution at inference time shifts.
Internals
Temperature T rescales logits before softmax: p_i = exp(z_i/T) / Σ exp(z_j/T). As T→0 the distribution collapses to argmax (greedy); as T→∞ it becomes uniform. Top-p (nucleus) sampling first sorts tokens by probability, then truncates the tail such that the cumulative probability reaches p – the effective vocabulary shrinks dynamically with each token.
Performance Analysis
Temperature and sampling add negligible latency (<1 ms per token) since they operate on logit vectors post-inference. However, high temperature with large top-k (e.g., k=200) increases output variance: in creative tasks this raises diversity by ~40% while increasing factual error rate by ~15–25%. For RAG and tool-calling, T=0 reduces hallucination rate by 30–50% compared to T=0.7.
Real-World Applications of Sampling Control
Sampling parameters are not set-and-forget. Different applications need different profiles:
Customer support chatbots: Use T = 0.3–0.5 for consistent, professional responses. Add Top-p = 0.9 to avoid the rare incoherent outputs that a T = 0.3 with wide vocabulary can produce. Consistency matters more than creativity.
Code generation assistants: Use T = 0.0 or T = 0.1. Code has right and wrong answers. Any randomness above 0.2 increases the chance of syntactically valid but semantically incorrect suggestions. Most production copilots use deterministic sampling for function bodies.
Creative writing tools: Use T = 0.9–1.1 with Top-p = 0.95 for maximum creative variance. The high temperature ensures surprising word choices; the Top-p cap prevents complete incoherence.
Summarization pipelines: Use T = 0.5 with Top-k = 40. Summarization needs coherent output but not creativity. The Top-k cap keeps the vocabulary tight while allowing some variation across runs.
RAG (retrieval-augmented generation) systems: Use T = 0.2–0.3. The retrieved context is doing the creative heavy lifting; the model's job is accurate extraction and synthesis, not generation. High temperature directly increases hallucination risk in this setting.
A/B testing tip: When testing prompt changes, fix your sampling settings. Variable temperature makes it impossible to isolate whether quality changes are from your prompt or from sampling variance.
Practical Sampling Settings by Task
This reference table consolidates the recommended Temperature, Top-p, and Top-k settings for seven common LLM task types in a single scannable view. Real applications rarely serve just one task type – a production endpoint might handle code generation, factual Q&A, and creative writing through the same model, and each task demands a fundamentally different sampling regime. Read the Why column first: it explains the reasoning behind each combination, which is more durable knowledge than memorising specific numbers.
| Task | Temperature | Top-p | Top-k | Why |
|---|---|---|---|---|
| Code generation | 0.1 | – | – | Correctness over creativity |
| SQL query | 0.0 | – | – | Deterministic, one valid answer |
| Factual Q&A | 0.2 | 0.1 | – | Focused but not totally greedy |
| General chat | 0.7 | 0.9 | – | Natural, fluent, slightly varied |
| Summarization | 0.5 | – | 40 | Coherent, limited word choice |
| Creative writing | 0.9 | 0.95 | – | High creative variance |
| Brainstorming | 1.1 | 0.99 | – | Maximum divergence |
The hallucination risk rule: raising temperature increases creativity and hallucination together. For grounded tasks (factual, code), always prefer T < 0.3.
Trade-offs and Failure Modes
| Failure | Root setting | Symptom | Fix |
|---|---|---|---|
| Repetition loop | T too low (< 0.1) | "I am, I am, I am..." | Use repetition_penalty or raise T slightly |
| Hallucination | T too high | Confident false facts | Lower T; add RAG grounding |
| Incoherence | T > 1.5 | Disconnected word salad | Lower T |
| Boring output | T too low + Top-k too small | Formulaic, repetitive text | Raise T or Top-p |
| Wrong schema | T any value | Structured output fails parse | Use T=0 + structured output mode |
Decision Guide: Choosing Sampling Settings by Task
Use the table below as a quick reference when you first approach a new task and don't yet know your ideal settings. The seven-row task table in the previous section gives finer-grained per-task detail; this guide focuses on the broader decision rule per scenario.
| Task type | Temperature | Top-p | Rule of thumb |
|---|---|---|---|
| Factual Q&A / SQL | 0.0–0.2 | – | Determinism beats creativity |
| Code generation | 0.1 | – | One correct answer exists |
| General chat | 0.7 | 0.9 | Balance fluency and variety |
| Creative writing | 0.9–1.1 | 0.95 | Maximize diversity |
| RAG / grounded tasks | ≤ 0.3 | – | Low T reduces hallucination risk |
When unsure, start at T = 0.7 and adjust: raise if outputs feel repetitive, lower if they become erratic or hallucinate.
Hyperparameter Selection Guide
```mermaid
flowchart TD
    TK[Task Type] --> CR{Creative task?}
    CR -- Yes --> HT[High temp 0.8-1.2]
    CR -- No --> FC{Factual task?}
    FC -- Yes --> LT[Low temp 0.1-0.3]
    FC -- No --> CD{Code task?}
    CD -- Yes --> VT[Very low temp 0.0-0.1]
    CD -- No --> MT[Medium temp 0.5 with top-p 0.9]
```
What to Learn Next
- LLM Terms You Should Know: A Helpful Glossary
- Mastering Prompt Templates: System, User, and Assistant Roles
- RAG Explained: How to Give Your LLM a Brain Upgrade
HuggingFace Transformers: Setting Temperature, Top-p, and Top-k in GenerationConfig
HuggingFace Transformers is the open-source Python library that provides GenerationConfig – the canonical object for setting temperature, top-p, top-k, and every other sampling parameter discussed in this post. It is used by researchers, product engineers, and inference servers (vLLM, TGI) alike, making it the universal reference for understanding how sampling parameters translate into actual model behavior.
The GenerationConfig API solves the Lesson 5 problem from this post (log your sampling parameters with every production request): it is a serializable, version-controllable config object that can be saved to disk, loaded from a checkpoint, and attached to any request log for exact reproducibility.
```python
# pip install transformers accelerate torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

model_name = "gpt2"  # swap for "mistralai/Mistral-7B-v0.1" etc. with sufficient VRAM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The most reliable database for financial transactions is"
inputs = tokenizer(prompt, return_tensors="pt")

# ── Preset 1: Deterministic (code / SQL / factual Q&A) ───────────────────────
cfg_deterministic = GenerationConfig(
    max_new_tokens=40,
    temperature=0.1,  # near-greedy: almost always picks the highest-prob token
    top_p=1.0,        # no nucleus filtering – temperature alone drives focus
    top_k=0,          # 0 = disabled
    do_sample=True,
)

# ── Preset 2: Balanced chat (general-purpose assistant) ──────────────────────
cfg_chat = GenerationConfig(
    max_new_tokens=40,
    temperature=0.7,  # fluent and varied without being erratic
    top_p=0.9,        # nucleus: keep tokens covering 90% cumulative probability
    top_k=0,          # top_p alone is sufficient – using both is redundant here
    do_sample=True,
)

# ── Preset 3: Creative writing / brainstorming ───────────────────────────────
cfg_creative = GenerationConfig(
    max_new_tokens=40,
    temperature=1.1,  # flatten distribution → more surprising word choices
    top_p=0.95,       # still discard the lowest-probability long tail
    top_k=0,
    do_sample=True,
)

# ── Run all three presets on the same prompt ─────────────────────────────────
with torch.no_grad():
    for name, cfg in [("deterministic", cfg_deterministic),
                      ("chat", cfg_chat),
                      ("creative", cfg_creative)]:
        out = model.generate(**inputs, generation_config=cfg)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        print(f"\n[{name}] T={cfg.temperature}, top_p={cfg.top_p}:")
        print(text)

# ── Save and reload GenerationConfig (for audit / reproducibility) ───────────
cfg_chat.save_pretrained("./my_chat_config")  # writes generation_config.json
cfg_loaded = GenerationConfig.from_pretrained("./my_chat_config")
print("\nReloaded temperature:", cfg_loaded.temperature)  # → 0.7

# ── Measuring a hallucination-risk proxy: entropy of the next-token distribution
# High entropy → model is uncertain → higher hallucination risk at high temperature
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]    # logits for the NEXT token
    probs = torch.softmax(logits / 0.7, dim=-1)  # apply T=0.7
    entropy = -(probs * probs.log()).sum().item()
print(f"\nNext-token entropy at T=0.7: {entropy:.2f} nats")
# Higher entropy → more uncertain → more creative but higher hallucination risk
```
GenerationConfig.save_pretrained() serializes all sampling parameters to generation_config.json – attach this file to your deployment artifact so every model version ships with its documented, tested sampling settings. This directly addresses Lesson 5 from this post.
For a full deep-dive on HuggingFace GenerationConfig and inference optimization, a dedicated follow-up post is planned.
Lessons from Production LLM Systems
Lesson 1: Temperature is the single largest driver of hallucination risk. Every point you raise temperature above 0.3 increases the probability of confidently stated but factually incorrect output. For customer-facing, grounded, or safety-critical applications, keep temperature low and use RAG or tool use for factual grounding.
Lesson 2: Default API settings are tuned for demo purposes, not production. Most LLM APIs default to T = 0.7 or T = 1.0 because that feels responsive in a playground. Production applications almost always need different settings. Profile your specific task before using defaults.
Lesson 3: Top-p is almost always better than Top-k for text generation. Top-k uses a static count; on a narrow distribution it still allows long-tail noise, and on a flat distribution it may be too restrictive. Top-p adapts dynamically to the model's confidence. Use Top-k only when you want strict vocabulary control (e.g., constrained slot filling).
Lesson 4: Structured output mode makes sampling less relevant. Modern APIs (GPT-4o, Claude 3.5) support JSON schema enforcement at the decoding level. When you use structured output mode, the model is forced to emit valid JSON matching your schema regardless of temperature. For structured tasks, enable this mode and let temperature handle only the content variation, not the format.
Lesson 5: Log your sampling parameters with every production request. When debugging unexpected outputs, you need to know the exact temperature and Top-p used. Include these in your request metadata logs. A subtle shift in deployed configuration – even from T = 0.3 to T = 0.5 – can meaningfully change output quality at scale.
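In practice, Lesson 5 can be as simple as one JSONL line per request – a hypothetical sketch (all function and field names are ours, not from any particular framework):

```python
import json
import time
import uuid

def log_llm_request(prompt, params, output, logfile="llm_requests.jsonl"):
    """Append one JSON line per request so sampling settings are always auditable."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "sampling": params,  # e.g. {"temperature": 0.3, "top_p": 0.9}
        "output": output,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_llm_request(
    "Summarize our refund policy.",
    {"model": "gpt-4o", "temperature": 0.3, "top_p": 0.9},
    "(model response here)",
)
print(rec["sampling"]["temperature"])  # 0.3
```

With the sampling settings in the log line, a quality regression can be traced to a config change rather than guessed at.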
TLDR: Summary & Key Takeaways
- Temperature scales logit differences – lower = more deterministic, higher = more random.
- Top-k hard-caps the candidate set at k tokens; Top-p soft-caps it by cumulative probability mass.
- Top-p is more adaptive than Top-k – it adjusts dynamically to the model's confidence level.
- For production grounded tasks (code, SQL, factual), use T=0 or T ≤ 0.2.
- High temperature is the single largest driver of hallucination – guard it closely in customer-facing systems.
Practice Quiz
What happens to the probability distribution when Temperature is set below 1.0?
- A) Probabilities become uniform – all tokens equally likely
- B) Differences between token probabilities are amplified – the most likely token becomes even more dominant
- C) The model rejects low-probability tokens entirely
- D) The model generates shorter responses
Correct Answer: B – dividing logits by T < 1 sharpens the distribution, making high-probability tokens even more likely and compressing the tail.
Why is Top-p generally preferred over Top-k for dynamic distributions?
- A) Top-p is always faster to compute
- B) Top-p adapts the candidate pool size to the model's confidence – narrow when certain, wider when uncertain
- C) Top-p eliminates hallucination completely
- D) Top-p is required by all major APIs
Correct Answer: B – Top-k always keeps exactly k tokens regardless of distribution shape; Top-p adjusts automatically, keeping 2 tokens when the model is confident and 50+ when it's uncertain.
A customer-facing chatbot is returning confidently incorrect medical information. Which parameter change is most likely to help?
- A) Increase temperature for more diverse outputs
- B) Decrease temperature and add RAG grounding to keep outputs factually anchored
- C) Set Top-k to 1
- D) Set Top-p to 0.99
Correct Answer: B – high temperature increases hallucination proportionally; lowering it reduces creative variance and RAG grounds responses in retrieved facts.
- Open-ended challenge: You are building a customer support chatbot that must be helpful and creative in tone but never hallucinate policy details. Design a decoding strategy using temperature, top-p, and any other parameters – justify each choice and explain how you would evaluate whether the balance is right.