Text Decoding Strategies: Greedy, Beam Search, and Sampling
How does an LLM choose the next word? It's not just random. We explore Greedy Search, Beam Search, and Sampling.
TLDR: An LLM doesn't "write" text: it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy, Beam Search, and Sampling are different rules for that choice. Temperature controls the creativity knob.
TLDR: Greedy decoding is fast but repetitive; beam search improves quality at compute cost; temperature + top-p sampling trades determinism for creativity. Choose based on your task.
The "Finish the Sentence" Game
Imagine flashing the phrase "The dog jumps over the ___" to a crowd:
- 60% shout "Fence"
- 20% shout "Cat"
- 15% shout "River"
- 5% shout "Moon"
The LLM does this for every single token. The question is: do you always pick "Fence" (greedy), do you explore multiple branches (beam search), or do you randomly sample from the crowd (sampling)?
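The crowd game maps directly onto code. Here is a minimal Python sketch of the two extremes, greedy versus sampling, over that toy distribution (the dictionary and function names are illustrative, not real model output):

```python
import random

# Toy next-token distribution for "The dog jumps over the ___"
dist = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}

def greedy_pick(dist):
    """Greedy: always return the single highest-probability token."""
    return max(dist, key=dist.get)

def sample_pick(dist, rng=random):
    """Sampling: draw one token in proportion to its probability."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy_pick(dist))   # always 'fence'
print(sample_pick(dist))   # usually 'fence', occasionally 'cat', 'river', or 'moon'
```

A real model does exactly this, just over a ~50,000-token vocabulary instead of four words.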
Core Concepts: Tokens, Logits, and Probability Distributions
Before diving into how decoding strategies differ, it helps to understand exactly what is happening inside the model at each generation step.
What is a token? A token is not the same as a word. Modern LLMs use sub-word tokenization: common words become single tokens, but rarer or longer words get split into smaller pieces. For example, "unbelievable" might be tokenized as ["un", "believ", "able"]. GPT-series models use a vocabulary of roughly 50,000 tokens. Every time the model generates text, it is choosing the next item from this ~50,000-item menu.
What are logits? After processing the input through all its transformer layers, the model produces one raw score, called a logit, for every token in the vocabulary. These are unnormalized numbers: a logit of 5.0 for "fence" and 2.0 for "cat" tells you the model leans toward "fence," but by how much depends on all 50,000 scores together.
What does softmax do? The softmax function converts the entire logit vector into a valid probability distribution: all values become positive and they sum to exactly 1.0. Only after softmax do we have probabilities we can actually sample from.
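In code, softmax is just exponentiate-and-normalize. A small sketch using the hypothetical "fence"/"cat" logits from above:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for "fence" (5.0) and "cat" (2.0)
probs = softmax([5.0, 2.0])
print(probs)  # approximately [0.953, 0.047]: positive, summing to 1.0
```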
| Concept | Plain-English Meaning | Why It Matters |
| --- | --- | --- |
| Token | Sub-word unit; ~50k in vocabulary | Decoding operates token-by-token |
| Logit | Raw unnormalized score from the model's last layer | Higher logit = more likely token |
| Softmax | Converts logits → valid probabilities summing to 1.0 | Enables probabilistic selection |
| Decoding strategy | Rule for picking the next token from the distribution | Controls quality, diversity, speed |
The key insight: the LLM doesn't "write" text like a human does. It iterates: take the current context, pass it through the model, get a probability distribution over all 50,000 tokens, pick one token according to the decoding strategy, append it to the context, and repeat. That choice rule, the decoding strategy, has an enormous impact on output quality, diversity, and coherence.
Why does the same prompt give different answers? Unless you use pure greedy decoding (temperature T=0), the model makes a probabilistic draw at every token position, so two runs of the same prompt can produce different outputs. This is intentional for creative tasks and something to disable for deterministic ones like SQL generation.
The Three Core Strategies
1. Greedy Decoding
Always pick the single highest-probability token at each step.
Input: "The dog jumps over the"
Step 1: "fence" (highest probability token)
Output: "The dog jumps over the fence."
Pros: Fast, deterministic, reproducible. Cons: Locally optimal choices lead to globally suboptimal sequences. The model can paint itself into a corner: choosing "fence" first makes "and runs away" the natural continuation, even if "cat and runs away together" would have scored higher as a complete sentence.
2. Beam Search
Keep the top-K complete sequences (beams) in parallel. At each step, expand all K beams, score them, and keep only the top K.
K=3 beams after step 1:
Beam 1: "... fence" (log-prob: -0.5)
Beam 2: "... cat" (log-prob: -1.2)
Beam 3: "... river" (log-prob: -1.8)
After step 2 (expand each beam):
Beam 1a: "... fence and ..." (-0.5 + -0.3)
Beam 1b: "... fence to ..." (-0.5 + -0.6)
Beam 2a: "... cat and ..." (-1.2 + -0.2) → might beat 1b!
...
Pros: Better global sequence scores than greedy. Cons: Computationally K× more expensive, and beam outputs tend to be generic and safe, i.e. less creative.
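The branch-keeping logic fits in a few lines of Python. This is a toy sketch: the model, its probabilities, and the names `beam_search`/`toy_lm` are hypothetical, chosen so that a weaker first token wins overall:

```python
import math

def beam_search(next_dist, k=3, steps=2):
    """Minimal beam-search sketch.

    next_dist(seq) returns {token: probability} for the next position.
    Keeps the k best sequences by cumulative log-probability.
    """
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in next_dist(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Prune: keep only the k highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Hypothetical two-step model: "fence" is the best first token,
# but "cat" opens a much stronger continuation.
def toy_lm(seq):
    if not seq:
        return {"fence": 0.60, "cat": 0.20, "river": 0.15}
    if seq[-1] == "cat":
        return {"and": 0.90}
    return {"and": 0.25, "to": 0.20}

best_seq, best_score = beam_search(toy_lm, k=3)[0]
print(best_seq)  # ('cat', 'and')
```

Greedy would have committed to "fence" at step one (0.60 × 0.25 = 0.15); beam search recovers the better overall sequence "cat and" (0.20 × 0.90 = 0.18) at the cost of scoring every beam's continuations.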
3. Sampling
Draw a token randomly according to the probability distribution, not always the top one.
Pros: Diverse, creative outputs. Reduces repetition. Cons: Can produce incoherent text if the probability distribution is too flat.
Top-K Sampling
Only sample from the K highest-probability tokens. Prevents picking absurdly rare tokens.
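A minimal sketch of top-K filtering over the crowd distribution; real implementations operate on logit tensors, not dicts, but the idea is the same:

```python
def top_k_filter(dist, k=2):
    """Keep only the k highest-probability tokens and renormalize."""
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in items)
    return {tok: p / total for tok, p in items}

dist = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}
print(top_k_filter(dist, k=2))  # fence ≈ 0.75, cat ≈ 0.25 after renormalizing
```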
Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability ≥ P.
P=0.9: include tokens until you've covered 90% of probability mass
If vocab: [fence: 60%, cat: 20%, river: 15%, moon: 5%]
Top-P=0.9 set: [fence(60%), cat(20%), river(15%)] → sample from these 3
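The same example in Python; again a sketch over a {token: prob} dict rather than real logits:

```python
def top_p_filter(dist, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize the survivors."""
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in items:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

dist = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}
print(top_p_filter(dist, p=0.9))
# keeps fence, cat, river (cumulative 0.95 ≥ 0.9); 'moon' is dropped
```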
| Strategy | Diversity | Quality | Use Case |
| --- | --- | --- | --- |
| Greedy | None | High | Factual Q&A, code generation |
| Beam Search (K=5) | Low | High | Translation, summarization |
| Top-K Sampling | Medium | Medium | Creative writing |
| Top-P Sampling | High | Medium-High | Dialogue, creative tasks |
Beam Search Step-by-Step Sequence
```mermaid
sequenceDiagram
    participant M as Model
    participant B as Beam Buffer
    M->>B: Position 1: generate top-K tokens
    B->>B: Keep K=3 beams with scores
    M->>B: Position 2: expand each beam
    B->>B: Score all K² candidates
    B->>B: Prune: keep top-K by cumulative score
    M->>B: Position 3: expand K beams again
    B->>B: Prune to top-K
    B->>M: Return highest-score complete sequence
```
This sequence diagram illustrates how beam search avoids the single-path trap of greedy decoding. At each position the Model generates tokens, and the Beam Buffer scores all K² candidate continuations and prunes back to K beams, so even if the initially most likely token leads to a poor continuation, a lower-probability first choice can still survive if its overall cumulative score is competitive. The takeaway: beam search finds globally better sequences, but at K× the compute cost and with a tendency toward safe, generic output.
Trade-offs & Failure Modes: When Strategies Go Wrong
Each strategy has failure patterns worth knowing before you deploy:
| Mistake | Strategy | Problem | Fix |
| --- | --- | --- | --- |
| Greedy for long creative text | Greedy | Repetitive loops, "dead ends" | Switch to Top-P sampling |
| High temperature for SQL/code | T > 1.0 | Invalid tokens inserted | Use T = 0 with structured schema |
| Beam search for chat responses | Beam K=5 | Generic, unnatural dialogue | Use Top-P with T=0.7 instead |
| Top-P too narrow (P=0.1) for stories | Sampling | Boring, repetitive output | Raise P to 0.85–0.95 |
| Temperature set once and never tuned | Any | Wrong quality/creativity balance | Tune per task type |
Temperature: The Creativity Dial
Temperature $T$ reshapes the probability distribution before sampling:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Where $z_i$ are the raw logit scores.
| Temperature | Effect | Output Character |
| --- | --- | --- |
| $T = 0$ | Greedy (argmax): one token wins entirely | Deterministic, repetitive |
| $T = 1.0$ | Standard distribution (no change) | Balanced |
| $T < 1$ (e.g., 0.3) | Sharper distribution: top token dominates even more | Focused, precise |
| $T > 1$ (e.g., 1.8) | Flatter distribution: rare tokens become more likely | Creative, potentially incoherent |
Numerical example at $T = 0.5$ vs $T = 2.0$ for two tokens A (logit 2.0) and B (logit 1.0):
| Temperature | P(A) | P(B) | Interpretation |
| --- | --- | --- | --- |
| 1.0 | 73% | 27% | Standard |
| 0.5 | 88% | 12% | Strongly favors A |
| 2.0 | 62% | 38% | B nearly as likely as A |
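The table can be reproduced in a few lines of Python, using the logits 2.0 and 1.0 from above:

```python
import math

def temp_softmax(logits, T):
    """Softmax after dividing each logit by the temperature T."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Tokens A (logit 2.0) and B (logit 1.0)
for T in (1.0, 0.5, 2.0):
    pA, pB = temp_softmax([2.0, 1.0], T)
    print(f"T={T}: P(A)={pA:.0%}, P(B)={pB:.0%}")
# T=1.0: P(A)=73%, P(B)=27%
# T=0.5: P(A)=88%, P(B)=12%
# T=2.0: P(A)=62%, P(B)=38%
```

Lowering T widens the gap between A and B; raising T shrinks it, which is exactly the sharpening/flattening effect described above.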
```mermaid
flowchart LR
    Logits["Raw Logits z_i"] --> Temp["Divide by T"]
    Temp --> Softmax["Softmax → Probabilities"]
    Softmax --> Decode["Decoding Strategy (greedy / top-k / nucleus)"]
    Decode --> Token["Selected Token"]
```
This flowchart shows the three-step temperature pipeline: raw logits are divided by T (sharpening or flattening the distribution), softmax converts them into valid probabilities, and the chosen decoding strategy selects a token. Temperature is applied before sampling, which means it amplifies or suppresses the effect of top-p and top-k filtering downstream. At T→0 the softmax output collapses to a one-hot vector (greedy); at T>1 the distribution flattens and even low-probability tokens become viable candidates.
Decoding Strategy Selection: Pipeline Flow
Each token generation step applies the selected decoding strategy to transform raw logits into the next output token.
```mermaid
flowchart LR
    A[LLM Logits] --> B{Decoding Strategy?}
    B -->|Greedy| C[argmax token]
    B -->|Beam Search| D[Expand k beams]
    B -->|Sampling| E[Apply Temperature]
    E --> F{top-p or top-k filter}
    F --> G[Sample from distribution]
    D --> H[Prune to k beams]
    C --> I[Next Token]
    G --> I
    H --> I
```
Decision Guide: Decoding Strategy Selection
Choosing the right decoding strategy depends on three questions about your task: Does it need creativity? Does it need speed? Does it need globally optimal sequence quality? The decision tree below maps those constraints to concrete settings.
```mermaid
flowchart TD
    Task["Define Generation Task"]
    Creative{"Is creativity desired?"}
    Speed{"Is speed critical?"}
    Quality{"Is sequence quality critical?"}
    Greedy["Greedy Decoding<br/>T=0, deterministic<br/>Fastest"]
    Beam["Beam Search K=5<br/>Better global quality<br/>Slower"]
    TopP["Top-P Sampling<br/>P=0.9, T=0.8<br/>Diverse and creative"]
    TopK["Top-K Sampling<br/>K=40, T=1.0<br/>Moderate diversity"]
    Task --> Creative
    Creative -->|No| Speed
    Creative -->|Yes| TopP
    Speed -->|Yes| Greedy
    Speed -->|No| Quality
    Quality -->|Yes| Beam
    Quality -->|No| TopK
```
Figure: Decision tree for selecting a decoding strategy based on your task's creativity, speed, and quality requirements.
How to use this tree in practice: Start at "Define Generation Task" and answer each question honestly. A SQL generator needs neither creativity nor beam quality: greedy wins. A chatbot needs creativity: go Top-P. An offline translation pipeline needs quality and can afford extra compute: beam search. A game dialogue system needs moderate diversity without the beam overhead: Top-K. Most production APIs default to Top-P sampling with temperature tuning and do not expose beam search directly; knowing where your use case sits on this tree helps you configure temperature and top-p values intelligently.
Real-World Applications of Decoding Strategies
Different applications sit in very different parts of the decision tree above. Here is how major production use cases map to strategies:
| Application | Recommended Strategy | Why |
| --- | --- | --- |
| Machine translation | Beam Search (K=4–8) | Sequence-level quality matters; a near-correct reference answer exists |
| Creative story writing | Top-P=0.9, T=0.9–1.1 | Diversity and surprise are desirable; no single correct continuation |
| SQL generation | Greedy or T=0.0–0.1 | SQL is a deterministic task; randomness introduces syntax errors |
| Chatbot dialogue | Top-P=0.9, T=0.7 | Natural conversational variance; avoids robotic repetition |
| Code completion | Greedy or T=0.1, top-p=0.95 | Code has correctness constraints; focused output reduces hallucination |
| Factual Q&A | T=0.0–0.2 | Factual answers have a correct form; randomness only adds noise |
A note on production APIs: Most hosted LLM APIs (OpenAI, Anthropic, Google Gemini) do not expose beam search as a parameter. Their defaults are top-p sampling with temperature tuning. Beam search remains dominant in offline pipelines, such as machine translation frameworks (MarianMT, fairseq) and speech recognition (Whisper), where running K beams is acceptable and sequence-level optimality is paramount.
One important misconception to clear up: setting temperature=0 in the OpenAI API gives you greedy-like output, not beam search. Many developers assume low temperature equals beam search; it does not. Beam search explores multiple sequence branches in parallel and scores them globally; temperature=0 just makes greedy decoding deterministic by collapsing the softmax to an argmax.
Decoding Strategy Latency vs Diversity
```mermaid
flowchart TD
    Token["Next Token Position"]
    Greedy["Greedy: argmax(logits)<br/>one token, fastest"]
    BeamK["Beam Search: K=5<br/>K parallel sequences<br/>K× slower, best quality"]
    TopK["Top-K: sample from K<br/>balanced diversity/speed"]
    TopP["Top-P: nucleus 0.9<br/>dynamic vocab, diverse"]
    Temp["Temperature T<br/>reshapes distribution<br/>applied before sampling"]
    Token --> Greedy
    Token --> BeamK
    Token --> TopK
    Token --> TopP
    TopK --> Temp
    TopP --> Temp
```
This diagram compares all four decoding strategies along the latency-diversity axis simultaneously. Greedy is fastest but produces no diversity; Beam Search is slowest and offers the best global sequence quality but also the least variety; Top-K and Top-P sampling occupy the middle ground: Top-P dynamically adjusts the candidate vocabulary per step while Top-K uses a fixed cutoff. Temperature feeds into both sampling methods, making it the shared dial that amplifies or dampens diversity within whatever nucleus or top-K constraint you have set.
Exploring Decoding Strategies in Practice
Exercise 1: Greedy vs. Sampling Side-by-Side
Use the HuggingFace transformers library to compare strategies on the same prompt:
```python
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
prompt = "The future of renewable energy is"

# Greedy decoding: deterministic, same result every run
greedy = gen(prompt, max_new_tokens=50, do_sample=False)

# Top-P nucleus sampling: varied output each run
sampled = gen(prompt, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.8)

print("Greedy:", greedy[0]["generated_text"])
print("Sampled:", sampled[0]["generated_text"])
```
Run this five times. The greedy output is identical every run; the sampled output varies. Greedy outputs often start looping or repeating phrases after a few sentences, the classic symptom of locally optimal but globally poor token choices.
Exercise 2: Temperature Spectrum
Generate five completions at T=0.1, T=0.7, and T=1.5 for the prompt "The future of AI is". Observe: at T=0.1 the output is near-deterministic and focused but may repeat phrases; at T=0.7 there is a balanced diversity-coherence trade-off; at T=1.5 the output becomes creative but risks incoherence or going off-topic. This exercise builds direct intuition for the creativity-vs-coherence dial that temperature controls.
Exercise 3: Nucleus Size with the OpenAI API
Use the OpenAI API to test top_p=0.1 versus top_p=0.95 on a creative prompt such as "Write the opening line of a science fiction novel set on Europa". At top_p=0.1 sampling is restricted to the top 10% of probability mass, so outputs are safe and repetitive. At top_p=0.95 the model ranges across 95% of probable continuations, so outputs vary dramatically. Note how top-p and temperature interact: high top-p gives temperature room to explore; low top-p limits exploration regardless of temperature setting.
Deep Dive: Practical Defaults by Task
| Task | Recommended Settings |
| --- | --- |
| Code generation | Greedy or T=0.1, top-p=0.95 |
| Factual Q&A | T=0.0–0.2 (deterministic) |
| Creative writing | T=0.8–1.2, top-p=0.9 |
| API calls / JSON output | T=0.0 + strict output schema |
| Dialogue/chatbot | T=0.7, top-p=0.9 |
Internals
Greedy decoding selects the argmax at each step: $t_i = \arg\max_t P(t \mid t_1, \ldots, t_{i-1})$. Beam search maintains $k$ partial sequences ("beams"), expanding each at every step and pruning to keep only the top $k$ by cumulative log-probability. Sampling draws $t_i \sim P(\cdot \mid \text{context})$ after optional temperature scaling and vocabulary truncation (top-k/top-p), breaking the deterministic mode-seeking behavior of greedy decoding and beam search.
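Putting these pieces together, a toy version of the full generation loop might look as follows. The `next_logits` interface is hypothetical and stands in for a real model's forward pass, which would return a tensor over the whole vocabulary:

```python
import math
import random

def decode(next_logits, max_new_tokens, T=1.0, top_p=1.0, greedy=False, rng=random):
    """Sketch of the per-token loop: scale by T, softmax, truncate, pick."""
    context = []
    for _ in range(max_new_tokens):
        logits = next_logits(context)
        if greedy:
            tok = max(logits, key=logits.get)  # argmax: deterministic
        else:
            # Temperature scaling, then softmax
            exps = {t: math.exp(z / T) for t, z in logits.items()}
            total = sum(exps.values())
            ranked = sorted(((t, e / total) for t, e in exps.items()),
                            key=lambda kv: kv[1], reverse=True)
            # Nucleus truncation: smallest prefix covering top_p mass
            kept, cum = [], 0.0
            for t, p in ranked:
                kept.append((t, p))
                cum += p
                if cum >= top_p:
                    break
            toks, weights = zip(*kept)
            tok = rng.choices(toks, weights=weights, k=1)[0]
        context.append(tok)
    return context

# Hypothetical two-token "model": greedy always picks the higher logit.
toy = lambda ctx: {"a": 2.0, "b": 1.0}
print(decode(toy, 3, greedy=True))  # ['a', 'a', 'a']
```

With `top_p=0.5` on this toy model, the nucleus collapses to just "a" (its probability already exceeds 0.5), so sampling becomes deterministic too, illustrating how aggressive truncation overrides temperature.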
Performance Analysis
Greedy decoding runs at peak throughput (~100–150 tokens/second on an A100 for a 7B model) since it processes one candidate per step. Beam search with k=5 is ~4× slower due to expanded batch size and is prone to repetition on open-ended tasks. Nucleus sampling (top-p=0.9) matches greedy throughput and scores 20–30% higher on human preference evaluations for creative and conversational tasks.
HuggingFace Transformers: Controlling Decoding with GenerationConfig
HuggingFace Transformers is the leading open-source Python library for loading and running transformer models. Its GenerationConfig object centralizes every decoding parameter (do_sample, temperature, top_p, num_beams) into a single, versionable configuration that can be saved alongside the model checkpoint and swapped without reloading the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# --- Strategy 1: Greedy -- deterministic, ideal for SQL / code generation ---
greedy_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,   # argmax at every step: T→0 behaviour
    temperature=1.0,
)

# --- Strategy 2: Beam Search -- high sequence quality, good for translation ---
beam_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=False,
    num_beams=4,          # keep 4 candidate sequences in parallel
    early_stopping=True,
)

# --- Strategy 3: Nucleus Sampling -- creative, good for chat / story generation ---
nucleus_cfg = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,   # slightly sharpen the distribution
    top_p=0.92,        # sample from top-92% probability mass
    top_k=0,           # disable top-k; use pure nucleus sampling
)

prompt = "Explain why the sky is blue in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for label, cfg in [("Greedy", greedy_cfg), ("Beam", beam_cfg), ("Nucleus", nucleus_cfg)]:
    with torch.inference_mode():
        output = model.generate(**inputs, generation_config=cfg)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\n[{label}]\n{text[len(prompt):]}")
```
GenerationConfig cleanly separates your decoding strategy from model loading: you can A/B test strategies in production by swapping configs without touching model weights. Save the winning configuration with cfg.save_pretrained("./configs/nucleus") so results are fully reproducible across environments and deployments.
For a full deep-dive on HuggingFace Transformers generation internals, constrained decoding, and fine-tuning workflows, a dedicated follow-up post is planned.
Five Lessons Every LLM Developer Should Internalize
Greedy decoding is suboptimal for long sequences. Each locally best token constrains what follows: the globally best sentence may require starting with a slightly lower-probability token. This is why greedy outputs often feel stilted or start looping after a few sentences.
Beam search improves global sequence quality but produces "safe" text. It is the right tool for translation and summarization where a near-correct answer exists, but it actively penalizes diversity. Do not use it for open-ended creative generation.
Top-P (nucleus) sampling is the most common production default. It balances quality and diversity by concentrating sampling in the high-probability region while still allowing variation. Most API defaults sit at P=0.9–0.95.
Temperature and top-p are complementary, not alternatives. Temperature reshapes the entire probability distribution (sharpening or flattening it); top-p then truncates the distribution after reshaping. Use both together deliberately rather than treating them as the same knob.
For factual or structured output, use T=0–0.2. SQL, JSON, code, and factual Q&A all have correct answers. Randomness only introduces noise. Near-zero temperature is the appropriate default; beam search can be layered on top in offline pipelines where sequence-level quality is paramount.
TLDR: Summary & Key Takeaways
- Greedy = always pick the best token; fast but suboptimal.
- Beam Search = keep top-K candidate sequences; better quality, more compute.
- Top-P (Nucleus) Sampling = sample from the smallest set covering P% of probability mass; best for creative tasks.
- Temperature reshapes the distribution: low $T$ → focused, high $T$ → creative/risky.
Practice Quiz
Why can greedy decoding produce poor-quality long text despite always choosing the highest-probability token?
- A) Because it picks the top token too slowly, introducing latency errors.
- B) Because each locally optimal choice constrains future choices: the globally best sentence may require a lower-probability first token.
- C) Because it ignores the model's attention weights during generation.
- D) Because greedy decoding skips every other token to improve speed.
Correct Answer: B. Greedy decoding is a local optimization that can lead to globally suboptimal sequences; a lower-probability first token might open better continuations overall.
Setting temperature T=0 is equivalent to which decoding strategy?
- A) Beam Search with K=1.
- B) Greedy decoding: the softmax with T→0 concentrates all probability mass on the single highest-logit token.
- C) Top-P sampling with P=0.5.
- D) Top-K sampling with K=1 and random selection.
Correct Answer: B. As T approaches 0, the softmax becomes a hard argmax: the model always picks the single most likely token, making generation deterministic.
You are generating a SQL query from a natural language prompt. Which settings minimize hallucination?
- A) T=1.5 with top-p=0.99 for maximum exploration of SQL patterns.
- B) T=0.0–0.2 with a strict output schema: SQL has a correct answer; randomness only adds errors.
- C) Beam Search with K=5 always produces the best SQL regardless of temperature.
- D) Top-K sampling with K=5 for moderate diversity.
Correct Answer: B. SQL generation is a deterministic task with a correct answer; near-zero temperature minimizes the probability of random token substitutions that create invalid SQL.
- Open-ended challenge: You are evaluating decoding strategies for three production use cases: (1) a code autocomplete tool, (2) a creative story generator, and (3) a factual QA system. For each, specify the decoding strategy, key parameter values, and the evaluation metric you would use to validate the choice.
Related Posts
- Prompt Engineering Guide: Zero-Shot to Chain-of-Thought
- Embeddings Explained
- RAG Explained: How to Give Your LLM a Brain Upgrade
- How to Build Apps with LangChain and LLMs

Written by
Abstract Algorithms
@abstractalgorithms