
Text Decoding Strategies: Greedy, Beam Search, and Sampling

How does an LLM choose the next word? It's not just random. We explore Greedy Search, Beam Search...

Abstract Algorithms · 5 min read

TLDR: An LLM doesn't "write" text; it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy, Beam Search, and Sampling are different rules for that choice. Temperature is the creativity knob.


📖 The "Finish the Sentence" Game

Imagine flashing the phrase "The dog jumps over the ___" to a crowd:

  • 60% shout "Fence"
  • 20% shout "Cat"
  • 15% shout "River"
  • 5% shout "Moon"

The LLM does this for every single token. The question is: do you always pick "Fence" (greedy), do you explore multiple branches (beam search), or do you randomly sample from the crowd (sampling)?


🔢 The Three Core Strategies

1. Greedy Decoding

Always pick the single highest-probability token at each step.

Input: "The dog jumps over the"
Step 1: "fence" (highest probability token)
Output: "The dog jumps over the fence."

Pros: Fast, deterministic, reproducible. Cons: locally optimal choices can produce globally suboptimal sequences. The model can paint itself into a corner: committing to "fence" at step 1 discards every continuation that begins with a lower-probability first token, even if one of those would have scored higher as a complete sentence.
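As a minimal sketch (plain Python, not any particular library's API), greedy decoding over the toy crowd distribution from above, assuming the model has already produced next-token probabilities:

```python
# Toy next-token distribution for "The dog jumps over the ___"
# (the crowd's votes from the example above).
next_token_probs = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}

def greedy_pick(probs):
    """Greedy decoding: always take the single most likely token."""
    return max(probs, key=probs.get)

print(greedy_pick(next_token_probs))  # always "fence", on every run
```

Because there is no randomness, repeated calls with the same distribution always return the same token.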

2. Beam Search

Keep the top-K partial sequences (beams) in parallel. At each step, expand all K beams, score the expansions, and keep only the top K.

K=3 beams after step 1:
  Beam 1: "... fence" (log-prob: -0.5)
  Beam 2: "... cat"   (log-prob: -1.2)
  Beam 3: "... river" (log-prob: -1.8)

After step 2 (expand each beam):
  Beam 1a: "... fence and ..."  (-0.5 + -0.3 = -0.8)
  Beam 1b: "... fence to ..."   (-0.5 + -0.6 = -1.1)
  Beam 2a: "... cat and ..."    (-1.2 + -0.2 = -1.4)  ← could overtake 1b in later steps!
  ...

Pros: Better global sequence scores than greedy. Cons: roughly K× more compute, and beam outputs tend to be generic and safe, hence less creative.
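A minimal beam-search sketch over a hypothetical per-step scoring table (the log-probabilities below are invented to match the worked example; a real decoder would query the model for them at every step):

```python
# Hypothetical per-step log-probs: maps the last token of a partial
# sequence to its candidate continuations. Invented numbers, for
# illustration only -- a real model would supply these.
STEP_LOGPROBS = {
    "<s>":   {"fence": -0.5, "cat": -1.2, "river": -1.8},
    "fence": {"and": -0.3, "to": -0.6},
    "cat":   {"and": -0.2, "then": -1.0},
    "river": {"and": -0.9, "bank": -1.1},
}

def beam_search(start="<s>", k=3, steps=2):
    # Each beam is (token_list, cumulative log-prob).
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            expansions = STEP_LOGPROBS.get(tokens[-1], {})
            if not expansions:              # nothing to expand: carry beam over
                candidates.append((tokens, score))
                continue
            for tok, lp in expansions.items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the K highest-scoring sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

for tokens, score in beam_search():
    print(" ".join(tokens[1:]), round(score, 2))
# "fence and" (-0.8) wins; "fence to" (-1.1) and "cat and" (-1.4) survive as beams
```

Note that the final ranking is over cumulative sums, which is exactly why a token that was not the step-1 winner ("cat") can stay in contention.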

3. Sampling

Draw a token at random according to the probability distribution, rather than always taking the top one.

Pros: Diverse, creative outputs. Reduces repetition. Cons: Can produce incoherent text if the probability distribution is too flat.

Top-K Sampling

Only sample from the K highest-probability tokens. Prevents picking absurdly rare tokens.

Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability is ≥ P.

P=0.9: include tokens until you've covered 90% of the probability mass
If vocab: [fence: 60%, cat: 20%, river: 15%, moon: 5%]
Top-P=0.9 set: [fence (60%), cat (20%), river (15%)] → sample from these 3
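Both truncation rules can be sketched in a few lines of plain Python, assuming token probabilities are already computed (`random.choices` does the weighted draw):

```python
import random

probs = {"fence": 0.60, "cat": 0.20, "river": 0.15, "moon": 0.05}

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass is >= p."""
    kept, cum = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

def sample(probs):
    """Weighted random draw from a (token -> probability) dict."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

print(top_p_filter(probs, 0.9))        # fence, cat, river survive; moon is cut
print(sample(top_k_filter(probs, 2)))  # draws "fence" or "cat", never the tail
```

Note that both filters renormalize after truncation, so the surviving tokens again sum to probability 1 before sampling.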
| Strategy | Diversity | Quality | Use Case |
| --- | --- | --- | --- |
| Greedy | None | High | Factual Q&A, code generation |
| Beam Search (K=5) | Low | High | Translation, summarization |
| Top-K Sampling | Medium | Medium | Creative writing |
| Top-P Sampling | High | Medium-High | Dialogue, creative tasks |

โš™๏ธ Temperature: The Creativity Dial

Temperature $T$ reshapes the probability distribution before sampling:

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Where $z_i$ are the raw logit scores.

| Temperature | Effect | Output Character |
| --- | --- | --- |
| $T = 0$ | Greedy (argmax): one token wins entirely | Deterministic, repetitive |
| $T = 1.0$ | Standard distribution (no change) | Balanced |
| $T < 1$ (e.g., 0.3) | Sharper distribution: top token dominates even more | Focused, precise |
| $T > 1$ (e.g., 1.8) | Flatter distribution: rare tokens become more likely | Creative, potentially incoherent |

Numerical example for two tokens A (logit 2.0) and B (logit 1.0) at several temperatures:

| Temperature | P(A) | P(B) | Interpretation |
| --- | --- | --- | --- |
| 1.0 | 73% | 27% | Standard |
| 0.5 | 88% | 12% | Very confident in A |
| 2.0 | 62% | 38% | B nearly as likely as A |
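These probabilities can be computed directly from the softmax formula above; a plain-Python sketch, no ML library assumed:

```python
import math

def softmax_with_temperature(logits, t):
    """P_i = exp(z_i / t) / sum_j exp(z_j / t), with temperature t > 0."""
    scaled = [z / t for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # token A, token B
for t in (1.0, 0.5, 2.0):
    pa, pb = softmax_with_temperature(logits, t)
    print(f"T={t}: P(A)={pa:.0%}, P(B)={pb:.0%}")
# T=1.0: P(A)=73%, P(B)=27%
# T=0.5: P(A)=88%, P(B)=12%
# T=2.0: P(A)=62%, P(B)=38%
```

Dividing the logits by $T < 1$ widens the gap between them before the softmax, which is exactly why low temperature concentrates mass on the argmax.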
The full pipeline, from raw logits to a selected token:

flowchart LR
    Logits["Raw Logits z_i"] --> Temp["Divide by T"]
    Temp --> Softmax["Softmax → Probabilities"]
    Softmax --> Decode["Decoding Strategy\n(greedy / top-k / nucleus)"]
    Decode --> Token["Selected Token"]

🧠 Practical Defaults

| Task | Recommended Settings |
| --- | --- |
| Code generation | Greedy, or T=0.1 with top-p=0.95 |
| Factual Q&A | T=0.0–0.2 (deterministic) |
| Creative writing | T=0.8–1.2, top-p=0.9 |
| API calls / JSON output | T=0.0 + strict output schema |
| Dialogue/chatbot | T=0.7, top-p=0.9 |

📌 Summary

  • LLMs produce a probability distribution over the vocabulary at each step; decoding strategy picks the token.
  • Greedy = always pick the best token; fast but suboptimal globally.
  • Beam Search = keep top-K candidate sequences; better quality, more compute.
  • Top-P (Nucleus) Sampling = sample from the smallest set covering P% of probability mass; best for creative tasks.
  • Temperature reshapes the distribution: low $T$ → focused, high $T$ → creative/risky.

๐Ÿ“ Practice Quiz

  1. Why can greedy decoding produce poor-quality long text despite always choosing the highest-probability token?

    • A) Because it picks the top token too slowly.
    • B) Because each locally optimal choice constrains future choices: the globally best sentence may require a lower-probability first token.
    • C) Because it ignores the model's attention weights.
      Answer: B
  2. Setting temperature $T = 0$ is equivalent to which decoding strategy?

    • A) Beam Search with K=5.
    • B) Greedy decoding: the softmax with $T \to 0$ gives all probability mass to the argmax token.
    • C) Top-P sampling with P=0.5.
      Answer: B
  3. You are generating a SQL query from a natural language prompt. Which settings minimize hallucination?

    • A) T=1.5 with top-p=0.99 for maximum exploration.
    • B) T=0.0–0.2 (near-greedy) with a strict output schema: SQL has a correct answer; randomness only adds errors.
    • C) Beam Search with K=5 always produces the best SQL.
      Answer: B

Written by Abstract Algorithms (@abstractalgorithms)