LLM Hyperparameters Guide: Temperature, Top-P, and Top-K Explained


TL;DR

Hyperparameters are the knobs you turn before generating text. Temperature controls randomness (creativity vs. focus). Top-P controls the vocabulary pool (diversity). Frequency Penalty stops the model from repeating itself. Knowing how to tune these is the difference between a hallucinating bot and a reliable coding assistant.


1. The "Knobs" of Generation (The No-Jargon Explanation)

Imagine an LLM as a jazz musician.

  • Temperature: How much improvisation is allowed?
    • Low (0.1): Play the sheet music exactly as written. (Boring, precise).
    • High (0.9): Go wild! Play random notes! (Creative, risky).
  • Top-P (Nucleus Sampling): How many different notes can you choose from?
    • Low (0.1): Only pick from the top 3 safest notes.
    • High (0.9): You can pick from almost any note in the scale.

2. Deep Dive: The Core Parameters

A. Temperature (0.0 to 2.0)

Controls the randomness of the next token prediction.

  • Math: It scales the logits before the Softmax function (see the sketch after this list).

    $$ P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$

  • Effect:
    • Low ($T < 0.5$): The model becomes confident. It picks the most likely word almost every time.
    • High ($T > 1.0$): The model flattens the probability curve. Rare words become more likely.
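
To make the formula concrete, here is a minimal NumPy sketch of temperature scaling; the logit values are made up purely for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by T, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # stability trick; does not change the result
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0]            # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))   # sharp: the top token takes nearly all the mass
print(softmax_with_temperature(logits, 1.5))   # flat: rarer tokens gain probability
```

With $T = 0.2$ the first token ends up with essentially all of the probability; with $T = 1.5$ the distribution flattens and the other tokens become plausible picks.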

B. Top-P (0.0 to 1.0)

Also known as Nucleus Sampling. Instead of picking from all words, the model only considers the smallest set of top words whose cumulative probability reaches $P$ (see the sketch below).

  • Effect:
    • Low ($P = 0.1$): Only considers the top 1-2 words. Very deterministic.
    • High ($P = 0.9$): Considers a wide range of words. More diverse vocabulary.
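
As a rough sketch of what nucleus filtering does, here is a toy implementation over a hand-made probability distribution (the numbers are illustrative, not from a real model):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]               # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()              # renormalise inside the nucleus

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_p_filter(probs, p=0.9))   # the two 0.05 tokens are dropped
```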

Pro Tip: Generally, change either Temperature or Top-P, but not both. They do similar things.

C. Top-K (Integer, e.g., 40)

Limits the model to pick from the top $K$ most likely words.

  • Effect: Hard cutoff. Even if the $(K+1)$th word is perfectly valid, it is ignored. This prevents the model from going completely off the rails into nonsense.
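
The Top-K version of the same toy sketch is a plain hard cutoff (again with illustrative values rather than real model outputs):

```python
import numpy as np

def top_k_filter(probs, k=40):
    """Zero out everything except the k most likely tokens, then renormalise."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, k=2))     # only the 0.5 and 0.3 tokens survive
```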

D. Frequency/Presence Penalty (-2.0 to 2.0)

  • Frequency Penalty: Penalizes words based on how many times they have already appeared in the text. (Stops: "and and and and").
  • Presence Penalty: Penalizes words if they have appeared at least once. (Encourages new topics).
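
Exact formulas differ between providers, but one common form, sketched below with made-up token IDs and scores, subtracts count × frequency penalty plus a flat presence penalty from every token that has already appeared:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appeared in the output."""
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            adjusted[token_id] -= count * frequency_penalty   # grows with every repeat
            adjusted[token_id] -= presence_penalty            # flat, applied once seen
    return adjusted

logits = {101: 3.2, 102: 2.8, 103: 1.0}   # hypothetical token id -> score
history = [101, 101, 101, 102]            # token 101 is looping ("and and and")
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3))
```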

3. Control Parameters: Length & Sampling

These parameters control how and how much the model generates.

A. do_sample (True/False)

  • True: The model picks the next word randomly based on probabilities (using Temperature/Top-P).
  • False: The model always picks the single most likely word (Greedy Search).
  • Usage: Set to False for math/coding where there is only one right answer.

B. max_new_tokens (Integer)

  • Definition: The maximum number of new tokens the model is allowed to generate.
  • Usage: Prevents the model from rambling on forever and eating up your API budget.
  • Note: This is different from context_length (which includes the input prompt).

C. min_length (Integer)

  • Definition: Forces the model to generate at least $N$ tokens.
  • Usage: Useful for summarization tasks where you don't want a one-word answer like "Good."

D. repetition_penalty (Float, usually 1.0 to 1.2)

  • Definition: A hard penalty applied to the logits of already generated tokens.
  • Math: If a token has already appeared, its logit is divided by the penalty when positive (and multiplied by it when negative), so the token becomes less likely.
  • Usage: Stronger than Frequency Penalty. Use it if the model gets stuck in a loop like "I went to the the the the...".
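
Putting these control parameters together, here is a hedged Hugging Face transformers sketch; the model name and prompt are placeholders, and the exact set of supported arguments depends on your transformers version:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder checkpoint; swap in your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a one-line docstring for a sort function.", return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=False,         # greedy search: deterministic, good for code/math
    max_new_tokens=64,       # cap on newly generated tokens (the prompt is not counted)
    min_length=20,           # lower bound on total sequence length (newer versions also offer min_new_tokens)
    repetition_penalty=1.1,  # discourages "the the the" loops
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```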

4. The Cheat Sheet: Scenarios & Values

Use this table as a starting point for your applications.

| Scenario | Temperature | Top-P | do_sample | Repetition Penalty | Why? |
| --- | --- | --- | --- | --- | --- |
| Code Generation | 0.0 - 0.2 | 0.1 | False | 1.0 | Code must be precise. Syntax errors are fatal. We want the most likely (correct) token every time. |
| Data Extraction | 0.0 | 0.0 | False | 1.0 | We want consistent JSON/CSV output. No creativity allowed. |
| Chatbot (Support) | 0.5 - 0.7 | 0.8 | True | 1.05 | Friendly but accurate. Needs to vary phrasing slightly but stay on topic. |
| Creative Writing | 0.8 - 1.0 | 0.9 | True | 1.1 - 1.2 | Needs to be surprising and avoid repetition. High temp allows "interesting" word choices. |
| Brainstorming | 1.0+ | 1.0 | True | 1.0 | We want wild ideas. Hallucination is actually a feature here. |
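
If you want the table as code, a small dictionary of defaults is a convenient starting point; the keys and values below are just one way to encode the presets above, so adjust them for your own stack:

```python
GENERATION_PRESETS = {
    "code_generation":  {"temperature": 0.1, "top_p": 0.1, "do_sample": False, "repetition_penalty": 1.0},
    "data_extraction":  {"temperature": 0.0, "top_p": 0.0, "do_sample": False, "repetition_penalty": 1.0},
    "support_chatbot":  {"temperature": 0.6, "top_p": 0.8, "do_sample": True,  "repetition_penalty": 1.05},
    "creative_writing": {"temperature": 0.9, "top_p": 0.9, "do_sample": True,  "repetition_penalty": 1.15},
    "brainstorming":    {"temperature": 1.1, "top_p": 1.0, "do_sample": True,  "repetition_penalty": 1.0},
}

def generation_kwargs(scenario: str) -> dict:
    """Return a copy of the preset so callers can tweak values safely."""
    return dict(GENERATION_PRESETS[scenario])
```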

5. Real-World Application: Tuning a Summarizer

  • Goal: Summarize a news article.
  • Attempt 1 (Temp 1.0): The model adds its own opinions and uses flowery language. (Bad).
  • Attempt 2 (Temp 0.0): The model copies sentences verbatim from the text. (Boring, but accurate).
  • Optimal (Temp 0.3): The model rephrases sentences slightly but sticks strictly to the facts.
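
As a sketch of what that tuning loop might look like with the transformers summarization pipeline (the checkpoint and kwargs are illustrative, and how generation kwargs are forwarded can vary by version):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # example checkpoint

article = "..."  # the news article text goes here

# Attempt 2 style: deterministic, tends to copy phrasing verbatim
strict = summarizer(article, do_sample=False, max_new_tokens=120)

# Optimal style: light sampling, mild rephrasing, still fact-bound
balanced = summarizer(article, do_sample=True, temperature=0.3, top_p=0.9, max_new_tokens=120)

print(strict[0]["summary_text"])
print(balanced[0]["summary_text"])
```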

Summary & Key Takeaways

  • Temperature: Controls "Risk". Low = Safe/Repetitive. High = Creative/Random.
  • Top-P: Controls "Vocabulary Size". Low = Limited. High = Diverse.
  • do_sample: Set to False for deterministic tasks (Math/Code).
  • Repetition Penalty: Use this if the model gets stuck in loops.

Practice Quiz: Test Your Intuition

  1. Scenario: You are building a SQL query generator. The user asks "Show me all users." You want the model to output SELECT * FROM users; every single time. What settings do you use?

    • A) Temp = 0.0, do_sample = False
    • B) Temp = 1.0, do_sample = True
    • C) Top-P = 0.9
  2. Scenario: Your chatbot keeps repeating the same phrase "I apologize for the inconvenience" three times in one paragraph. Which parameter should you increase?

    • A) Temperature
    • B) Repetition Penalty
    • C) Top-K
  3. Scenario: Why is High Temperature bad for math problems?

    • A) It makes the model slower.
    • B) Math has only one correct answer. High temp makes the model pick "less likely" (wrong) numbers.
    • C) It uses more tokens.

(Answers: 1-A, 2-B, 3-B)

Written by Abstract Algorithms (@abstractalgorithms)