
LLM Terms You Should Know: A Helpful Glossary

A dictionary for the language of Large Language Models. This guide decodes the essential jargon, from Attention to Zero-Shot.

Abstract Algorithms · 13 min read

TLDR: The world of LLMs has its own dense vocabulary. This post is your decoder ring, covering foundation terms (tokens, context window), generation settings (temperature, top-p), safety concepts (hallucination, grounding), and architecture terms (attention, fine-tuning, RAG).


📖 Why LLM Jargon Matters (and How to Learn It Fast)

You're reading a paper about GPT-4 and hit "temperature", "top-p", "logits" in the same paragraph. None are defined. This glossary exists for that moment.

Example (reading an API config): A colleague sends you { "temperature": 0.2, "top_p": 0.9, "max_tokens": 512 }. Right now that might look like noise. By the end of this post you'll know that temperature: 0.2 means "stay focused, don't improvise", top_p: 0.9 means "sample only from the tokens that make up the top 90% of probability mass", and max_tokens: 512 caps the response at roughly 380 words.

But the field comes with a steep vocabulary. Understanding the terms unlocks the ability to make better product decisions, debug unexpected behavior, and communicate clearly with AI practitioners.

This glossary is organized by topic: foundation, generation, safety, architecture, and deployment.


๐Ÿ” Foundation Terms: Tokens, Context, and Embeddings

Token The basic unit of text an LLM processes. A token is roughly 4 characters or ¾ of a word in English. "Hello world" ≈ 2–3 tokens. LLMs read and output text as sequences of tokens, not characters.

Context Window The maximum number of tokens an LLM can process in a single interaction (prompt and response combined). GPT-4 has a 128k-token context window (≈ 100,000 words). Requests longer than the context window are truncated.
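As a rough sanity check, you can budget a request with the ~4-characters-per-token heuristic. This is an estimate only; production code should count with the model's real tokenizer:

```python
# Rough context-window budget check using the ~4-chars-per-token heuristic.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_response_tokens: int,
                 context_window: int = 128_000) -> bool:
    # Prompt and response share the same window, so budget both together.
    return estimate_tokens(prompt) + max_response_tokens <= context_window

prompt = "Summarize this report. " * 100
print(fits_context(prompt, max_response_tokens=512))  # True for this short prompt
```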

Embedding A dense vector (list of numbers) that represents the meaning of a piece of text in high-dimensional space. Semantically similar phrases have similar embeddings. Used for search, similarity, and retrieval.

Prompt The input text you provide to an LLM. Everything from instructions to examples to context goes in the prompt.

Completion The LLM's output in response to a prompt. Also called the generation or response.

| Term | Simple definition |
| --- | --- |
| Token | ~4 chars; building block of text for LLMs |
| Context window | Max total tokens the model can see at once |
| Embedding | Number vector representing text meaning |
| Prompt | Input text to the model |
| Completion | Output text from the model |

📊 LLM Key Concepts Map

flowchart TD
    LLM[Large Language Model] --> TK[Tokenization]
    LLM --> EM[Embeddings]
    LLM --> AT[Attention]
    LLM --> FT[Fine-tuning]
    LLM --> RL[RLHF]
    RL --> IT[Instruction Tuning]

โš™๏ธ Generation Controls: Temperature, Top-p, and Top-k

These settings control how "creative" vs "deterministic" the model's output is.

Temperature Scales the probability distribution before sampling. Higher = more random, more creative; lower = more focused, more predictable.

  • temperature = 0.0 → always picks the most likely token (deterministic)
  • temperature = 1.0 → samples according to the raw probability distribution
  • temperature > 1.0 → more random; occasional incoherence
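To make "scales the probability distribution" concrete, here is a minimal pure-Python sketch of temperature applied to raw logits before softmax (the logit values are toy numbers, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]          # raw scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot  = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])  # mass concentrates on the top token
print([round(p, 3) for p in hot])   # mass spreads across all tokens
```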

Top-p (nucleus sampling) Only sample from the smallest set of tokens whose cumulative probability exceeds p. top_p = 0.9 means "only consider tokens whose combined probability mass is in the top 90%." Ignores low-probability tokens dynamically.

Top-k Only sample from the top k most probable next tokens. top_k = 40 means always pick from the 40 highest-probability candidates.
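The two filters above can be sketched in a few lines of pure Python. The probabilities are toy values; real implementations apply these cutoffs to logits over a vocabulary of tens of thousands of tokens:

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability is at least p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in ranked:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    kept = [pr if i in keep else 0.0 for i, pr in enumerate(probs)]
    total = sum(kept)
    return [pr / total for pr in kept]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))   # only the two best tokens survive
print(top_p_filter(probs, 0.9)) # 0.5 + 0.3 < 0.9, so a third token is needed
```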

graph LR
    A[Logits from model] --> B[Apply Temperature scaling]
    B --> C{Sampling method}
    C -->|Top-k| D[Keep top k tokens]
    C -->|Top-p| E[Keep tokens until cumulative prob = p]
    D --> F[Sample final token]
    E --> F
| Setting | Low value effect | High value effect |
| --- | --- | --- |
| Temperature | Deterministic, repetitive | Creative, unpredictable |
| Top-p (0–1) | Narrow, safe choices | Broader, more varied |
| Top-k | Very focused | More exploratory |

🔒 Safety and Quality Terms: Hallucination, Grounding, and Bias

Hallucination When an LLM confidently generates false or nonsensical information. The model predicts probable-sounding tokens without checking facts.

  • Example: Ask "Who was the 12th US President in 1750?" and the model invents a name.
  • Mitigation: Grounding, RAG, source attribution, user education.

Grounding Connecting model output to verifiable sources or real-world data to reduce hallucinations.

  • Example: A customer-support bot that only answers from the company's help-center documents.
  • Method: Retrieval-Augmented Generation (RAG): retrieve relevant documents and pass them in the prompt.
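A minimal sketch of the prompt-assembly step in a grounded system, assuming retrieval has already returned the relevant chunks (the instruction wording below is illustrative, not a canonical template):

```python
def build_grounded_prompt(question, retrieved_docs):
    # Inject retrieved passages so the model answers only from them.
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    return (
        "Answer using ONLY the documents below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = ["Refunds are available within 30 days of purchase."]
print(build_grounded_prompt("What is the refund window?", docs))
```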

Bias Systematic errors or prejudices inherited from training data.

  • Example: A model trained on older text may associate "Doctor" with male pronouns.
  • Mitigation: Diverse training data, debiasing techniques, output auditing.

Guardrails Rules, filters, or model layers that prevent the LLM from producing harmful, inappropriate, or policy-violating content. Implemented via system prompts, output classifiers, or constitutional AI techniques.
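As an illustration only, a naive keyword-based output guardrail might look like the sketch below; production guardrails use trained classifiers or model-based checks rather than string matching:

```python
def simple_guardrail(response, blocked_topics):
    # Naive output filter: block responses mentioning disallowed topics.
    lowered = response.lower()
    for topic in blocked_topics:
        if topic in lowered:
            return "I can't help with that request."
    return response

print(simple_guardrail("Here is how to reset your password...",
                       ["credit card number"]))
```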


🧠 Architecture Deep Dive: Attention, Fine-Tuning, and RAG

Attention (Self-Attention) The core mechanism of Transformer models. Each token attends to every other token in the context window, weighted by relevance. This is what allows LLMs to handle long-range dependencies.
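A pure-Python sketch of scaled dot-product attention for a single query vector. Real Transformers batch this as matrix multiplications over learned query/key/value projections; the vectors below are toy values:

```python
import math

def attention(query, keys, values):
    # Scaled dot-product attention for one query over a short sequence.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over relevance scores
    # Output is the relevance-weighted mix of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # the first key matches the query
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, keys, values))    # output leans toward the first value
```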

Fine-Tuning Additional training of a pre-trained model on a smaller, task-specific dataset to improve performance for a specific domain or task format.

  • Pre-training: Train on the internet (general language).
  • Fine-tuning: Train on medical records → a medical LLM.
  • RLHF (Reinforcement Learning from Human Feedback): Fine-tune using human preference rankings to align the model with desired behavior.

RAG (Retrieval-Augmented Generation) A technique that retrieves relevant documents from an external knowledge base and injects them into the prompt before generation. Solves the knowledge-cutoff problem without retraining.

Zero-Shot / Few-Shot

  • Zero-shot: No examples provided; just ask directly.
  • Few-shot: Provide 2–5 examples in the prompt to guide format and style.
  • Chain-of-thought: Ask the model to reason step-by-step before giving a final answer.
| Architecture term | Key idea |
| --- | --- |
| Attention | Each token relates to all others; enables context comprehension |
| Fine-tuning | Adapt a pre-trained model to specific tasks |
| RLHF | Align model behavior using human preference data |
| RAG | Inject retrieved documents to ground generation in current facts |
| Zero-shot | Task without examples |
| Few-shot | Task with 2–5 in-prompt examples |
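The zero-shot vs. few-shot distinction is easiest to see as prompt strings. The review examples below are invented for illustration:

```python
# Zero-shot: ask directly, no examples.
zero_shot = "Classify the sentiment of: 'The battery died after an hour.'"

# Few-shot: a couple of in-prompt examples teach the format and label set.
few_shot = (
    "Classify the sentiment as positive or negative.\n\n"
    "Review: 'Love this phone!' -> positive\n"
    "Review: 'Screen cracked on day one.' -> negative\n\n"
    "Review: 'The battery died after an hour.' ->"
)
print(few_shot)
```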

📊 RAG vs Fine-tuning

flowchart LR
    NW[New Knowledge] --> D{Update method?}
    D -- Static --> FT[Fine-tune model]
    D -- Dynamic --> RAG[RAG retrieval]
    FT --> MO[Model weights updated]
    RAG --> CT[Context injected]

📊 How an LLM Processes a Request

Understanding the data flow from your prompt to the model's response helps demystify every term in this glossary. Each step maps to one or more of the concepts defined above.

graph TD
    A[User types a Prompt] --> B[Tokenizer splits prompt into Tokens]
    B --> C[Token IDs fed into LLM within Context Window]
    C --> D[Self-Attention layers compute relationships between all tokens]
    D --> E[Logits output โ€” raw scores for every possible next token]
    E --> F{Sampling settings applied}
    F -->|Temperature + Top-p/Top-k| G[Next token sampled]
    G --> H{End of sequence?}
    H -->|No| C
    H -->|Yes| I[Completion returned to user]

This loop (tokenize, attend, sample, repeat) is the heartbeat of every LLM interaction. Parameters like temperature and top-p only affect step F; everything else is deterministic given the same weights.

Embedding retrieval in a RAG system adds a branch before step A: the query is embedded into a vector, the nearest document chunks are retrieved from a vector store, and those chunks are prepended to the prompt. The rest of the loop is unchanged.


๐ŸŒ Real-World Applications: Deployment Terms: Inference, Quantization, and Latency

Inference The process of generating output from a trained model given an input. This is what happens at serving time (as opposed to training time).

Quantization Reducing model precision (e.g., from 32-bit floats to 8-bit integers) to shrink model size and speed up inference, at a small accuracy cost. Essential for running large models on constrained hardware.
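A toy sketch of symmetric int8 quantization, mapping a handful of float weights onto integers in [-127, 127]. Real quantizers work per-layer or per-channel over millions of weights, but the round-trip error shown here is the same idea:

```python
def quantize_int8(weights):
    # Symmetric int8 quantization: map floats in [-max, max] to [-127, 127].
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.42, -1.37, 0.08, 0.91]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
print(q)        # small integers instead of 32-bit floats
print(restored) # close to the originals, with small rounding error
```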

Latency vs. Throughput

  • Latency: Time from prompt submission to first token received (Time To First Token, TTFT).
  • Throughput: Tokens generated per second across all parallel requests.
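A sketch of how you might measure both numbers against a streaming API, using a stand-in generator in place of a real token stream:

```python
import time

def measure_stream(token_iter):
    # Latency = time to first token (TTFT); throughput = tokens per second.
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return first, count / total

def fake_stream(n=20, delay=0.005):
    # Stand-in for a real streaming API; each yield is one generated token.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tokens/s")
```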

Prompt Engineering The practice of designing prompts to reliably elicit the desired behavior from an LLM, without changing model weights.


โš–๏ธ Trade-offs & Failure Modes: LLM Generation Settings

| Setting | Risk when too low | Risk when too high |
| --- | --- | --- |
| Temperature | Repetitive, loops on same phrases | Incoherent, factually unreliable output |
| Context window usage | Critical context gets truncated | Higher inference cost and latency |
| Top-k / Top-p | Over-constrained, stiff responses | Noise from low-probability tokens |

Hallucination is the dominant failure mode across all LLM deployments. Grounding via RAG or source-constrained system prompts is the primary mitigation. Fine-tuning reduces hallucination in narrow domains but does not eliminate it.


🧭 Quick Reference: Term → Concept in One Line

| Term | One-line definition |
| --- | --- |
| Token | Unit of text an LLM processes (~4 chars) |
| Context window | Max tokens, input + output combined |
| Temperature | Controls randomness of generation |
| Top-p | Nucleus sampling: limits to top cumulative-probability tokens |
| Hallucination | Model confidently generates false information |
| Grounding | Connecting output to verifiable sources |
| Fine-tuning | Adapting a pre-trained model to a specific task |
| RAG | Retrieving documents to inject into the prompt for factual grounding |
| RLHF | Training with human preference rankings |
| Quantization | Reducing numeric precision to shrink model size |


🧪 Hands-On: Spot the Term in the Wild

Reading definitions is step one. Recognizing terms in real product documentation and error messages is step two. Work through each scenario below and identify the relevant concept.

Scenario 1: You query GPT-4 with a 150,000-word document and receive an error saying the input exceeds the model limit. Which term applies? Context window: the document exceeds the 128k-token context limit. Solution: chunk the document and use RAG to retrieve only the relevant sections.

Scenario 2: A chatbot confidently tells a user that a product was launched in 2023, but the product actually launched in 2024. Which term applies? Hallucination: the model generated a plausible-sounding but incorrect date. Mitigation: ground the bot with a RAG pipeline pointing to current product documentation.

Scenario 3: You set temperature=0 for a code generation task and notice the model produces identical output on every run. Which term applies? Temperature: setting it to zero makes sampling deterministic; the model always picks the highest-probability next token.

Scenario 4: Your team wants to deploy a 70B parameter model on a laptop GPU with 8 GB VRAM. Which technique is required? Quantization: reducing weights from 32-bit floats to 4-bit integers (GGUF/GGML format) makes this feasible while preserving most model quality.

Scenario 5: A developer says "We fine-tuned the model on our support tickets but it still hallucinates product specs." What would you recommend? RAG: fine-tuning teaches style and format but does not reliably inject factual data. A RAG pipeline with an up-to-date product spec index directly addresses this failure mode.


๐Ÿ› ๏ธ HuggingFace Transformers: Seeing Every LLM Term in Running Code

HuggingFace Transformers is the dominant open-source library for working with LLMs in Python. It exposes tokenizers, generation configs, embedding models, and inference pipelines through a unified API, making it the fastest way to see every term in this glossary in action with real, runnable code.

The library demonstrates all the key concepts from this post concretely: you can inspect tokens, tune temperature/top-p, measure context window usage, and run embedding similarity in under 30 lines of code.

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

# ── Token: how text becomes integers ──────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization converts text to token IDs."
tokens = tokenizer(text)
print("Token IDs:  ", tokens["input_ids"])
# → [30642, 1634, 4578, 2420, 284, 11241, 32373, 13]
print("Token count:", len(tokens["input_ids"]))  # → 8
print("Decoded: ", [tokenizer.decode([t]) for t in tokens["input_ids"]])
# → ['Token', 'ization', ' converts', ' text', ' to', ' token', ' IDs', '.']

# ── Context Window: checking whether a prompt fits ────────────────────────────
MAX_TOKENS = 1024   # GPT-2's context window
long_text = "word " * 500  # 500 repeated words; encodes to roughly 500 tokens
encoded_len = len(tokenizer(long_text)["input_ids"])
print(f"Prompt fits in context: {encoded_len < MAX_TOKENS}")  # → True

# ── Temperature + Top-p + Top-k: controlling generation ───────────────────────
model = AutoModelForCausalLM.from_pretrained("gpt2")

gen_config_focused = GenerationConfig(
    max_new_tokens=30,
    temperature=0.2,    # low → focused, near-deterministic
    top_p=0.9,          # nucleus sampling: keep top 90% probability mass
    top_k=40,           # also limit to top 40 token candidates
    do_sample=True,     # sampling params are ignored unless this is True
    pad_token_id=50256, # GPT-2 has no pad token; reuse EOS to silence the warning
)

gen_config_creative = GenerationConfig(
    max_new_tokens=30,
    temperature=1.1,    # high → creative, more surprising
    top_p=0.95,
    do_sample=True,
    pad_token_id=50256,
)

prompt = tokenizer("The best database for caching is", return_tensors="pt")
with torch.no_grad():
    focused_ids  = model.generate(**prompt, generation_config=gen_config_focused)
    creative_ids = model.generate(**prompt, generation_config=gen_config_creative)

print("Focused:  ", tokenizer.decode(focused_ids[0],  skip_special_tokens=True))
print("Creative: ", tokenizer.decode(creative_ids[0], skip_special_tokens=True))

# ── Embedding + Cosine Similarity (Grounding / RAG basis) ─────────────────────
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

query = "What database is good for caching?"
docs  = [
    "Redis is an in-memory key-value store used for caching.",
    "PostgreSQL is a relational database for ACID transactions.",
    "MongoDB is a document store for flexible schemas.",
]

query_vec = torch.tensor(embed_model.encode(query))
doc_vecs  = torch.tensor(embed_model.encode(docs))

similarities = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs)
best_idx = similarities.argmax().item()
print(f"\nMost relevant doc (sim={similarities[best_idx]:.3f}): {docs[best_idx]}")
# → Most relevant doc (sim≈0.72): Redis is an in-memory key-value store...

Every term in this glossary (token, context window, temperature, top-p, embedding, cosine similarity) appears as a named variable or parameter in this snippet. The GenerationConfig object is the recommended way to set sampling parameters in HuggingFace; note that temperature, top_p, and top_k only take effect when do_sample=True, and are ignored under greedy decoding.

For a full deep-dive on HuggingFace Transformers for inference and fine-tuning, a dedicated follow-up post is planned.


📚 Key Lessons from LLM Vocabulary

Learning these terms correctly prevents costly mistakes when building AI products.

Tokens are the unit of cost, not words. API pricing is per token. A 10,000-word document is roughly 13,000 tokens in English. In Japanese or Korean the same semantic content may be 2–3× more tokens. Always count tokens before estimating bills.
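A back-of-the-envelope cost estimate using the ~¾-word-per-token rule; the price used below is hypothetical, so check your provider's actual rates:

```python
def estimate_cost(word_count, price_per_1k_tokens, tokens_per_word=4 / 3):
    # ~1.33 tokens per English word, per the ~4-chars / ~3/4-word rule of thumb.
    tokens = word_count * tokens_per_word
    return tokens / 1000 * price_per_1k_tokens

# A 10,000-word document at a hypothetical $0.01 per 1k input tokens:
print(f"${estimate_cost(10_000, 0.01):.2f}")  # ≈ 13,333 tokens -> about $0.13
```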

Hallucination is a feature in the wrong place, not a bug. The same probabilistic generation that produces creative fiction also invents fake citations. For factual tasks, always apply grounding via RAG or strict output validation.

Temperature is not quality. Higher temperature does not mean better responses. For coding, math, and factual retrieval, low temperature (0–0.3) produces more reliable output. For creative tasks, moderate temperature (0.7–1.0) adds useful variety.

RAG and fine-tuning are complementary, not competing. Fine-tuning adapts style and format; RAG provides fresh facts. The highest-quality production systems combine both.

Prompt engineering is real engineering. Well-structured prompts with clear instructions, few-shot examples, and explicit output format constraints dramatically improve reliability, often more than switching to a larger model.


📌 TLDR: Summary & Key Takeaways

  • LLMs process text as tokens, not characters; understanding token count matters for prompt design and cost.
  • Temperature, top-p, and top-k are knobs that trade off creativity vs. predictability.
  • Hallucination is a fundamental property of probabilistic generation; grounding and RAG reduce it.
  • Fine-tuning adapts a general model to a domain; RAG grounds it in current facts without retraining.
  • RLHF aligns model behavior with human preferences; it is the primary technique behind instruction-following models like ChatGPT.

๐Ÿ“ Practice Quiz

  1. What does "temperature = 0" mean for LLM generation?

    • A) The model refuses to generate
    • B) The model always picks the most probable next token, producing deterministic output
    • C) The model generates very long responses
    • D) The model uses only its training data

    Correct Answer: B. Temperature = 0 removes all randomness; the model greedily selects the highest-probability token at every step.

  2. What is the key advantage of RAG over fine-tuning for keeping an LLM up to date?

    • A) RAG is always cheaper to run
    • B) RAG retrieves current documents at inference time without retraining the model
    • C) RAG improves the model's mathematical reasoning
    • D) RAG increases the model's context window

    Correct Answer: B. Fine-tuning knowledge is frozen at training time. RAG queries a live index so the model can answer about events that happened after its training cutoff.

  3. What causes LLM hallucination?

    • A) The context window is too small
    • B) The model predicts probable-sounding tokens based on patterns, without verifying factual accuracy
    • C) Temperature is set too low
    • D) The prompt is too short

    Correct Answer: B. LLMs are next-token predictors. They output statistically plausible sequences even when no factual grounding exists for the claim.



Tags

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms