LLM Terms You Should Know: A Helpful Glossary
A dictionary for the language of Large Language Models. This guide decodes the essential jargon, from Attention to Zero-Shot.
Abstract Algorithms
TLDR: The world of LLMs has its own dense vocabulary. This post is your decoder ring, covering foundation terms (tokens, context window), generation settings (temperature, top-p), safety concepts (hallucination, grounding), and architecture terms (attention, fine-tuning, RAG).
Why LLM Jargon Matters (and How to Learn It Fast)
You're reading a paper about GPT-4 and hit "temperature," "top-p," and "logits" in the same paragraph. None are defined. This glossary exists for that moment.
Example: reading an API config. A colleague sends you { "temperature": 0.2, "top_p": 0.9, "max_tokens": 512 }. Right now that might look like noise. By the end of this post you'll know that temperature: 0.2 means "stay focused, don't improvise", top_p: 0.9 means "consider only the most probable 90% of next-token candidates", and max_tokens: 512 caps the response at roughly 380 words.
But the field comes with a steep vocabulary. Understanding the terms unlocks the ability to make better product decisions, debug unexpected behavior, and communicate clearly with AI practitioners.
This glossary is organized by topic: foundation, generation, safety, architecture, and deployment.
Foundation Terms: Tokens, Context, and Embeddings
Token The basic unit of text an LLM processes. A token is roughly 4 characters or ¾ of a word in English; "Hello world" is 2–3 tokens. LLMs read and output text as sequences of tokens, not characters.
Context Window The maximum number of tokens an LLM can process in a single interaction, covering the prompt and the response combined. GPT-4 has a 128k-token context window (roughly 100,000 words). Requests longer than the context window are truncated.
Embedding A dense vector (list of numbers) that represents the meaning of a piece of text in high-dimensional space. Semantically similar phrases have similar embeddings. Used for search, similarity, and retrieval.
Prompt The input text you provide to an LLM. Everything from instructions to examples to context goes in the prompt.
Completion The LLM's output in response to a prompt. Also called the generation or response.
| Term | Simple definition |
| --- | --- |
| Token | ~4 chars; building block of text for LLMs |
| Context window | Max total tokens the model can see at once |
| Embedding | Number vector representing text meaning |
| Prompt | Input text to the model |
| Completion | Output text from the model |
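The ~4-characters-per-token rule of thumb from the table above can be turned into a quick sanity check in a few lines. This is a minimal sketch, not a real tokenizer; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and the 128k default mirrors the GPT-4 context window mentioned above.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic."""
    return max(1, round(len(text) / 4))

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    """Check whether a prompt likely fits within a model's context window."""
    return estimate_tokens(text) <= context_window

print(estimate_tokens("Hello world"))  # → 3
print(fits_in_context("Hello world"))  # → True
```

For billing or truncation decisions you would use the model's real tokenizer, since heuristics drift badly on code and non-English text.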
LLM Key Concepts Map

```mermaid
flowchart TD
    LLM[Large Language Model] --> TK[Tokenization]
    LLM --> EM[Embeddings]
    LLM --> AT[Attention]
    LLM --> FT[Fine-tuning]
    LLM --> RL[RLHF]
    RL --> IT[Instruction Tuning]
```
Generation Controls: Temperature, Top-p, and Top-k
These settings control how "creative" vs "deterministic" the model's output is.
Temperature Scales the probability distribution before sampling. Higher = more random, more creative; lower = more focused, more predictable.
- temperature = 0.0 → always picks the most likely token (deterministic)
- temperature = 1.0 → samples according to the raw probability distribution
- temperature > 1.0 → more random; occasional incoherence
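Temperature scaling is easy to see in plain Python: divide the logits by T before applying softmax. This is an illustrative sketch with made-up logit values, not any particular model's output.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax; as T approaches 0 this becomes greedy argmax."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0  # all mass on the top token
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up next-token scores
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running this shows the distribution sharpening at T = 0.2 and flattening at T = 2.0, which is exactly the focused-vs-creative trade-off described above.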
Top-p (nucleus sampling)
Only sample from the smallest set of tokens whose cumulative probability exceeds p. top_p = 0.9 means "only consider tokens whose combined probability mass is in the top 90%." Ignores low-probability tokens dynamically.
Top-k
Only sample from the top k most probable next tokens. top_k = 40 means always pick from the 40 highest-probability candidates.
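Both filters can be sketched on a toy next-token distribution. The probabilities below are invented for illustration; real sampling then renormalizes over the kept tokens and draws one at random.

```python
def top_k_indices(probs, k):
    """Indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])

def top_p_indices(probs, p):
    """Smallest set of highest-probability indices whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return sorted(kept)

probs = [0.5, 0.25, 0.15, 0.07, 0.03]  # toy next-token distribution
print(top_k_indices(probs, 2))    # → [0, 1]
print(top_p_indices(probs, 0.9))  # → [0, 1, 2]
```

Note the dynamic behavior of top-p: a peaked distribution keeps few tokens, a flat one keeps many, whereas top-k always keeps exactly k.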
```mermaid
graph LR
    A[Logits from model] --> B[Apply Temperature scaling]
    B --> C{Sampling method}
    C -->|Top-k| D[Keep top k tokens]
    C -->|Top-p| E[Keep tokens until cumulative prob reaches p]
    D --> F[Sample final token]
    E --> F
```
| Setting | Low value effect | High value effect |
| --- | --- | --- |
| Temperature | Deterministic, repetitive | Creative, unpredictable |
| Top-p (0–1) | Narrow, safe choices | Broader, more varied |
| Top-k | Very focused | More exploratory |
Safety and Quality Terms: Hallucination, Grounding, and Bias
Hallucination When an LLM confidently generates false or nonsensical information. The model predicts probable-sounding tokens without checking facts.
- Example: Ask "Who was the 12th US President in 1750?" and the model invents a name.
- Mitigation: Grounding, RAG, source attribution, user education.
Grounding Connecting model output to verifiable sources or real-world data to reduce hallucinations.
- Example: A customer-support bot that only answers from the company's help-center documents.
- Method: Retrieval-Augmented Generation (RAG) โ retrieve relevant documents and pass them in the prompt.
Bias Systematic errors or prejudices inherited from training data.
- Example: A model trained on older text may associate "Doctor" with male pronouns.
- Mitigation: Diverse training data, debiasing techniques, output auditing.
Guardrails Rules, filters, or model layers that prevent the LLM from producing harmful, inappropriate, or policy-violating content. Implemented via system prompts, output classifiers, or constitutional AI techniques.
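A production guardrail typically uses a trained classifier or a system-prompt policy; the keyword filter below is only a toy stand-in to make the concept concrete. `BLOCKLIST` and `passes_guardrail` are hypothetical names invented for this sketch.

```python
BLOCKLIST = {"credit card number", "social security"}  # hypothetical policy phrases

def passes_guardrail(output: str) -> bool:
    """Naive output filter: reject responses containing policy-violating phrases."""
    lowered = output.lower()
    return not any(term in lowered for term in BLOCKLIST)

print(passes_guardrail("Here is the weather forecast."))       # → True
print(passes_guardrail("Your social security number is ..."))  # → False
```

Real systems layer several of these checks: an input filter before the model, an output classifier after it, and policy instructions in the system prompt.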
Deep Dive: Architecture Terms (Attention, Fine-Tuning, and RAG)
Attention (Self-Attention) The core mechanism of Transformer models. Each token attends to every other token in the context window, weighted by relevance. This is what allows LLMs to handle long-range dependencies.
Fine-Tuning Additional training of a pre-trained model on a smaller, task-specific dataset to improve performance for a specific domain or task format.
- Pre-training: Train on the internet (general language).
- Fine-tuning: Train on medical records to get a medical LLM.
- RLHF (Reinforcement Learning from Human Feedback): Fine-tune using human preference rankings to align the model with desired behavior.
RAG (Retrieval-Augmented Generation) A technique that retrieves relevant documents from an external knowledge base and injects them into the prompt before generation. Solves the knowledge-cutoff problem without retraining.
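The retrieve-then-inject flow can be sketched without any model at all. Real RAG ranks documents by embedding similarity; the word-overlap scorer here is a deliberately crude stand-in, and all names (`retrieve`, `build_rag_prompt`) are invented for illustration.

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (toy stand-in for embeddings)."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Inject the retrieved documents into the prompt before generation."""
    context = "\n".join(retrieve(query, docs, k=1))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "redis is an in memory cache",
    "postgres is a relational database",
]
print(retrieve("which cache is in memory", docs))  # → ['redis is an in memory cache']
```

Swapping the overlap scorer for cosine similarity over embeddings (as in the HuggingFace snippet later in this post) turns this sketch into the standard RAG retrieval step.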
Zero-Shot / Few-Shot
- Zero-shot: No examples provided; just ask directly.
- Few-shot: Provide 2–5 examples in the prompt to guide format and style.
- Chain-of-thought: Ask the model to reason step-by-step before giving a final answer.
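A few-shot prompt is just structured string assembly. The helper below is a minimal sketch; the `Input:`/`Output:` labels are one common convention, not a requirement.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the real query."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}\n")
    parts.append(f"Input: {query}\nOutput:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I love this!", "positive"), ("Terrible service.", "negative")],
    "The food was great.",
)
print(prompt)
```

Ending the prompt at `Output:` nudges the model to complete the pattern the examples established, which is the whole point of few-shot prompting.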
| Architecture term | Key idea |
| --- | --- |
| Attention | Each token relates to all others; enables context comprehension |
| Fine-tuning | Adapt a pre-trained model to specific tasks |
| RLHF | Align model behavior using human preference data |
| RAG | Inject retrieved documents to ground generation in current facts |
| Zero-shot | Task without examples |
| Few-shot | Task with 2–5 in-prompt examples |
RAG vs Fine-tuning

```mermaid
flowchart LR
    NW[New Knowledge] --> D{Update method?}
    D -- Static --> FT[Fine-tune model]
    D -- Dynamic --> RAG[RAG retrieval]
    FT --> MO[Model weights updated]
    RAG --> CT[Context injected]
```
How an LLM Processes a Request
Understanding the data flow from your prompt to the model's response helps demystify every term in this glossary. Each step maps to one or more of the concepts defined above.
```mermaid
graph TD
    A[User types a Prompt] --> B[Tokenizer splits prompt into Tokens]
    B --> C[Token IDs fed into LLM within Context Window]
    C --> D[Self-Attention layers compute relationships between all tokens]
    D --> E[Logits output, raw scores for every possible next token]
    E --> F{Sampling settings applied}
    F -->|Temperature + Top-p/Top-k| G[Next token sampled]
    G --> H{End of sequence?}
    H -->|No| C
    H -->|Yes| I[Completion returned to user]
```
This loop (tokenize, attend, sample, repeat) is the heartbeat of every LLM interaction. Parameters like temperature and top-p only affect step F; everything else is deterministic given the same weights.
Embedding retrieval in a RAG system adds a branch before step A: the query is embedded into a vector, the nearest document chunks are retrieved from a vector store, and those chunks are prepended to the prompt. The rest of the loop is unchanged.
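The tokenize-attend-sample-repeat loop can be sketched in plain Python. The model here is a stub: `fake_logits` is invented for illustration and just counts upward, where a real LLM would run a transformer forward pass at that step.

```python
EOS = 0  # hypothetical end-of-sequence token id

def fake_logits(context):
    """Stand-in for a real transformer forward pass (purely illustrative)."""
    if len(context) >= 5:
        return {EOS: 5.0, context[-1] + 1: 1.0}  # the stub "wants" to stop
    return {EOS: 0.1, context[-1] + 1: 5.0}      # the stub continues counting

def generate(prompt_ids, max_new_tokens=10):
    """One token per iteration until EOS: the autoregressive generation loop."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_logits(context)          # "attend" step
        next_id = max(logits, key=logits.get)  # greedy pick, i.e. temperature 0
        if next_id == EOS:                     # end-of-sequence check
            break
        context.append(next_id)
    return context

print(generate([1]))  # → [1, 2, 3, 4, 5]
```

A real decoder would replace the greedy `max` with temperature scaling plus top-p/top-k sampling, exactly as in the diagram's step F.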
Deployment Terms: Inference, Quantization, and Latency
Inference The process of generating output from a trained model given an input. This is what happens at serving time (as opposed to training time).
Quantization Reducing model precision (e.g., from 32-bit floats to 8-bit integers) to shrink model size and speed up inference, at a small accuracy cost. Essential for running large models on constrained hardware.
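The core arithmetic of symmetric int8 quantization fits in a few lines. This is a sketch of the idea only; production schemes (per-channel scales, GPTQ, GGUF 4-bit formats) are considerably more sophisticated.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.5, 0.33]  # toy weight values
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q)                                 # → [30, -127, 84]
print([round(w, 3) for w in recovered])  # → [0.118, -0.5, 0.331]
```

Each weight now needs 1 byte instead of 4, at the cost of a small reconstruction error bounded by half the scale, which is the size/accuracy trade-off described above.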
Latency vs. Throughput
- Latency: Time from prompt submission to first token received (Time To First Token, TTFT).
- Throughput: Tokens generated per second across all parallel requests.
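The distinction is easiest to see with numbers. The figures below are invented for illustration, not benchmarks of any real system.

```python
# Toy serving metrics (hypothetical values)
ttft = 0.35             # seconds until the first token arrives → latency (TTFT)
tokens_generated = 256  # length of the full completion
total_time = 8.0        # seconds for the whole response

throughput = tokens_generated / total_time  # tokens per second
print(f"TTFT: {ttft}s, throughput: {throughput} tok/s")  # → TTFT: 0.35s, throughput: 32.0 tok/s
```

A chat UI mostly cares about TTFT (the user sees text quickly), while a batch pipeline mostly cares about throughput; serving optimizations usually trade one against the other.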
Prompt Engineering The practice of designing prompts to reliably elicit the desired behavior from an LLM, without changing model weights.
Trade-offs & Failure Modes: LLM Generation Settings
| Setting | Risk when too low | Risk when too high |
| --- | --- | --- |
| Temperature | Repetitive, loops on same phrases | Incoherent, factually unreliable output |
| Context window usage | Critical context gets truncated | Higher inference cost and latency |
| Top-k / Top-p | Over-constrained, stiff responses | Noise from low-probability tokens |
Hallucination is the dominant failure mode across all LLM deployments. Grounding via RAG or source-constrained system prompts is the primary mitigation. Fine-tuning reduces hallucination in narrow domains but does not eliminate it.
Quick Reference: Term → Concept in One Line
| Term | One-line definition |
| --- | --- |
| Token | Unit of text an LLM processes (~4 chars) |
| Context window | Max tokens input + output combined |
| Temperature | Controls randomness of generation |
| Top-p | Nucleus sampling; limits to top cumulative-probability tokens |
| Hallucination | Model confidently generates false information |
| Grounding | Connecting output to verifiable sources |
| Fine-tuning | Adapting a pre-trained model to a specific task |
| RAG | Retrieving documents to inject into the prompt for factual grounding |
| RLHF | Training with human preference rankings |
| Quantization | Reducing numeric precision to shrink model size |
What to Learn Next
- Tokenization Explained: How LLMs Understand Text
- LLM Hyperparameters Guide: Temperature, Top-p, and Top-k
- RAG Explained: How to Give Your LLM a Brain Upgrade
Hands-On: Spot the Term in the Wild
Reading definitions is step one. Recognizing terms in real product documentation and error messages is step two. Work through each scenario below and identify the relevant concept.
Scenario 1: You query GPT-4 with a 150,000-word document and receive an error saying the input exceeds the model limit. Which term applies? Context window: the document exceeds the 128k-token context limit. Solution: chunk the document and use RAG to retrieve only the relevant sections.
Scenario 2: A chatbot confidently tells a user that a product was launched in 2023, but the product actually launched in 2024. Which term applies? Hallucination: the model generated a plausible-sounding but incorrect date. Mitigation: ground the bot with a RAG pipeline pointing to current product documentation.
Scenario 3: You set temperature=0 for a code generation task and notice the model produces identical output on every run. Which term applies? Temperature: setting it to zero makes sampling deterministic; the model always picks the highest-probability next token.
Scenario 4: Your team wants to deploy a 70B-parameter model on a laptop GPU with 8 GB VRAM. Which technique is required? Quantization: reducing weights from 32-bit floats to 4-bit integers (GGUF/GGML format) makes this feasible while preserving most model quality.
Scenario 5: A developer says "We fine-tuned the model on our support tickets but it still hallucinates product specs." What would you recommend? RAG: fine-tuning teaches style and format but does not reliably inject factual data. A RAG pipeline with an up-to-date product spec index directly addresses this failure mode.
HuggingFace Transformers: Seeing Every LLM Term in Running Code
HuggingFace Transformers is the dominant open-source library for working with LLMs in Python. It exposes tokenizers, generation configs, embedding models, and inference pipelines through a unified API, making it the fastest way to see every term in this glossary in action with real, runnable code.
The library demonstrates all the key concepts from this post concretely: you can inspect tokens, tune temperature/top-p, measure context window usage, and run embedding similarity in under 30 lines of code.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

# -- Token: how text becomes integers -----------------------------------------
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization converts text to token IDs."
tokens = tokenizer(text)
print("Token IDs: ", tokens["input_ids"])
# → [30642, 1634, 4578, 2420, 284, 11241, 32373, 13]
print("Token count:", len(tokens["input_ids"]))  # → 8
print("Decoded: ", [tokenizer.decode([t]) for t in tokens["input_ids"]])
# → ['Token', 'ization', ' converts', ' text', ' to', ' token', ' IDs', '.']

# -- Context Window: checking whether a prompt fits ---------------------------
MAX_TOKENS = 1024  # GPT-2's context window
long_text = "word " * 500  # 500 repeated words, comfortably under the limit
encoded_len = len(tokenizer(long_text)["input_ids"])
print(f"Prompt fits in context: {encoded_len < MAX_TOKENS}")  # → True

# -- Temperature + Top-p + Top-k: controlling generation ----------------------
model = AutoModelForCausalLM.from_pretrained("gpt2")
gen_config_focused = GenerationConfig(
    max_new_tokens=30,
    temperature=0.2,  # low → focused, near-deterministic
    top_p=0.9,        # nucleus sampling: keep top 90% probability mass
    top_k=40,         # also limit to top 40 token candidates
    do_sample=True,
)
gen_config_creative = GenerationConfig(
    max_new_tokens=30,
    temperature=1.1,  # high → creative, more surprising
    top_p=0.95,
    do_sample=True,
)
prompt = tokenizer("The best database for caching is", return_tensors="pt")
with torch.no_grad():
    focused_ids = model.generate(**prompt, generation_config=gen_config_focused)
    creative_ids = model.generate(**prompt, generation_config=gen_config_creative)
print("Focused:  ", tokenizer.decode(focused_ids[0], skip_special_tokens=True))
print("Creative: ", tokenizer.decode(creative_ids[0], skip_special_tokens=True))

# -- Embedding + Cosine Similarity (Grounding / RAG basis) --------------------
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model
query = "What database is good for caching?"
docs = [
    "Redis is an in-memory key-value store used for caching.",
    "PostgreSQL is a relational database for ACID transactions.",
    "MongoDB is a document store for flexible schemas.",
]
query_vec = torch.tensor(embed_model.encode(query))
doc_vecs = torch.tensor(embed_model.encode(docs))
similarities = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs)
best_idx = similarities.argmax().item()
print(f"\nMost relevant doc (sim={similarities[best_idx]:.3f}): {docs[best_idx]}")
# → Most relevant doc (sim≈0.72): Redis is an in-memory key-value store...
```
Every term in this glossary (token, context window, temperature, top-p, embedding, cosine similarity) appears as a named variable or parameter in this snippet. The GenerationConfig object is the production-safe way to set sampling parameters in HuggingFace; it keeps all sampling settings in one validated object instead of scattering loose keyword arguments that can be silently overridden by model defaults.
For a full deep-dive on HuggingFace Transformers for inference and fine-tuning, a dedicated follow-up post is planned.
Key Lessons from LLM Vocabulary
Learning these terms correctly prevents costly mistakes when building AI products.
Tokens are the unit of cost, not words. API pricing is per token. A 10,000-word document is roughly 13,000 tokens in English. In Japanese or Korean the same semantic content may be 2–3× more tokens. Always count tokens before estimating bills.
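The cost arithmetic is worth making explicit. The price below is a hypothetical placeholder, not any provider's actual rate; check your provider's pricing page before budgeting.

```python
PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate; real prices vary by provider and model
TOKENS_PER_WORD = 4 / 3     # ~0.75 words per token in English

words = 10_000
tokens = round(words * TOKENS_PER_WORD)
cost = tokens / 1000 * PRICE_PER_1K_TOKENS
print(tokens)          # → 13333
print(round(cost, 4))  # → 0.1333
```

The same 10,000 words in Japanese or Korean could tokenize to 2–3× as many tokens, multiplying the bill accordingly.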
Hallucination is a feature in the wrong place, not a bug. The same probabilistic generation that produces creative fiction also invents fake citations. For factual tasks, always apply grounding via RAG or strict output validation.
Temperature is not quality. Higher temperature does not mean better responses. For coding, math, and factual retrieval, low temperature (0–0.3) produces more reliable output. For creative tasks, moderate temperature (0.7–1.0) adds useful variety.
RAG and fine-tuning are complementary, not competing. Fine-tuning adapts style and format; RAG provides fresh facts. The highest-quality production systems combine both.
Prompt engineering is real engineering. Well-structured prompts with clear instructions, few-shot examples, and explicit output format constraints dramatically improve reliability โ often more than switching to a larger model.
TLDR: Summary & Key Takeaways
- LLMs process text as tokens, not characters; understanding token count matters for prompt design and cost.
- Temperature, top-p, and top-k are knobs that trade off creativity vs. predictability.
- Hallucination is a fundamental property of probabilistic generation; grounding and RAG reduce it.
- Fine-tuning adapts a general model to a domain; RAG grounds it in current facts without retraining.
- RLHF aligns model behavior with human preferences; it is the primary technique behind instruction-following models like ChatGPT.
Practice Quiz
What does "temperature = 0" mean for LLM generation?
- A) The model refuses to generate
- B) The model always picks the most probable next token, producing deterministic output
- C) The model generates very long responses
- D) The model uses only its training data
Correct Answer: B. Temperature = 0 removes all randomness; the model greedily selects the highest-probability token at every step.
What is the key advantage of RAG over fine-tuning for keeping an LLM up to date?
- A) RAG is always cheaper to run
- B) RAG retrieves current documents at inference time without retraining the model
- C) RAG improves the model's mathematical reasoning
- D) RAG increases the model's context window
Correct Answer: B. Fine-tuned knowledge is frozen at training time. RAG queries a live index so the model can answer about events that happened after its training cutoff.
What causes LLM hallucination?
- A) The context window is too small
- B) The model predicts probable-sounding tokens based on patterns, without verifying factual accuracy
- C) Temperature is set too low
- D) The prompt is too short
Correct Answer: B. LLMs are next-token predictors. They output statistically plausible sequences even when no factual grounding exists for the claim.
Related Posts
- Tokenization Explained: How LLMs Understand Text
- RAG Explained: How to Give Your LLM a Brain Upgrade
- Embeddings Explained

Written by
Abstract Algorithms
@abstractalgorithms