
LLM Terms You Should Know: A Helpful Glossary

A dictionary for the language of Large Language Models. This guide decodes the essential jargon, from Attention to Zero-Shot.

Abstract Algorithms · 13 min read

TLDR: The world of LLMs has its own dense vocabulary. This post is your decoder ring, covering foundation terms (tokens, context window), generation settings (temperature, top-p), safety concepts (hallucination, grounding), and architecture terms (attention, fine-tuning, RAG).


📖 Why LLM Jargon Matters (and How to Learn It Fast)

You're reading a paper about GPT-4 and hit "temperature", "top-p", "logits" in the same paragraph. None are defined. This glossary exists for that moment.

Example (reading an API config): A colleague sends you { "temperature": 0.2, "top_p": 0.9, "max_tokens": 512 }. Right now that might look like noise. By the end of this post you'll know that temperature: 0.2 means "stay focused, don't improvise", top_p: 0.9 means "sample only from the tokens that make up the top 90% of probability mass", and max_tokens: 512 caps the response at roughly 380 words.

But the field comes with a steep vocabulary. Understanding the terms unlocks the ability to make better product decisions, debug unexpected behavior, and communicate clearly with AI practitioners.

This glossary is organized by topic: foundation, generation, safety, architecture, and deployment.


๐Ÿ” Foundation Terms: Tokens, Context, and Embeddings

Token The basic unit of text an LLM processes. A token is roughly 4 characters or ¾ of a word in English. "Hello world" ≈ 2–3 tokens. LLMs read and output text as sequences of tokens, not characters.

Context Window The maximum number of tokens an LLM can process in a single interaction (prompt and response combined). GPT-4 has a 128k-token context window (≈ 100,000 words). Requests longer than the context window are truncated.
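As a rough sanity check, you can budget a request with the ~4-characters-per-token heuristic. This is an estimate only; production code should count with the model's real tokenizer:

```python
# Rough context-window budget check using the ~4-chars-per-token heuristic.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_response_tokens: int,
                 context_window: int = 128_000) -> bool:
    # Prompt and response share the same window, so budget both together.
    return estimate_tokens(prompt) + max_response_tokens <= context_window

prompt = "Summarize this report. " * 100
print(fits_context(prompt, max_response_tokens=512))  # True for this short prompt
```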

Embedding A dense vector (list of numbers) that represents the meaning of a piece of text in high-dimensional space. Semantically similar phrases have similar embeddings. Used for search, similarity, and retrieval.

Prompt The input text you provide to an LLM. Everything from instructions to examples to context goes in the prompt.

Completion The LLM's output in response to a prompt. Also called the generation or response.

| Term | Simple definition |
| --- | --- |
| Token | ~4 chars; building block of text for LLMs |
| Context window | Max total tokens the model can see at once |
| Embedding | Number vector representing text meaning |
| Prompt | Input text to the model |
| Completion | Output text from the model |

📊 LLM Key Concepts Map

flowchart TD
    LLM[Large Language Model] --> TK[Tokenization]
    LLM --> EM[Embeddings]
    LLM --> AT[Attention]
    LLM --> FT[Fine-tuning]
    LLM --> RL[RLHF]
    RL --> IT[Instruction Tuning]

โš™๏ธ Generation Controls: Temperature, Top-p, and Top-k

These settings control how "creative" vs "deterministic" the model's output is.

Temperature Scales the probability distribution before sampling. Higher = more random, more creative; lower = more focused, more predictable.

  • temperature = 0.0 → always picks the most likely token (deterministic)
  • temperature = 1.0 → samples according to the raw probability distribution
  • temperature > 1.0 → more random; occasional incoherence
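To make "scales the probability distribution" concrete, here is a minimal pure-Python sketch of temperature applied to raw logits before softmax (the logit values are toy numbers, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]          # raw scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot  = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])  # mass concentrates on the top token
print([round(p, 3) for p in hot])   # mass spreads across all tokens
```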

Top-p (nucleus sampling) Only sample from the smallest set of tokens whose cumulative probability exceeds p. top_p = 0.9 means "only consider tokens whose combined probability mass is in the top 90%." Ignores low-probability tokens dynamically.

Top-k Only sample from the top k most probable next tokens. top_k = 40 means always pick from the 40 highest-probability candidates.
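The two filters above can be sketched in a few lines of pure Python. The probabilities are toy values; real implementations apply these cutoffs to logits over a vocabulary of tens of thousands of tokens:

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability is at least p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in ranked:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    kept = [pr if i in keep else 0.0 for i, pr in enumerate(probs)]
    total = sum(kept)
    return [pr / total for pr in kept]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))   # only the two best tokens survive
print(top_p_filter(probs, 0.9)) # 0.5 + 0.3 < 0.9, so a third token is needed
```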

graph LR
    A[Logits from model] --> B[Apply Temperature scaling]
    B --> C{Sampling method}
    C -->|Top-k| D[Keep top k tokens]
    C -->|Top-p| E[Keep tokens until cumulative prob = p]
    D --> F[Sample final token]
    E --> F
| Setting | Low value effect | High value effect |
| --- | --- | --- |
| Temperature | Deterministic, repetitive | Creative, unpredictable |
| Top-p (0–1) | Narrow, safe choices | Broader, more varied |
| Top-k | Very focused | More exploratory |

🔒 Safety and Quality Terms: Hallucination, Grounding, and Bias

Hallucination When an LLM confidently generates false or nonsensical information. The model predicts probable-sounding tokens without checking facts.

  • Example: Ask "Who was the 12th US President in 1750?" and the model invents a name.
  • Mitigation: Grounding, RAG, source attribution, user education.

Grounding Connecting model output to verifiable sources or real-world data to reduce hallucinations.

  • Example: A customer-support bot that only answers from the company's help-center documents.
  • Method: Retrieval-Augmented Generation (RAG): retrieve relevant documents and pass them in the prompt.
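A minimal sketch of the prompt-assembly step in a grounded system, assuming retrieval has already returned the relevant chunks (the instruction wording below is illustrative, not a canonical template):

```python
def build_grounded_prompt(question, retrieved_docs):
    # Inject retrieved passages so the model answers only from them.
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(retrieved_docs))
    return (
        "Answer using ONLY the documents below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = ["Refunds are available within 30 days of purchase."]
print(build_grounded_prompt("What is the refund window?", docs))
```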

Bias Systematic errors or prejudices inherited from training data.

  • Example: A model trained on older text may associate "Doctor" with male pronouns.
  • Mitigation: Diverse training data, debiasing techniques, output auditing.

Guardrails Rules, filters, or model layers that prevent the LLM from producing harmful, inappropriate, or policy-violating content. Implemented via system prompts, output classifiers, or constitutional AI techniques.
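As an illustration only, a naive keyword-based output guardrail might look like the sketch below; production guardrails use trained classifiers or model-based checks rather than string matching:

```python
def simple_guardrail(response, blocked_topics):
    # Naive output filter: block responses mentioning disallowed topics.
    lowered = response.lower()
    for topic in blocked_topics:
        if topic in lowered:
            return "I can't help with that request."
    return response

print(simple_guardrail("Here is how to reset your password...",
                       ["credit card number"]))
```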


🧠 Architecture Deep Dive: Attention, Fine-Tuning, and RAG

Attention (Self-Attention) The core mechanism of Transformer models. Each token attends to every other token in the context window, weighted by relevance. This is what allows LLMs to handle long-range dependencies.
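A pure-Python sketch of scaled dot-product attention for a single query vector. Real Transformers batch this as matrix multiplications over learned query/key/value projections; the vectors below are toy values:

```python
import math

def attention(query, keys, values):
    # Scaled dot-product attention for one query over a short sequence.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over relevance scores
    # Output is the relevance-weighted mix of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # the first key matches the query
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, keys, values))    # output leans toward the first value
```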

Fine-Tuning Additional training of a pre-trained model on a smaller, task-specific dataset to improve performance for a specific domain or task format.

  • Pre-training: Train on the internet (general language).
  • Fine-tuning: Train on medical records → a medical LLM.
  • RLHF (Reinforcement Learning from Human Feedback): Fine-tune using human preference rankings to align the model with desired behavior.

RAG (Retrieval-Augmented Generation) A technique that retrieves relevant documents from an external knowledge base and injects them into the prompt before generation. Solves the knowledge-cutoff problem without retraining.

Zero-Shot / Few-Shot

  • Zero-shot: No examples provided; just ask directly.
  • Few-shot: Provide 2–5 examples in the prompt to guide format and style.
  • Chain-of-thought: Ask the model to reason step-by-step before giving a final answer.
| Architecture term | Key idea |
| --- | --- |
| Attention | Each token relates to all others; enables context comprehension |
| Fine-tuning | Adapt a pre-trained model to specific tasks |
| RLHF | Align model behavior using human preference data |
| RAG | Inject retrieved documents to ground generation in current facts |
| Zero-shot | Task without examples |
| Few-shot | Task with 2–5 in-prompt examples |
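The zero-shot vs. few-shot distinction is easiest to see as prompt strings. The review examples below are invented for illustration:

```python
# Zero-shot: ask directly, no examples.
zero_shot = "Classify the sentiment of: 'The battery died after an hour.'"

# Few-shot: a couple of in-prompt examples teach the format and label set.
few_shot = (
    "Classify the sentiment as positive or negative.\n\n"
    "Review: 'Love this phone!' -> positive\n"
    "Review: 'Screen cracked on day one.' -> negative\n\n"
    "Review: 'The battery died after an hour.' ->"
)
print(few_shot)
```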

📊 RAG vs Fine-tuning

flowchart LR
    NW[New Knowledge] --> D{Update method?}
    D -- Static --> FT[Fine-tune model]
    D -- Dynamic --> RAG[RAG retrieval]
    FT --> MO[Model weights updated]
    RAG --> CT[Context injected]

📊 How an LLM Processes a Request

Understanding the data flow from your prompt to the model's response helps demystify every term in this glossary. Each step maps to one or more of the concepts defined above.

graph TD
    A[User types a Prompt] --> B[Tokenizer splits prompt into Tokens]
    B --> C[Token IDs fed into LLM within Context Window]
    C --> D[Self-Attention layers compute relationships between all tokens]
    D --> E[Logits output โ€” raw scores for every possible next token]
    E --> F{Sampling settings applied}
    F -->|Temperature + Top-p/Top-k| G[Next token sampled]
    G --> H{End of sequence?}
    H -->|No| C
    H -->|Yes| I[Completion returned to user]

This loop (tokenize, attend, sample, repeat) is the heartbeat of every LLM interaction. Parameters like temperature and top-p only affect step F; everything else is deterministic given the same weights.

Embedding retrieval in a RAG system adds a branch before step A: the query is embedded into a vector, the nearest document chunks are retrieved from a vector store, and those chunks are prepended to the prompt. The rest of the loop is unchanged.


๐ŸŒ Real-World Applications: Deployment Terms: Inference, Quantization, and Latency

Inference The process of generating output from a trained model given an input. This is what happens at serving time (as opposed to training time).

Quantization Reducing model precision (e.g., from 32-bit floats to 8-bit integers) to shrink model size and speed up inference, at a small accuracy cost. Essential for running large models on constrained hardware.
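A toy sketch of symmetric int8 quantization, mapping a handful of float weights onto integers in [-127, 127]. Real quantizers work per-layer or per-channel over millions of weights, but the round-trip error shown here is the same idea:

```python
def quantize_int8(weights):
    # Symmetric int8 quantization: map floats in [-max, max] to [-127, 127].
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.42, -1.37, 0.08, 0.91]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
print(q)        # small integers instead of 32-bit floats
print(restored) # close to the originals, with small rounding error
```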

Latency vs. Throughput

  • Latency: Time from prompt submission to first token received (Time To First Token, TTFT).
  • Throughput: Tokens generated per second across all parallel requests.
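A sketch of how you might measure both numbers against a streaming API, using a stand-in generator in place of a real token stream:

```python
import time

def measure_stream(token_iter):
    # Latency = time to first token (TTFT); throughput = tokens per second.
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return first, count / total

def fake_stream(n=20, delay=0.005):
    # Stand-in for a real streaming API; each yield is one generated token.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tokens/s")
```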

Prompt Engineering The practice of designing prompts to reliably elicit the desired behavior from an LLM, without changing model weights.


โš–๏ธ Trade-offs & Failure Modes: LLM Generation Settings

| Setting | Risk when too low | Risk when too high |
| --- | --- | --- |
| Temperature | Repetitive, loops on same phrases | Incoherent, factually unreliable output |
| Context window usage | Critical context gets truncated | Higher inference cost and latency |
| Top-k / Top-p | Over-constrained, stiff responses | Noise from low-probability tokens |

Hallucination is the dominant failure mode across all LLM deployments. Grounding via RAG or source-constrained system prompts is the primary mitigation. Fine-tuning reduces hallucination in narrow domains but does not eliminate it.


🧭 Quick Reference: Term → Concept in One Line

| Term | One-line definition |
| --- | --- |
| Token | Unit of text an LLM processes (~4 chars) |
| Context window | Max tokens, input + output combined |
| Temperature | Controls randomness of generation |
| Top-p | Nucleus sampling: limits to top cumulative-probability tokens |
| Hallucination | Model confidently generates false information |
| Grounding | Connecting output to verifiable sources |
| Fine-tuning | Adapting a pre-trained model to a specific task |
| RAG | Retrieving documents to inject into the prompt for factual grounding |
| RLHF | Training with human preference rankings |
| Quantization | Reducing numeric precision to shrink model size |


🧪 Hands-On: Spot the Term in the Wild

Reading definitions is step one. Recognizing terms in real product documentation and error messages is step two. Work through each scenario below and identify the relevant concept.

Scenario 1: You query GPT-4 with a 150,000-word document and receive an error saying the input exceeds the model limit. Which term applies? Context window: the document exceeds the 128k-token context limit. Solution: chunk the document and use RAG to retrieve only the relevant sections.

Scenario 2: A chatbot confidently tells a user that a product was launched in 2023, but the product actually launched in 2024. Which term applies? Hallucination: the model generated a plausible-sounding but incorrect date. Mitigation: ground the bot with a RAG pipeline pointing to current product documentation.

Scenario 3: You set temperature=0 for a code generation task and notice the model produces identical output on every run. Which term applies? Temperature: setting it to zero makes sampling deterministic; the model always picks the highest-probability next token.

Scenario 4: Your team wants to deploy a 70B parameter model on a laptop GPU with 8 GB VRAM. Which technique is required? Quantization: reducing weights from 32-bit floats to 4-bit integers (GGUF/GGML format) makes this feasible while preserving most model quality.

Scenario 5: A developer says "We fine-tuned the model on our support tickets but it still hallucinates product specs." What would you recommend? RAG: fine-tuning teaches style and format but does not reliably inject factual data. A RAG pipeline with an up-to-date product spec index directly addresses this failure mode.


๐Ÿ› ๏ธ HuggingFace Transformers: Seeing Every LLM Term in Running Code

HuggingFace Transformers is the dominant open-source library for working with LLMs in Python. It exposes tokenizers, generation configs, embedding models, and inference pipelines through a unified API, making it the fastest way to see every term in this glossary in action with real, runnable code.

The library demonstrates all the key concepts from this post concretely: you can inspect tokens, tune temperature/top-p, measure context window usage, and run embedding similarity in under 30 lines of code.

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sentence_transformers import SentenceTransformer
import torch
import torch.nn.functional as F

# ── Token: how text becomes integers ──────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization converts text to token IDs."
tokens = tokenizer(text)
print("Token IDs:  ", tokens["input_ids"])
# → [30642, 1634, 4578, 2420, 284, 11241, 32373, 13]
print("Token count:", len(tokens["input_ids"]))  # → 8
print("Decoded: ", [tokenizer.decode([t]) for t in tokens["input_ids"]])
# → ['Token', 'ization', ' converts', ' text', ' to', ' token', ' IDs', '.']

# ── Context Window: checking whether a prompt fits ────────────────────────────
MAX_TOKENS = 1024   # GPT-2's context window
long_text = "word " * 500  # 500 repeated words; encodes to roughly 500 tokens
encoded_len = len(tokenizer(long_text)["input_ids"])
print(f"Prompt fits in context: {encoded_len < MAX_TOKENS}")  # → True

# ── Temperature + Top-p + Top-k: controlling generation ───────────────────────
model = AutoModelForCausalLM.from_pretrained("gpt2")

gen_config_focused = GenerationConfig(
    max_new_tokens=30,
    temperature=0.2,    # low → focused, near-deterministic
    top_p=0.9,          # nucleus sampling: keep top 90% probability mass
    top_k=40,           # also limit to top 40 token candidates
    do_sample=True,     # sampling params are ignored unless this is True
    pad_token_id=50256, # GPT-2 has no pad token; reuse EOS to silence the warning
)

gen_config_creative = GenerationConfig(
    max_new_tokens=30,
    temperature=1.1,    # high → creative, more surprising
    top_p=0.95,
    do_sample=True,
    pad_token_id=50256,
)

prompt = tokenizer("The best database for caching is", return_tensors="pt")
with torch.no_grad():
    focused_ids  = model.generate(**prompt, generation_config=gen_config_focused)
    creative_ids = model.generate(**prompt, generation_config=gen_config_creative)

print("Focused:  ", tokenizer.decode(focused_ids[0],  skip_special_tokens=True))
print("Creative: ", tokenizer.decode(creative_ids[0], skip_special_tokens=True))

# ── Embedding + Cosine Similarity (Grounding / RAG basis) ─────────────────────
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

query = "What database is good for caching?"
docs  = [
    "Redis is an in-memory key-value store used for caching.",
    "PostgreSQL is a relational database for ACID transactions.",
    "MongoDB is a document store for flexible schemas.",
]

query_vec = torch.tensor(embed_model.encode(query))
doc_vecs  = torch.tensor(embed_model.encode(docs))

similarities = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs)
best_idx = similarities.argmax().item()
print(f"\nMost relevant doc (sim={similarities[best_idx]:.3f}): {docs[best_idx]}")
# → Most relevant doc (sim≈0.72): Redis is an in-memory key-value store...

Every term in this glossary (token, context window, temperature, top-p, embedding, cosine similarity) appears as a named variable or parameter in this snippet. The GenerationConfig object is the recommended way to set sampling parameters in HuggingFace; note that temperature, top_p, and top_k only take effect when do_sample=True, and are ignored under greedy decoding.

For a full deep-dive on HuggingFace Transformers for inference and fine-tuning, a dedicated follow-up post is planned.


📚 Key Lessons from LLM Vocabulary

Learning these terms correctly prevents costly mistakes when building AI products.

Tokens are the unit of cost, not words. API pricing is per token. A 10,000-word document is roughly 13,000 tokens in English. In Japanese or Korean the same semantic content may be 2–3× more tokens. Always count tokens before estimating bills.
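A back-of-the-envelope cost estimate using the ~¾-word-per-token rule; the price used below is hypothetical, so check your provider's actual rates:

```python
def estimate_cost(word_count, price_per_1k_tokens, tokens_per_word=4 / 3):
    # ~1.33 tokens per English word, per the ~4-chars / ~3/4-word rule of thumb.
    tokens = word_count * tokens_per_word
    return tokens / 1000 * price_per_1k_tokens

# A 10,000-word document at a hypothetical $0.01 per 1k input tokens:
print(f"${estimate_cost(10_000, 0.01):.2f}")  # ≈ 13,333 tokens -> about $0.13
```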

Hallucination is a feature in the wrong place, not a bug. The same probabilistic generation that produces creative fiction also invents fake citations. For factual tasks, always apply grounding via RAG or strict output validation.

Temperature is not quality. Higher temperature does not mean better responses. For coding, math, and factual retrieval, low temperature (0–0.3) produces more reliable output. For creative tasks, moderate temperature (0.7–1.0) adds useful variety.

RAG and fine-tuning are complementary, not competing. Fine-tuning adapts style and format; RAG provides fresh facts. The highest-quality production systems combine both.

Prompt engineering is real engineering. Well-structured prompts with clear instructions, few-shot examples, and explicit output format constraints dramatically improve reliability, often more than switching to a larger model.


📌 TLDR: Summary & Key Takeaways

  • LLMs process text as tokens, not characters; understanding token count matters for prompt design and cost.
  • Temperature, top-p, and top-k are knobs that trade off creativity vs. predictability.
  • Hallucination is a fundamental property of probabilistic generation; grounding and RAG reduce it.
  • Fine-tuning adapts a general model to a domain; RAG grounds it in current facts without retraining.
  • RLHF aligns model behavior with human preferences; it is the primary technique behind instruction-following models like ChatGPT.

๐Ÿ“ Practice Quiz

  1. What does "temperature = 0" mean for LLM generation?

    • A) The model refuses to generate
    • B) The model always picks the most probable next token, producing deterministic output
    • C) The model generates very long responses
    • D) The model uses only its training data

    Correct Answer: B. Temperature = 0 removes all randomness; the model greedily selects the highest-probability token at every step.

  2. What is the key advantage of RAG over fine-tuning for keeping an LLM up to date?

    • A) RAG is always cheaper to run
    • B) RAG retrieves current documents at inference time without retraining the model
    • C) RAG improves the model's mathematical reasoning
    • D) RAG increases the model's context window

    Correct Answer: B. Fine-tuning knowledge is frozen at training time. RAG queries a live index so the model can answer about events that happened after its training cutoff.

  3. What causes LLM hallucination?

    • A) The context window is too small
    • B) The model predicts probable-sounding tokens based on patterns, without verifying factual accuracy
    • C) Temperature is set too low
    • D) The prompt is too short

    Correct Answer: B. LLMs are next-token predictors. They output statistically plausible sequences even when no factual grounding exists for the claim.



Tags

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms