LLM Hallucinations: Causes, Detection, and Mitigation Strategies
Why LLMs confidently make things up, how to detect it, and the practical strategies engineers use to keep production AI grounded
Abstract Algorithms
TLDR: LLMs hallucinate because they are trained to predict the next plausible token, not the next true token. Understanding the three hallucination types (factual, faithfulness, open-domain) plus the five root causes lets you choose the right mitigation: RAG for knowledge gaps, consistency sampling for detection, system-prompt grounding for faithfulness, and NLI-based pipelines for automated verification.
The $500,000 Legal Brief That Cited Cases That Never Existed
In May 2023, New York attorney Steven Schwartz filed a legal brief in federal court on behalf of his client Roberto Mata in a case against Avianca airlines. The brief cited no fewer than six legal cases as precedents (Varghese v. China Southern Airlines, Martinez v. Delta Air Lines, and four others), complete with specific case numbers, court names, dates, and detailed legal reasoning excerpted from each ruling.
Every single one of those cases was fictional. ChatGPT had invented them.
When opposing counsel found no record of any of these precedents, the court demanded that Schwartz explain. He submitted an affidavit admitting he had used ChatGPT to help draft the brief and had not independently verified the citations. Schwartz and his firm were fined $5,000 and publicly sanctioned by Judge P. Kevin Castel.
This incident made headlines worldwide, but the same failure pattern plays out daily, quietly, across thousands of production applications:
- A customer-service chatbot at an airline told a bereaved traveler he was entitled to a bereavement discount, a policy the airline had discontinued years earlier; the LLM confidently stated it as current fact. That incident, involving Air Canada, led to a small-claims tribunal ruling that the airline was bound by what its chatbot said.
- A medical information assistant presented a user with a drug interaction warning citing a dosage threshold that had been revised in a 2021 FDA update the model had never seen. It didn't say "I'm unsure"; it stated the old dosage with the same tone of authority as verified facts.
- A coding assistant documented a Python library function with a signature that doesn't exist: the library was real and the function name plausible, but the API was invented. Developers copied the snippet, wasted hours debugging, and lost trust in the tooling.
The common thread: the model was confident when it should have been uncertain. Understanding why that happens, mechanically and structurally, is the first step to building systems that don't behave this way.
Not All Hallucinations Are Equal: A Taxonomy That Actually Matters
The word "hallucination" gets applied loosely to any LLM error, which makes it nearly impossible to fix because the root causes differ. Research in NLP has converged on three distinct types, and each needs a different remedy.
Factual Hallucinations (Extrinsic)
The model asserts something that is verifiably false: a case that doesn't exist, a drug dosage that's wrong, a statute that was never enacted. The model's output contradicts world knowledge. This is what most people mean when they say "the AI made something up."
Why it's hard: There is no ground truth in the prompt. The model is drawing entirely from its training data, and that data may be wrong, outdated, or underrepresented for this specific domain.
Faithfulness Hallucinations (Intrinsic)
The model's answer contradicts the context it was given. You provide a document, and the model's summary or answer says something the document doesn't say, or says the opposite. This is sometimes called an intrinsic hallucination because the error is internal to the provided context.
Why it's hard: The model is given all the information it needs but still drifts. This happens even in RAG systems where documents are retrieved correctly.
Open-Domain Hallucinations
The model extrapolates plausibly but incorrectly beyond what was asked or what it knows. Fictional legal cases are a perfect example: they are structurally plausible (they have all the right parts of a case citation), but the content is fabricated.
The key difference between types:
| Type | Contradicts | Detectable without external data? | Primary Fix |
| --- | --- | --- | --- |
| Factual | World knowledge | No; needs fact DB or retrieval | RAG, knowledge base grounding |
| Faithfulness | Provided context | Yes, via NLI entailment | Context-aware validation, re-prompting |
| Open-Domain | Both or neither | Partially, via consistency checks | Sampling-based detection, citation requirements |
Understanding this taxonomy before you start engineering is not pedantry: it determines whether your mitigation strategy will actually work.
How Hallucinations Show Up Before You Know to Look for Them
Before you can detect and mitigate hallucinations, you need to recognize them in the wild. The problem is that hallucinated outputs look identical to correct outputs: same fluency, same confident tone, same well-formed prose. Here are the three observable patterns that should trigger a closer look.
Pattern A: Specificity without a source. Hallucinations often come loaded with suspiciously specific detail (exact case numbers, precise percentages, specific dates) that sounds authoritative but cannot be verified. Real facts can also be specific, so this isn't definitive, but unprompted specificity in a domain where the model's training data is thin is a strong signal.
Pattern B: The plausible-but-unverifiable claim. The attorney's chatbot didn't invent random nonsense; it invented cases that had all the structural properties of real cases. When a model's output passes a "does this look right?" sniff test but fails a "does this actually exist?" lookup, you are looking at an open-domain hallucination.
Pattern C: Contextual drift in long conversations. In multi-turn conversations or long RAG documents, models gradually drift from the provided context. The first answer accurately references a policy document. Five turns later, the model is paraphrasing a version of the policy it invented, while the original document is still technically in the context window.
The practical takeaway: hallucination detection cannot be done by reading LLM outputs and asking "does this seem right?" It requires systematic external verification or automated detection pipelines, which is exactly what the rest of this post covers.
Why the Next-Token Objective Doesn't Care About Truth
To understand why hallucinations are structural rather than incidental, you have to understand what language models are actually trained to do. There are five interlocking root causes.
1. The Training Objective Is Prediction, Not Verification
A transformer language model is trained to minimize cross-entropy loss over a corpus of text: in plain language, it learns to predict what word comes next given all the words before. This is extraordinarily powerful for fluency and coherence. But "what token is most likely here?" is a completely different question from "what token is factually correct here?"
When the training data says "The capital of France is Paris," the model associates France + capital → Paris with high probability. That's correct. But the same mechanism produces "The treaty was signed in [plausible-sounding city that fits the sentence]" with equal confidence, because the training signal was fluency, not accuracy.
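A toy bigram model makes this concrete (an illustration added here, not anything resembling a real LLM): trained only on co-occurrence counts, it emits whichever continuation was most frequent in its corpus, with no representation of truth at all.

```python
from collections import Counter, defaultdict

# A tiny corpus: three true sentences and one falsehood.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is lyon",   # a falsehood in the training data
]

# Count bigram frequencies: P(next | prev) estimated from raw counts.
bigrams = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def predict_next(prev: str) -> str:
    """Return the most likely next token: plausibility, not truth."""
    return bigrams[prev].most_common(1)[0][0]

print(predict_next("is"))  # "paris": correct, but only because it was frequent
```

If the false sentence had outnumbered the true one, the same code would emit "lyon" with identical confidence; nothing in the objective distinguishes the two cases.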
2. Memorization Patterns Create Confident Misgeneralization
LLMs memorize some facts verbatim from training data (especially high-frequency facts), but they generalize from patterns for low-frequency knowledge. If a model has seen thousands of legal citations in its training corpus, it has learned the pattern of how a legal citation looks (court name, year, party names, holding summary) and can generate convincing-looking citations on demand. The pattern is real; the specific citation may not be.
3. Knowledge Cutoffs Create Silent Gaps
Models have a training cutoff date. After that date, the world changes but the model's weights don't. The model doesn't know it doesn't know about post-cutoff events; it has no mechanism for expressing "I have no data after October 2023." Instead, it applies its learned patterns to questions about recent events and generates plausible-sounding but outdated or fabricated answers.
4. Overconfident Probability Distributions
RLHF fine-tuning pushes LLMs toward sounding helpful and confident. Users rate confident answers higher than uncertain ones in feedback, which means fine-tuning inadvertently penalizes appropriate uncertainty. The result: a model that says "The recommended dose is 500mg twice daily" rather than "I'm not certain; please verify with a pharmacist."
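One partial countermeasure is to inspect per-token probabilities where the API exposes them (for example, OpenAI's `logprobs` option) and flag spans the model itself scored as uncertain. Below is a minimal sketch over a hypothetical logprobs payload; note that low token probability is only a weak proxy, since well-memorized hallucinations can still be high-probability.

```python
import math

def flag_uncertain_tokens(token_logprobs: list[tuple[str, float]],
                          prob_threshold: float = 0.5) -> list[str]:
    """Return tokens whose probability fell below the threshold.

    token_logprobs: (token, logprob) pairs, in the shape APIs that
    expose per-token log probabilities return (sample data below is
    hypothetical).
    """
    return [
        tok for tok, lp in token_logprobs
        if math.exp(lp) < prob_threshold
    ]

# Hypothetical logprobs for "The recommended dose is 500mg twice daily"
sample = [
    ("The", -0.01), ("recommended", -0.05), ("dose", -0.02),
    ("is", -0.01), ("500mg", -1.9), ("twice", -0.3), ("daily", -0.2),
]
print(flag_uncertain_tokens(sample))  # ['500mg']: the fact-bearing token was the shaky one
```

The instructive failure mode here is that the surrounding filler tokens are high-confidence; only the specific factual claim carries the uncertainty, which the fluent surface text hides from the reader.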
5. Prompt Ambiguity and Context Overload
Vague prompts force the model to fill in gaps with learned patterns. Long context windows introduce a related problem: the model can "forget" key facts from the middle of a long context (the "lost in the middle" phenomenon). Both lead to answers that drift from the ground truth in the provided context.
The following diagram shows how these causes converge into a hallucinated output at inference time.
```mermaid
flowchart TD
    A[User Query] --> B{Is answer in training data at high frequency?}
    B -->|Yes - well-memorized| C[Accurate factual recall]
    B -->|No - low frequency or post-cutoff| D[Pattern-based generation]
    D --> E{Is context provided in prompt?}
    E -->|Yes| F{Does model attend to context fully?}
    F -->|Yes| G[Faithful answer]
    F -->|No - lost in middle or drift| H[Faithfulness hallucination]
    E -->|No - open domain question| I[Extrapolation from learned patterns]
    I --> J[Plausible but potentially fabricated output]
    J --> K[RLHF confidence calibration]
    K --> L[Confident-sounding hallucinated response]
```
This diagram traces the decision path inside an LLM during inference. Starting from the user query, the model first draws on training-time memorization. For well-represented facts, this path leads to accurate recall. For low-frequency knowledge or anything post-cutoff, the model switches to pattern-based generation; if no grounding context is provided, it extrapolates freely. The final stage, RLHF confidence calibration, ensures that even fabricated answers arrive wrapped in confident language, making them indistinguishable from reliable answers without external verification.
Deep Dive: Inside the LLM's Confabulation Machinery
Understanding why hallucinations are structurally inevitable at the architecture level β not just conceptually β gives you a much stronger foundation for choosing where and how to intervene.
The Internals: Attention, Memorization, and the Embedding Space
A transformer's attention mechanism learns to weight tokens in the context window by their relevance to the current token being generated. For well-memorized facts, the model has essentially encoded the fact as a high-probability association between key tokens. When you ask "What is the boiling point of water?" the model activates strong, consistent associations from millions of training examples.
But when you ask about a specific legal ruling from 2019, the model activates weaker, less consistent associations across a sparse set of training examples. In the absence of strong memorized signal, the model does what it was trained to do: it generates the most statistically plausible continuation. If legal rulings in the training corpus typically say "The court held that X bears liability for Y," the model will produce a sentence matching that pattern, with X and Y filled in by whatever tokens maximize local probability.
This is why hallucinations are not random noise. They are structurally plausible extrapolations from learned patterns, which makes them dangerous precisely because they are difficult to distinguish from accurate outputs without external verification.
The "lost in the middle" phenomenon adds another failure mode: attention in transformer models tends to be stronger at the beginning and end of the context window. Facts buried in the middle of a long retrieved document are attended to less reliably; the model may acknowledge they exist but generate an answer that contradicts them, because the local token probability distribution from training data overrides the weaker attention signal from the buried context.
Performance Analysis: The Hallucination Tax on Production Systems
Hallucinations carry a direct operational cost that engineers often underestimate until they hit production:
Throughput cost of detection: Running NLI validation on every LLM output adds a synchronous model inference call per response. A DeBERTa-small NLI model runs in 20–50ms on a CPU instance; a larger model or GPU-based cross-encoder runs in 5–15ms. At 1,000 requests/minute, this adds 20–50 CPU-seconds of NLI compute per minute, a predictable linear cost.
Throughput cost of consistency sampling: SelfCheckGPT-style sampling multiplies your LLM API cost by the number of samples (typically 3–7). At GPT-3.5 pricing, 5 samples instead of 1 means 5× the cost per query. Reserve this for high-stakes queries; it's not suitable as a blanket policy for cost-sensitive applications.
Latency cost of RAG retrieval: Adding a vector retrieval step before generation typically adds 50–200ms of wall-clock latency (for a well-tuned vector DB with fewer than 10M embeddings). This is usually acceptable but must be measured against your P99 latency SLO before deploying.
The human review floor: For regulated domains, no amount of automated detection eliminates the need for human review of a subset of outputs. Budget 0.5–2% of queries for human spot-check in high-stakes applications. Automated detection reduces how much you send to human review, not whether you need it.
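These figures fold into a quick back-of-envelope budget check before deployment. The defaults below are illustrative midpoints of the ranges quoted above, and `cost_per_llm_call` is a placeholder, not a quoted price.

```python
def detection_budget(requests_per_min: int,
                     nli_ms: float = 35.0,
                     samples: int = 5,
                     cost_per_llm_call: float = 0.002) -> dict:
    """Back-of-envelope costs for the detection layers described above.

    nli_ms: per-output NLI inference time (midpoint of the 20-50ms range).
    samples: consistency-sampling count (typical 3-7 range).
    cost_per_llm_call: placeholder dollar cost per LLM API call.
    """
    # NLI validation: linear CPU cost per minute of traffic
    nli_cpu_seconds_per_min = requests_per_min * nli_ms / 1000.0
    # Consistency sampling: N samples instead of 1 per query
    sampling_cost_per_min = requests_per_min * samples * cost_per_llm_call
    return {
        "nli_cpu_seconds_per_min": nli_cpu_seconds_per_min,
        "sampling_cost_multiplier": samples,
        "sampling_cost_per_min_usd": round(sampling_cost_per_min, 2),
    }

print(detection_budget(requests_per_min=1000))
# At 1,000 req/min and 35ms per NLI call: 35 CPU-seconds of NLI compute per minute
```

Running this against your own traffic profile before enabling a layer tells you whether the cost is a rounding error or a budget line.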
Visualizing the Three-Layer Defense Against Hallucinations
The following diagram maps the three hallucination types from the taxonomy to their corresponding detection strategy and mitigation layer, giving a single reference view of how all the pieces fit together.
```mermaid
flowchart TD
    A[LLM Output] --> B{Hallucination Type}
    B -->|Factual - contradicts world knowledge| C[RAG Grounding Layer]
    B -->|Faithfulness - contradicts provided context| D[NLI Entailment Check]
    B -->|Open-Domain - extrapolated fabrication| E[Consistency Sampling]
    C --> F{Retrieved docs support the claim?}
    F -->|Yes| G[Serve grounded answer]
    F -->|No| H[Return: I cannot verify this]
    D --> I{Entailment score above threshold?}
    I -->|Yes - entailed| G
    I -->|No - contradiction or neutral| H
    E --> J{Consistency score above threshold?}
    J -->|Yes - stable across samples| G
    J -->|No - divergent outputs| H
```
This diagram shows how each hallucination type routes to its most effective detection layer. Factual hallucinations are best caught by checking whether a retrieved authoritative document supports the claim (RAG layer). Faithfulness hallucinations, where the answer contradicts the provided context, are best caught by NLI entailment scoring. Open-domain hallucinations with no external reference are caught by consistency sampling. All three paths converge on the same binary outcome: serve the grounded answer or return a safe fallback. The key engineering decision is which layer(s) to activate based on your application's query types and risk profile.
Detecting and Mitigating Hallucinations: Three Runnable Python Patterns
Let's move from theory to code. The three patterns below cover the main engineering approaches: cross-encoder-based similarity detection, RAG-based mitigation with a simple vector store, and SelfCheckGPT-style consistency sampling.
Pattern 1: Cross-Encoder Similarity Detection
This approach compares an LLM's answer against a known-good reference document. A cross-encoder (rather than a bi-encoder) jointly encodes both texts and produces a calibrated similarity score, making it more sensitive to semantic contradiction than cosine similarity.
```python
"""
Pattern 1: Detect potential hallucinations using a cross-encoder.
Requires: pip install sentence-transformers
"""
from sentence_transformers import CrossEncoder

# Load a cross-encoder fine-tuned on NLI (Natural Language Inference).
# This model classifies: contradiction / entailment / neutral
model = CrossEncoder("cross-encoder/nli-deberta-v3-small")


def detect_hallucination(reference_text: str, model_answer: str) -> dict:
    """
    Returns a dict with the NLI label and confidence scores.
    'contradiction' indicates likely hallucination.
    'entailment' indicates the answer is grounded in the reference.
    """
    scores = model.predict([(reference_text, model_answer)])
    # Label order for this model: 0=contradiction, 1=entailment, 2=neutral
    label_map = {0: "contradiction", 1: "entailment", 2: "neutral"}
    label_idx = int(scores[0].argmax())
    return {
        "label": label_map[label_idx],
        "scores": {
            "contradiction": float(scores[0][0]),
            "entailment": float(scores[0][1]),
            "neutral": float(scores[0][2]),
        },
        "hallucination_risk": (
            "HIGH" if label_idx == 0
            else "LOW" if label_idx == 1
            else "MEDIUM"
        ),
    }


# Example usage
reference = (
    "Air Canada's bereavement fare discount policy was discontinued in 2014. "
    "The airline no longer offers reduced fares for customers traveling due to a family death."
)
hallucinated_answer = (
    "Air Canada offers a bereavement discount of up to 50% for passengers "
    "who need to travel urgently due to a family member's death."
)
grounded_answer = (
    "Air Canada does not currently offer bereavement fare discounts. "
    "This policy was discontinued in 2014."
)
print("=== Hallucinated Answer ===")
print(detect_hallucination(reference, hallucinated_answer))
print("\n=== Grounded Answer ===")
print(detect_hallucination(reference, grounded_answer))
```
Pattern 2: RAG-Based Mitigation with a Vector Store
The simplest reliable mitigation is retrieval-augmented generation: fetch the ground-truth document first, inject it into the prompt, and instruct the model to only answer from the provided context. This pattern uses chromadb for local vector storage and sentence-transformers for embeddings.
```python
"""
Pattern 2: RAG mitigation: ground the LLM answer in retrieved documents.
Requires: pip install chromadb sentence-transformers openai
"""
import chromadb
from chromadb.utils import embedding_functions
import openai

# Set up a local ChromaDB collection
client = chromadb.Client()
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    name="policy_docs",
    embedding_function=ef,
)

# Index your ground-truth documents
documents = [
    "Air Canada's bereavement fare discount policy was discontinued in 2014. "
    "The airline does not offer reduced fares for bereavement travel.",
    "Air Canada's refund policy allows full refunds within 24 hours of booking "
    "for tickets purchased at least 7 days before departure.",
    "Air Canada Altitude status can be earned on flights operated by Air Canada, "
    "Air Canada Rouge, and select Star Alliance partners.",
]
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
)


def rag_answer(user_query: str, n_results: int = 2) -> str:
    """Retrieve relevant docs and generate a grounded answer."""
    # Step 1: Retrieve
    results = collection.query(query_texts=[user_query], n_results=n_results)
    context_chunks = "\n".join(results["documents"][0])
    # Step 2: Augment and generate
    prompt = f"""You are a helpful airline assistant. Answer the user's question
ONLY using the information in the provided context.
If the context does not contain enough information to answer,
say "I don't have that information; please contact Air Canada directly."

Context:
{context_chunks}

Question: {user_query}
Answer:"""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # Low temperature reduces creative extrapolation
    )
    return response.choices[0].message.content


# Example
query = "Does Air Canada offer a discount for bereavement travel?"
print(rag_answer(query))
# Expected: Grounded answer stating the policy was discontinued
```
Pattern 3: SelfCheckGPT-Style Consistency Sampling
The key insight of SelfCheckGPT (Manakul et al., 2023) is that factual claims will be stated consistently across multiple independent samples from the same model, while hallucinations will vary or contradict across samples. This approach requires no external knowledge base.
```python
"""
Pattern 3: Consistency sampling for hallucination detection (SelfCheckGPT-style).
Requires: pip install openai sentence-transformers scikit-learn
"""
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import openai


def sample_responses(prompt: str, n_samples: int = 5) -> list[str]:
    """Generate multiple independent responses at high temperature."""
    responses = []
    for _ in range(n_samples):
        resp = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # High temperature increases variance
        )
        responses.append(resp.choices[0].message.content)
    return responses


def consistency_score(responses: list[str]) -> dict:
    """
    Compute pairwise semantic similarity across all sampled responses.
    High consistency (score near 1.0) suggests a factual claim.
    Low consistency (score near 0.0) suggests a hallucination.
    """
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(responses)
    # Compute all pairwise cosine similarities
    sim_matrix = cosine_similarity(embeddings)
    # Exclude the diagonal (self-similarity is always 1.0)
    n = len(responses)
    off_diagonal = [
        sim_matrix[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    mean_sim = float(np.mean(off_diagonal))
    return {
        "mean_consistency": round(mean_sim, 3),
        "hallucination_risk": (
            "LOW" if mean_sim > 0.85
            else "MEDIUM" if mean_sim > 0.65
            else "HIGH"
        ),
        "sample_count": n,
    }


# Example: factual question (should be consistent)
factual_prompt = "What is the capital of France?"
factual_responses = sample_responses(factual_prompt)
print("Factual question consistency:", consistency_score(factual_responses))

# Example: hallucination-prone question (will be inconsistent)
hallucination_prompt = (
    "Cite three specific court cases from 2022 where airlines were held liable "
    "for chatbot misinformation in the United States."
)
hallucination_responses = sample_responses(hallucination_prompt)
print("Hallucination-prone question consistency:", consistency_score(hallucination_responses))
```
Worked Example: From Hallucinated Claim to Grounded Answer
This section traces a single query through the full detection-and-mitigation pipeline to make the abstract concrete.
Scenario: A legal research assistant is asked about a precedent case.
Step 1. Raw LLM response (no grounding):
Prompt: "What did the court decide in Henderson v. United Airlines, 9th Cir. 2021?"
LLM Answer: "In Henderson v. United Airlines, the Ninth Circuit held that airlines bear strict liability for chatbot-generated misinformation provided to passengers, establishing a precedent under 49 U.S.C. Β§ 40101 that electronic communications from airline systems carry the same legal weight as written contracts."
The case name sounds plausible. The statute number is real (it's the general aviation purposes statute, though it doesn't establish the described liability). The ruling is invented.
Step 2. Consistency sampling detects the problem:
Running Pattern 3 on this prompt with five samples produces five completely different "rulings": one says strict liability, one says contributory negligence, one says the case was dismissed on standing. Mean consistency score: 0.31, HIGH hallucination risk.
Step 3. NLI detection against a legal database excerpt:
A cross-encoder run against the actual Ninth Circuit docket (which contains no record of this case) produces label: contradiction, hallucination_risk: HIGH.
Step 4. Mitigation via RAG + citation requirement:
The same query, re-run through the RAG pipeline with a verified legal database, returns:
"I was unable to find a case called Henderson v. United Airlines in the Ninth Circuit records for 2021. If you have a citation number, I can search more specifically. Please verify case citations through Westlaw or LexisNexis before including them in any legal filing."
This is the correct outcome: honest uncertainty rather than confident fabrication.
Detection Techniques in Depth
Detection approaches span a spectrum from cheap-and-approximate to expensive-and-reliable.
Consistency Sampling (Self-Consistency Check)
Sample the same query multiple times at elevated temperature. If the model gives consistent answers across all samples, confidence in factual accuracy is higher. If answers diverge significantly, the claim is likely hallucinated. Best for: factual claims where no external reference is available. Limitation: multiplies API cost by sample count; high temperature may introduce noise on genuinely ambiguous questions.
NLI-Based Entailment Checking
Feed the (context, answer) pair to a Natural Language Inference model. NLI models are trained to classify whether a hypothesis is entailed, neutral, or contradicted by a premise. When the answer is a hypothesis and the retrieved document is the premise, a contradiction score signals a faithfulness hallucination. Best for: RAG pipelines where retrieved context exists. Models to use: cross-encoder/nli-deberta-v3-small, facebook/bart-large-mnli.
Fact Verification Pipelines
Decompose the LLM output into atomic claims, then verify each claim against a structured knowledge base or via a dedicated fact-checking model. Google's SAFE (Search-Augmented Factuality Evaluator) does this by issuing Google Search queries for each claim and checking agreement. Best for: high-stakes production systems (medical, legal). Limitation: expensive per query; requires a reliable knowledge base.
Retrieval-Augmented Verification (RAV)
The inverse of RAG: instead of retrieving before generation, you retrieve after generation to verify. Take the model's output, extract key claims, retrieve evidence for each claim, and score entailment. RAGAS's faithfulness metric works this way.
Mitigation Strategies That Actually Work in Production
Retrieval-Augmented Generation (RAG)
Inject verified documents into the prompt before generation. The model answers from the provided context rather than parametric memory. This is the single most effective mitigation for factual hallucinations in domain-specific applications.
Key implementation detail: Include an explicit instruction like "If the answer is not in the provided context, say you don't know." Without this, models will use the context as a starting point but still extrapolate beyond it.
System Prompt Grounding
Instruct the model explicitly to cite sources, express uncertainty, and refuse to speculate. Phrases like "Only state facts you are highly confident in. If uncertain, say so explicitly." reduce hallucinations meaningfully, but don't eliminate them. Use as a layer, not as a primary defense.
Chain-of-Thought with Citations
Ask the model to reason step-by-step and cite the source for each claim. "For each fact you state, identify whether it comes from the provided context or from your training data, and flag training-data claims as unverified." This surface-level citation check is fast and catches many faithfulness hallucinations.
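If you adopt a flagging convention like that, the flags become machine-checkable. The sketch below assumes the model was instructed to append a literal "[unverified]" marker to training-data claims; the marker is an illustrative convention invented for this example, not a standard.

```python
import re

def extract_unverified(answer: str) -> list[str]:
    """Pull out sentences the model itself flagged as [unverified].

    Assumes the prompt instructed the model to append the literal
    marker "[unverified]" inside each training-data claim (a
    hypothetical convention for this sketch).
    """
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", answer)
    return [s for s in sentences if "[unverified]" in s]

answer = (
    "The policy was discontinued in 2014 [unverified]. "
    "The refund window is 24 hours per the provided context."
)
print(extract_unverified(answer))
# ['The policy was discontinued in 2014 [unverified].']
```

Flagged sentences can then be routed to the heavier checks (NLI, retrieval lookup) while context-sourced sentences pass through cheaply.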
Fine-Tuning on Factual Data with Uncertainty Labels
Fine-tune the model on examples that include "I don't know" or "I'm not certain" responses for out-of-distribution questions. This calibrates the model to express appropriate uncertainty rather than defaulting to confident extrapolation. Effective but expensive: it requires curated datasets and GPU time.
Output Validation Pipelines
Run model outputs through a validation layer before serving to users. This layer can include NLI scoring, consistency checks, or keyword-based heuristics (e.g., block responses that contain the words "In [year]," followed by a date beyond the knowledge cutoff). Pragmatic, fast, but brittle; good as a last line of defense.
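As a concrete instance of such a heuristic, here is a sketch of the year-based gate just described, assuming a 2023 knowledge cutoff. It is crude by design: a last line of defense, not a fact check.

```python
import re

CUTOFF_YEAR = 2023  # assumed knowledge cutoff for this sketch

def violates_cutoff(text: str, cutoff: int = CUTOFF_YEAR) -> bool:
    """Flag outputs asserting events in years past the knowledge cutoff.

    Matches the "In [year]," pattern described in the text; deliberately
    narrow so false positives stay rare.
    """
    years = re.findall(r"\bIn (\d{4}),", text)
    return any(int(y) > cutoff for y in years)

print(violates_cutoff("In 2025, the FAA revised the rule."))   # True: blocked
print(violates_cutoff("In 2019, the court ruled otherwise."))  # False: passes
```

A blocked response would route to the same safe fallback as a failed NLI check, rather than reaching the user.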
The diagram below shows how these mitigations layer in a production LLM application.
```mermaid
flowchart LR
    A[User Query] --> B[RAG Retrieval Layer]
    B --> C[Augmented Prompt with Context]
    C --> D[LLM Generation]
    D --> E[Output Validation Layer]
    E --> F{NLI Score above threshold?}
    F -->|Yes - grounded| G[Serve Answer to User]
    F -->|No - contradiction detected| H[Fallback: I cannot verify this]
    D --> I[Consistency Sampling Check]
    I --> J{High consistency across samples?}
    J -->|Yes| G
    J -->|No - divergent outputs| H
```
This diagram shows a defense-in-depth architecture. Retrieval happens before generation (RAG), then the generated output is passed through two parallel validation checks: NLI entailment scoring against the retrieved context, and consistency sampling to catch open-domain hallucinations that have no external reference to check against. Only answers that pass both gates reach the user; everything else routes to a safe fallback message.
How Production Systems at Scale Handle Hallucination
Google Search Generative Experience (SGE) and Gemini
Google built source citations directly into its generative search answers, with inline links to the specific documents that ground each claim. Every factual sentence is anchored to a retrieved source, and the UI makes this visible to users so they can verify claims independently. The engineering insight: transparency about provenance is itself a mitigation. It shifts verification responsibility to the user for low-stakes claims while making fabrication visible.
Perplexity AI
Perplexity's entire UX is built around grounded generation: every answer includes numbered citations, and the system uses real-time web search retrieval before generation. It also distinguishes between "I found this in a search result" and inferences the model makes beyond retrieved content. The architecture is essentially a RAG pipeline with aggressive citation requirements baked into the system prompt.
Bing AI (Microsoft Copilot)
Microsoft added a "grounded generation" mode after early hallucination incidents. The system explicitly presents a "learning from web" indicator and includes sourced citations. Copilot also applies consistency checks across retrieved sources: if two top-ranked results contradict each other, it flags the disagreement rather than arbitrarily picking one.
Enterprise Deployments
In regulated industries (finance, healthcare, legal), enterprises add a human-in-the-loop review gate for any output that scores below a confidence threshold on automated checks. High-stakes outputs (a medical recommendation, a legal filing, a financial advice paragraph) never reach users without a domain expert reviewing the flagged sections. This is expensive but necessary given current model reliability in high-stakes domains.
Trade-offs: The Honest Engineering Calculus
No mitigation strategy is free. Here is what you actually pay for each:
| Strategy | Latency Impact | Cost | Reliability Gain | Best Fit |
| --- | --- | --- | --- | --- |
| RAG | +100–400ms retrieval | Low (vector DB + embedding) | High for domain-specific | Customer service, knowledge bases |
| Consistency Sampling | +N × API call time | High (5× LLM cost) | Medium for factual | High-stakes Q&A, no KB available |
| NLI Validation | +30–80ms | Low (small model) | High for faithfulness | Any RAG pipeline |
| Fine-tuning | 0ms at inference | Very High (training compute) | High, but model-specific | Repeated high-stakes use case |
| System Prompt Grounding | 0ms | Zero | Low–Medium (unreliable alone) | Always, as baseline layer |
| Human Review | Minutes to hours | Very High (labor) | Very High | Medical, legal, financial |
The key insight engineers miss: These strategies are not mutually exclusive. The most robust production systems stack them. System prompt grounding is free, so always apply it. RAG eliminates the largest class of factual hallucinations for domain-specific apps. NLI validation is cheap and catches faithfulness errors that RAG doesn't prevent. Consistency sampling is reserved for queries where no KB exists and the cost can be justified by the stakes.
Choosing Your Mitigation Strategy: A Use-Case Decision Guide
| Use Case | Hallucination Risk | Recommended Strategy | Avoid |
| --- | --- | --- | --- |
| Customer service chatbot with internal KB | Medium – knowledge gaps | RAG + NLI validation | Relying on base model alone |
| Medical information assistant | Very High – life-safety stakes | RAG + human review + fine-tuning | Consistency sampling as sole check |
| Legal research assistant | Very High – professional liability | RAG + citation requirements + human review | Open-ended generation without grounding |
| General-purpose Q&A (no KB) | High – unbounded domain | Consistency sampling + system-prompt uncertainty instructions | Serving single-sample outputs directly |
| Code generation assistant | Medium – functional errors | Output testing (run the code) + NLI on comments | Over-relying on NLI for code verification |
| Summarization of provided documents | Low–Medium – faithfulness errors | NLI validation against source document | No validation at all |
LangChain and RAGAS: Hallucination Evaluation in Practice
RAGAS (Retrieval-Augmented Generation Assessment) is the most widely adopted open-source library for evaluating LLM pipelines on hallucination-related metrics. It was built specifically to measure the two properties that matter most in RAG systems: faithfulness (does the answer stay within the retrieved context?) and answer relevancy (does the answer address the actual question?).
RAGAS computes faithfulness by decomposing the answer into atomic claims, then checking each claim's entailment against the retrieved context using an LLM-as-judge approach. A faithfulness score of 1.0 means every claim in the answer is fully supported by the context; a score below 0.7 is a red flag that the model is extrapolating beyond what was retrieved.
"""
RAGAS evaluation: measure faithfulness and answer relevancy of a RAG pipeline.
Requires: pip install ragas langchain-openai datasets
"""
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset
# Evaluation dataset: questions, answers, contexts, and ground truths
data = {
"question": [
"Does Air Canada offer bereavement fares?",
"What is Air Canada's refund policy?",
],
"answer": [
# Answer 1: hallucinated β contradicts the context
"Yes, Air Canada offers a 50% bereavement discount for immediate family.",
# Answer 2: grounded β faithful to the context
"Air Canada allows full refunds within 24 hours of booking for tickets "
"purchased at least 7 days before departure.",
],
"contexts": [
[
"Air Canada's bereavement fare discount policy was discontinued in 2014. "
"The airline does not offer reduced fares for bereavement travel."
],
[
"Air Canada's refund policy allows full refunds within 24 hours of booking "
"for tickets purchased at least 7 days before departure."
],
],
"ground_truth": [
"Air Canada does not offer bereavement fares as the policy was discontinued in 2014.",
"Full refunds are available within 24 hours of booking for tickets booked 7+ days before departure.",
],
}
dataset = Dataset.from_dict(data)
# Run RAGAS evaluation
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy],
)
print(results)
# Faithful answer scores near 1.0; hallucinated answer scores near 0.0 on faithfulness
The key output to watch is per-row faithfulness. A hallucinated answer that contradicts the provided context should score below 0.3; a grounded, accurate answer should score above 0.85. Use this as your quality gate in a CI/CD pipeline: if the faithfulness score drops below threshold after a model update, block the release.
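A gate like that can be a few lines in your evaluation job. A minimal sketch, assuming you have already extracted per-row faithfulness scores into a list (the 0.7 threshold is this article's red-flag line, not a RAGAS default):

```python
def faithfulness_gate(row_scores, threshold=0.7):
    """Raise if any evaluation row falls below the faithfulness threshold."""
    failures = [(i, s) for i, s in enumerate(row_scores) if s < threshold]
    if failures:
        detail = ", ".join(f"row {i}: {s:.2f}" for i, s in failures)
        raise RuntimeError(f"Faithfulness gate failed ({detail}), blocking release")
    return True

faithfulness_gate([0.91, 0.88])  # grounded answers pass the gate
```

Wire the raised error into your CI system's failure handling so a regression in any single row blocks the deploy, not just a drop in the average.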
For a full deep-dive on RAGAS and other LLM evaluation frameworks, see LLM Evaluation Frameworks: RAGAS, DeepEval, TruLens.
What Engineers Get Wrong When Fighting Hallucinations
Mistake 1: Treating all LLM errors as hallucinations.
Not every wrong answer is a hallucination. If a model misunderstands an ambiguous prompt and answers a different question than you asked, that's a prompt design problem, not a hallucination. If the model reasons incorrectly through a math problem, that's a reasoning failure. Conflating these prevents you from applying the right fix.
Mistake 2: Ignoring the faithfulness vs. factual distinction.
Engineers who add RAG and declare "hallucinations solved" often miss that faithfulness hallucinations persist even after perfect retrieval. The model has the right document in context and still says something the document doesn't say. Always add NLI-based validation on top of RAG: retrieval accuracy and answer faithfulness are independently testable properties.
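The claim-level shape of that validation can be sketched as follows. Both helpers are assumptions for illustration: `split_claims` is a naive sentence splitter (real systems use an LLM to decompose claims), and `toy_nli` stands in for an actual NLI model:

```python
import re

def split_claims(answer):
    # Naive decomposition: treat each sentence as one atomic claim
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def unsupported_claims(answer, context, nli_label):
    """Return claims the NLI model does not judge as entailed by the context."""
    return [claim for claim in split_claims(answer)
            if nli_label(premise=context, hypothesis=claim) != "entailment"]

context = "The bereavement policy was discontinued in 2014."
answer = "The policy was discontinued in 2014. A 50% discount still applies."
# Toy stand-in for an NLI model, hard-coded to this example:
toy_nli = lambda premise, hypothesis: (
    "entailment" if "discontinued" in hypothesis else "contradiction")
flagged = unsupported_claims(answer, context, toy_nli)
```

Any claim in `flagged` is a faithfulness violation: the retrieval was correct, but the answer drifted beyond it.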
Mistake 3: Using temperature = 0 as a hallucination fix.
Lowering temperature makes outputs deterministic, not accurate. A model at temperature 0.0 will confidently return the same wrong answer every single time, with perfect repeatability. Temperature controls variety, not factuality.
Mistake 4: Optimizing for BLEU / ROUGE without checking faithfulness.
BLEU and ROUGE measure lexical overlap with reference answers. A model that fabricates plausible-sounding text that shares tokens with a reference answer can score well on both metrics while hallucinating the actual substance. Use semantic metrics (BERTScore, faithfulness via RAGAS) in addition to, not instead of, lexical metrics.
Mistake 5: Expecting fine-tuning to eliminate hallucinations permanently.
Fine-tuning on domain data reduces hallucinations for in-distribution queries. But it also shifts the distribution of what the model memorizes, which can introduce new hallucinations at the margins of the new training data. Fine-tuning is a starting point, not a terminal solution. Ongoing monitoring is non-negotiable.
TLDR & Key Takeaways
- Hallucinations are structural, not accidental. They arise directly from the next-token prediction objective, which maximizes fluency rather than factual accuracy.
- There are three types that need different fixes: factual hallucinations (fix with RAG), faithfulness hallucinations (fix with NLI validation), and open-domain hallucinations (detect with consistency sampling).
- RAG is the most cost-effective mitigation for domain-specific applications, but it must be paired with an explicit "answer only from context" instruction and NLI validation to catch faithfulness drift.
- Consistency sampling detects hallucinations without a knowledge base by exploiting the fact that grounded claims are stable across samples while fabricated ones are not.
- RAGAS gives you a quantitative faithfulness score that can gate CI/CD pipelines: block releases when faithfulness drops below threshold.
- Temperature 0 ≠ no hallucinations. Determinism and accuracy are orthogonal properties.
- Defense in depth wins: layer system prompt grounding (free) + RAG (high ROI) + NLI validation (cheap) + human review for high-stakes outputs.
- The Air Canada chatbot incident demonstrated that courts will hold companies responsible for what their LLM systems assert; engineering for hallucination safety is a legal and business requirement, not just a quality preference.
Practice Quiz
A RAG pipeline retrieves the correct document for a user query, but the LLM's answer includes a fact that is not present in the retrieved document. What type of hallucination is this?
- A) Factual hallucination
- B) Faithfulness hallucination
- C) Retrieval failure
- D) Prompt injection

Correct Answer: B
You are building a legal research assistant where hallucinated case citations carry serious professional liability. Which mitigation stack is most appropriate?
- A) System prompt grounding only β low cost and fast
- B) Temperature = 0 to ensure deterministic outputs
- C) RAG from a verified legal database + citation requirements + human review for flagged outputs
- D) Consistency sampling with 3 samples as the sole check

Correct Answer: C
The SelfCheckGPT consistency sampling approach works by:
- A) Comparing the LLM output to a fine-tuned fact-checking model
- B) Running multiple samples at high temperature and measuring semantic similarity across them
- C) Retrieving documents before generation and checking entailment after
- D) Filtering outputs using a BLEU score threshold against reference answers

Correct Answer: B
Open-ended: Your team is deploying a medical information chatbot and has decided to use RAGAS faithfulness scores as an automated quality gate in the CI/CD pipeline. A colleague argues that a faithfulness score above 0.8 is sufficient and no human review is needed for any output. What are the failure modes in this approach, and how would you push back?