LLM Observability: Tracing, Logging, and Debugging Production AI Systems
How to Monitor Non-Deterministic AI Systems with LangSmith, OpenTelemetry, and LangFuse
TLDR: LLM observability is radically different from traditional APM: non-deterministic outputs, variable token costs, and multi-step reasoning chains require specialized tracing. LangSmith provides native LangChain integration, OpenTelemetry offers standardization, and LangFuse delivers open-source flexibility. The key: instrument every prompt, capture every token, and alert on cost spikes before they hit your budget.
Your LLM app passed all your evals. In production, 15% of users are getting hallucinated answers and your costs are 3x what you budgeted. You have no idea why because you have no visibility into what the model is actually doing.
Welcome to the unique hell of debugging non-deterministic systems. Unlike traditional web services where HTTP 500 errors are binary failures, LLM applications fail quietly with plausible-sounding nonsense. A user asks for "Python sorting algorithms" and gets a perfectly formatted response about JavaScript promises. Your metrics show 200 OK, but your users get garbage.
This is why traditional Application Performance Monitoring (APM) tools fall short with LLMs. You need specialized observability that captures prompts, tokens, reasoning chains, and the probabilistic nature of AI outputs.
The LLM Observability Challenge: Why Traditional APM Falls Short
Traditional monitoring assumes deterministic systems. Give the same input, get the same output, measure latency and error rates. LLMs shatter these assumptions:
Non-deterministic outputs mean the same prompt can produce different responses based on temperature settings, model state, or even cosmic rays. You can't just compare response strings to detect anomalies.
Variable token costs make every request different. A simple question might cost 100 tokens, while follow-up reasoning explodes to 10,000 tokens. Your cost per request varies by 100x based on user behavior.
Multi-step reasoning chains create complex execution flows. An agent might query a database, call three APIs, perform web searches, and synthesize results. Traditional request tracing captures the HTTP calls but misses the reasoning steps.
Context stuffing happens silently. Your RAG system retrieves 50 documents, but only uses 3. You're paying for 47 irrelevant chunks without knowing it.
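One way to surface this silently wasted spend is to log a context-utilization ratio per request: what fraction of retrieved chunks the final answer actually drew on. A minimal sketch — how you decide a document was "used" (e.g. inline citations) depends on your pipeline, and the IDs here are illustrative:

```python
def context_utilization(retrieved_ids: list, cited_ids: list) -> float:
    """Fraction of retrieved documents that the final answer actually used."""
    if not retrieved_ids:
        return 0.0
    used = set(retrieved_ids) & set(cited_ids)
    return len(used) / len(retrieved_ids)

# 50 chunks retrieved, only 3 cited by the answer -> 6% utilization
rate = context_utilization(list(range(50)), [3, 17, 42])
```

Emitting this ratio as a trace attribute lets you alert when you are consistently paying for context the model ignores.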
Prompt drift occurs as dynamic templates change based on user data. The same logical query generates different prompts, making it impossible to track performance over time.
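Drift becomes trackable if every trace is tagged with a fingerprint of the template rather than the rendered prompt, so the same logical query groups together across users. A minimal sketch, assuming your templates use `{placeholder}` substitution (the reverse-substitution trick here is a heuristic, not a standard API):

```python
import hashlib

def template_fingerprint(rendered_prompt: str, variables: dict) -> str:
    """Recover the template shape by undoing variable substitution, then hash it.

    Grouping traces by this fingerprint lets you compare performance of the
    same logical prompt even as user data changes the rendered text.
    """
    template = rendered_prompt
    for name, value in variables.items():
        template = template.replace(str(value), "{" + name + "}")
    return hashlib.sha256(template.encode()).hexdigest()[:12]

# Two different users, same logical prompt -> same fingerprint
a = template_fingerprint("Answer this question: What is RAG?",
                         {"question": "What is RAG?"})
b = template_fingerprint("Answer this question: How do tries work?",
                         {"question": "How do tries work?"})
```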
Here's what traditional metrics miss:
| Traditional APM | LLM Reality |
| --- | --- |
| Response time: 200ms | Time-to-first-token vs. total generation time |
| Error rate: 2% HTTP 500s | Hallucination rate: 15% plausible but wrong |
| Throughput: 1000 RPS | Token throughput: variable by 100x per request |
| Cost: predictable per request | Cost: $0.001 to $1.00+ per request |
The Three Pillars of LLM Observability
LLM observability rests on three pillars that extend traditional monitoring:
Traces: Full Prompt-to-Response Journeys
Unlike HTTP request traces, LLM traces capture:
- Prompt templates with variable substitution
- Context retrieval and document ranking
- Multi-step reasoning chains in agents
- Tool usage and API calls made by the model
- Output parsing and validation steps
Each trace shows the complete decision path, not just network hops.
Metrics: Token-Aware Performance Indicators
Key metrics specific to LLMs:
| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| Token usage per request | Cost control | 90th percentile > 5000 tokens |
| Time-to-first-token (TTFT) | User experience | > 2 seconds |
| Total latency | End-to-end experience | > 30 seconds |
| Cost per session | Budget burn rate | > $0.50 per session |
| Hallucination rate | Quality degradation | > 5% of responses |
| Context utilization | Efficiency | < 30% of retrieved docs used |
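The thresholds above can be wired into a simple alert check. A sketch — the metric names and limits mirror the table, and delivery to a real alerting system (PagerDuty, Slack) is left out:

```python
# Alert thresholds from the metrics table; "gt"/"lt" is the breach direction
THRESHOLDS = {
    "p90_tokens_per_request": ("gt", 5000),
    "ttft_seconds": ("gt", 2.0),
    "total_latency_seconds": ("gt", 30.0),
    "cost_per_session_usd": ("gt", 0.50),
    "hallucination_rate": ("gt", 0.05),
    "context_utilization": ("lt", 0.30),
}

def check_thresholds(metrics: dict) -> list:
    """Return the names of metrics that breach their alert threshold."""
    alerts = []
    for name, (direction, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (direction == "gt" and value > limit) or \
           (direction == "lt" and value < limit):
            alerts.append(name)
    return alerts

alerts = check_thresholds({
    "ttft_seconds": 3.1,          # breaches: > 2.0
    "hallucination_rate": 0.02,   # fine
    "context_utilization": 0.25,  # breaches: < 0.30
})
```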
Logs: Structured Events with LLM Context
Traditional logs capture HTTP status codes. LLM logs need:
- Prompt construction events
- Model selection and routing decisions
- Context retrieval results with relevance scores
- Tool execution outcomes
- Output validation failures
Each log entry includes the full conversational context to enable debugging.
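In practice that means emitting one structured event per pipeline step instead of free-text log lines, so observability tools can query by field. A minimal sketch using the standard `logging` module with a JSON payload (the field names are illustrative, not a schema):

```python
import json
import logging

logger = logging.getLogger("llm.events")

def log_retrieval_event(trace_id: str, query: str, docs: list) -> dict:
    """Emit a structured context-retrieval event that downstream tools can query."""
    event = {
        "event": "context_retrieval",
        "trace_id": trace_id,
        "query": query,
        "num_docs": len(docs),
        "relevance_scores": [d["score"] for d in docs],
    }
    # One JSON object per line keeps the log machine-parseable
    logger.info(json.dumps(event))
    return event

event = log_retrieval_event("tr_123", "refund policy",
                            [{"score": 0.82}, {"score": 0.44}])
```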
How LangSmith Traces Every Token and Decision in Your AI Pipeline
LangSmith is LangChain's native observability platform, designed specifically for LLM applications. It captures the full execution graph of chains, agents, and tools with zero configuration.
Automatic Instrumentation
LangSmith automatically traces all LangChain components:
import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langsmith import traceable
# Set LangSmith API key - no other config needed
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_key"
# This chain is automatically traced
llm = ChatOpenAI(temperature=0.7)
prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer this question: {question}"
)
chain = LLMChain(llm=llm, prompt=prompt)
# Every invoke() call creates a trace
response = chain.invoke({"question": "What is quantum computing?"})
LangSmith captures:
- Prompt template with variables
- Token counts (input and output)
- Model parameters (temperature, max_tokens)
- Execution time (TTFT and total)
- Response content and metadata
Manual Instrumentation with @traceable
For custom functions outside LangChain:
from langchain.chat_models import ChatOpenAI
from langsmith import traceable
import requests
@traceable
def fetch_context_documents(query: str, top_k: int = 5) -> list:
    """Retrieve relevant documents for RAG context."""
    # Custom retrieval logic: send the payload as JSON, not form data
    response = requests.post(
        "http://vector-db:8000/search",
        json={"query": query, "limit": top_k},
    )
    docs = response.json()["documents"]
    # LangSmith automatically captures inputs, outputs, and metadata
    return docs
@traceable
def synthesize_response(question: str, context_docs: list) -> str:
"""Generate response using retrieved context."""
context = "\n".join([doc["content"] for doc in context_docs])
prompt = f"""
Context: {context}
Question: {question}
Answer based only on the context provided.
"""
# This LLM call is also traced automatically
llm = ChatOpenAI(temperature=0.1)
response = llm.invoke(prompt)
return response.content
# Usage creates nested traces
def answer_question(user_question: str):
docs = fetch_context_documents(user_question)
answer = synthesize_response(user_question, docs)
return answer
RunTree for Complex Workflows
For maximum control over tracing:
from langsmith import Client
from langsmith.run_trees import RunTree
client = Client()
def process_user_query(question: str, user_id: str):
    # Create the parent run; RunTree objects are built up, ended, then posted
    parent_run = RunTree(
        name="user_query_processing",
        run_type="chain",
        inputs={"question": question, "user_id": user_id},
        client=client
    )

    # Step 1: Query classification
    classify_run = parent_run.create_child(
        name="classify_intent",
        run_type="llm",
        inputs={"question": question}
    )
    intent = classify_query_intent(question)
    classify_run.end(outputs={"intent": intent})

    # Step 2: Context retrieval
    retrieval_run = parent_run.create_child(
        name="retrieve_context",
        run_type="retriever",
        inputs={"question": question, "intent": intent}
    )
    docs = fetch_relevant_docs(question, intent)
    retrieval_run.end(outputs={
        "num_docs": len(docs),
        "relevance_scores": [d["score"] for d in docs]
    })

    # Step 3: Response generation
    generation_run = parent_run.create_child(
        name="generate_response",
        run_type="llm",
        inputs={"question": question}
    )
    response = generate_answer(question, docs)
    generation_run.end(outputs={"response": response})

    parent_run.end(outputs={"final_answer": response})
    parent_run.post(exclude_child_runs=False)  # upload the whole tree
    return response
LangSmith's dashboard shows the complete execution tree with timing, costs, and intermediate outputs for every step.
Deep Dive: OpenTelemetry for Standardized LLM Instrumentation
The Internals
OpenTelemetry (OTel) provides vendor-neutral observability for LLM applications through standardized spans and metrics. The LLM semantic conventions define specific attributes for AI workloads:
Span Structure:
operation.name: "llm.completion"
llm.vendor: "openai"
llm.model.name: "gpt-4"
llm.model.version: "2024-02-15-preview"
llm.temperature: 0.7
llm.max_tokens: 1000
llm.token_count.prompt: 156
llm.token_count.completion: 89
llm.latency.time_to_first_token: 1.2s
Memory Layout: OTel keeps active traces in memory as linked span objects, each holding attributes, events, and links to parent/child spans. Finished spans are handed to a processor that batches and exports them to backends like Jaeger or Datadog.
State Management: The tracer maintains active span context using thread-local storage or async context variables, enabling automatic parent-child relationships across async operations.
Performance Analysis
Time Complexity: O(1) for span creation and attribute setting. The tracer uses hash tables for span lookup and linked lists for span relationships.
Space Complexity: O(n) where n is the number of active spans. Each span stores ~1KB of metadata plus variable attribute data.
Bottlenecks:
- Span export becomes the limiting factor at scale (1000+ spans/second)
- Attribute serialization for complex objects (embeddings, large prompts)
- Network I/O to observability backends during export batches
Mitigation: Use async exporters, compress span data, and sample high-volume operations.
Visualizing LLM Request Flows with Distributed Tracing
LLM applications create complex execution graphs that traditional request tracing can't capture. Here's how a typical RAG agent execution looks with proper instrumentation:
graph TD
A[User Question] --> B[Intent Classification]
B --> C[Vector Search]
B --> D[Knowledge Graph Query]
C --> E[Document Ranking]
D --> F[Entity Resolution]
E --> G[Context Assembly]
F --> G
G --> H[Prompt Construction]
H --> I[LLM Generation]
I --> J[Output Validation]
J --> K{Valid Response?}
K -->|Yes| L[Response Formatting]
K -->|No| M[Fallback Generation]
M --> N[Secondary LLM Call]
N --> L
L --> O[User Response]
style B fill:#e1f5fe
style C fill:#e8f5e8
style D fill:#e8f5e8
style I fill:#fff3e0
style N fill:#fff3e0
style J fill:#fce4ec
Each node in this graph becomes a span in your distributed trace, with the following key attributes:
| Span Type | Key Attributes | Example Values |
| --- | --- | --- |
| Intent Classification | llm.prompt, llm.tokens.input, classification.confidence | "Classify: 'What is RAG?'", 23, 0.94 |
| Vector Search | search.query, search.results.count, search.similarity.threshold | "RAG retrieval", 15, 0.7 |
| LLM Generation | llm.model, llm.tokens.total, llm.cost.usd | "gpt-4-turbo", 1247, $0.031 |
| Output Validation | validation.rules, validation.passed, validation.errors | ["factual", "relevant"], true, [] |
Real-World Debugging: From Prompt Drift to Cost Explosions
Case Study 1: The Hallucinating Support Bot
Situation: A customer support chatbot started giving incorrect refund policies, causing compliance issues.
Investigation with LangSmith:
- Trace Analysis: Filtered traces by "refund" keyword, found 23% contained hallucinated policy details
- Prompt Inspection: Discovered the knowledge base retrieval was returning outdated documents
- Context Quality: Document relevance scores averaged 0.4 (below 0.7 threshold)
Root Cause: The vector database hadn't been updated with new policies, so context retrieval failed silently.
Solution: Added context quality monitoring and alerts when average relevance drops below 0.6.
@traceable
def validate_context_quality(retrieved_docs: list, threshold: float = 0.6):
    if not retrieved_docs:
        return 0.0  # guard: empty retrieval is the worst case, not a crash
    avg_score = sum(doc["relevance_score"] for doc in retrieved_docs) / len(retrieved_docs)
    if avg_score < threshold:
        # Log warning and trigger alert
        logger.warning(f"Context quality below threshold: {avg_score:.2f}")
        send_slack_alert(f"RAG context quality degraded to {avg_score:.2f}")
    return avg_score
Case Study 2: Token Cost Explosion
Situation: Daily LLM costs jumped from $500 to $3,000 overnight with no change in user volume.
Investigation with OpenTelemetry Metrics:
# Cost tracking with OTel metrics
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
cost_counter = meter.create_counter(
"llm_cost_total_usd",
description="Total LLM API costs"
)
token_histogram = meter.create_histogram(
"llm_tokens_per_request",
description="Token usage distribution"
)
def track_llm_costs(model_name: str, input_tokens: int, output_tokens: int):
# OpenAI GPT-4 pricing: $0.03/1K input, $0.06/1K output
input_cost = (input_tokens / 1000) * 0.03
output_cost = (output_tokens / 1000) * 0.06
total_cost = input_cost + output_cost
# Record metrics with attributes
cost_counter.add(total_cost, {"model": model_name, "cost_type": "api_usage"})
token_histogram.record(input_tokens + output_tokens, {"model": model_name})
Root Cause Analysis: The 95th percentile token usage jumped from 2,000 to 15,000 tokens per request. Tracing revealed that a code generation feature was including entire codebases as context.
Solution: Implemented token budget limits and smart context truncation:
def truncate_context_by_budget(documents: list, token_budget: int = 4000):
"""Keep only the most relevant docs within token budget."""
sorted_docs = sorted(documents, key=lambda x: x["relevance_score"], reverse=True)
total_tokens = 0
selected_docs = []
for doc in sorted_docs:
doc_tokens = count_tokens(doc["content"])
if total_tokens + doc_tokens <= token_budget:
selected_docs.append(doc)
total_tokens += doc_tokens
else:
break
return selected_docs, total_tokens
Trade-offs: Native Tools vs. Open Standards vs. Cost
Performance vs. Vendor Lock-in
LangSmith provides the richest LLM-specific features but locks you into the LangChain ecosystem:
Pros:
- Zero-config automatic instrumentation for LangChain
- LLM-specific UI for prompt debugging and dataset comparison
- Built-in evaluation workflows and A/B testing
Cons:
- Vendor lock-in to LangChain architecture
- Additional cost on top of LLM API fees
- Limited customization of trace data structure
OpenTelemetry offers vendor neutrality but requires more setup:
Pros:
- Send traces to any backend (DataDog, New Relic, Grafana)
- Standardized LLM semantic conventions
- No vendor lock-in, full control over data
Cons:
- Manual instrumentation for non-standard LLM workflows
- Generic observability UIs lack LLM-specific debugging features
- More complex setup and configuration
Correctness vs. Cost Trade-offs
Comprehensive tracing captures every prompt and response, but storage costs scale with volume:
| Trace Sampling Rate | Monthly Cost (100K requests) | Debug Capability |
| --- | --- | --- |
| 100% (all traces) | $200-500 | Full debugging |
| 10% (sample) | $20-50 | Statistical analysis only |
| 1% (errors + sample) | $5-15 | Error debugging only |
Failure Modes:
- Sampling bias: Critical edge cases missed in sampled data
- Storage explosion: Full prompt/response traces consume 10x more space than HTTP logs
- Privacy leaks: Traces contain user data and proprietary prompts
Mitigation Strategies:
# Adaptive sampling based on error rates and costs
def get_sampling_rate(error_rate: float, daily_cost: float) -> float:
if error_rate > 0.05: # High error rate
return 1.0 # Sample everything
elif daily_cost > 1000: # High cost
return 0.1 # Sample 10%
else:
return 0.05 # Standard 5% sampling
Decision Guide: Choosing Your LLM Observability Stack
| Situation | Recommendation |
| --- | --- |
| Use LangSmith when | Building with LangChain, need rapid prototyping, team < 10 developers, budget allows vendor tooling |
| Use OpenTelemetry when | Multi-vendor setup, existing OTel infrastructure, need custom trace backends, compliance requirements |
| Use LangFuse when | Self-hosted requirements, open-source preference, custom evaluation metrics, cost optimization priority |
| Alternative approaches | DataDog APM + custom LLM metrics, Grafana + Prometheus for metrics-only, ELK stack for log-centric debugging |
| Edge cases | High-security environments (air-gapped deployments), extreme scale (1M+ requests/day), multi-cloud architectures |
Practical Examples: Instrumenting a LangChain Agent with LangSmith
Example 1: Customer Support Agent with Tool Usage
import os
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.tools import Tool
from langsmith import traceable
import requests
# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_key"
os.environ["LANGCHAIN_PROJECT"] = "customer-support-agent"
@traceable
def search_knowledge_base(query: str) -> str:
    """Search internal knowledge base for support articles."""
    # Send the payload as JSON, not form data
    response = requests.post(
        "http://kb-api:8000/search",
        json={"query": query, "max_results": 3},
    )
    articles = response.json()["articles"]
    # LangSmith captures this function's inputs/outputs automatically
    return "\n".join([f"Article: {a['title']}\n{a['content']}" for a in articles])
@traceable
def get_order_status(order_id: str) -> str:
"""Fetch current order status from order management system."""
response = requests.get(f"http://orders-api:8000/orders/{order_id}")
order = response.json()
return f"Order {order_id}: Status = {order['status']}, Expected delivery = {order['delivery_date']}"
# Define tools for the agent
tools = [
Tool(
name="search_knowledge_base",
description="Search support articles and documentation",
func=search_knowledge_base
),
Tool(
name="get_order_status",
description="Get current status of customer orders by order ID",
func=get_order_status
)
]
# Create the agent
llm = ChatOpenAI(temperature=0.1, model="gpt-4")
from langchain.prompts import MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful customer support agent. Use the available tools to help customers.
Available tools:
- search_knowledge_base: Find relevant support articles
- get_order_status: Check order status by order ID
Always be polite and provide specific, actionable information."""),
    ("human", "{input}"),
    # The agent scratchpad must be a messages placeholder, not a plain string variable
    MessagesPlaceholder(variable_name="agent_scratchpad")
])
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# Usage - all steps automatically traced in LangSmith
# (traceable is a decorator, not a context manager; function args become trace inputs)
@traceable(name="support_request")
def handle_support_request(customer_message: str, customer_id: str):
    # customer_id is recorded as a trace input alongside the message
    response = agent_executor.invoke({"input": customer_message})
    return response["output"]
# Example usage
result = handle_support_request(
"Hi, I need help with my order #12345. It was supposed to arrive yesterday.",
customer_id="cust_789"
)
print(result)
LangSmith Dashboard Output:
- Agent Execution Trace: Shows the reasoning steps, tool calls, and final response
- Tool Usage Metrics: Number of API calls, response times, success rates
- Token Consumption: Tracks tokens used in system prompts, tool descriptions, and responses
- Cost Attribution: Breaks down costs by tool usage, model calls, and customer session
Example 2: RAG System with Context Quality Monitoring
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langsmith import traceable
import numpy as np
class ObservableRAG:
def __init__(self, vector_store_path: str):
self.embeddings = OpenAIEmbeddings()
self.vectorstore = Chroma(
persist_directory=vector_store_path,
embedding_function=self.embeddings
)
self.llm = ChatOpenAI(temperature=0.1, model="gpt-4")
@traceable
def retrieve_with_quality_check(self, query: str, k: int = 5) -> dict:
"""Retrieve documents and assess context quality."""
        # Get similarity search results with scores
        # NOTE: Chroma returns distances here (lower = closer), not similarities;
        # normalize or invert before comparing against similarity thresholds
        docs_with_scores = self.vectorstore.similarity_search_with_score(query, k=k)
# Calculate context quality metrics
scores = [score for _, score in docs_with_scores]
avg_similarity = np.mean(scores)
min_similarity = min(scores)
score_variance = np.var(scores)
docs = [doc for doc, _ in docs_with_scores]
# Log quality metrics for monitoring
context_quality = {
"avg_similarity": avg_similarity,
"min_similarity": min_similarity,
"score_variance": score_variance,
"num_docs_retrieved": len(docs),
"query_length": len(query.split())
}
# Alert on poor context quality
if avg_similarity < 0.6:
self._log_quality_alert("Low average similarity", context_quality)
if min_similarity < 0.3:
self._log_quality_alert("Very low minimum similarity", context_quality)
return {
"documents": docs,
"quality_metrics": context_quality,
"similarity_scores": scores
}
@traceable
def _log_quality_alert(self, alert_type: str, metrics: dict):
"""Log context quality alerts for monitoring."""
print(f"ALERT: {alert_type} - Metrics: {metrics}")
# In production, send to monitoring system
@traceable
def generate_answer(self, query: str) -> dict:
"""Generate answer with full observability."""
# Step 1: Retrieve and assess context
retrieval_result = self.retrieve_with_quality_check(query)
docs = retrieval_result["documents"]
quality = retrieval_result["quality_metrics"]
# Step 2: Build context-aware prompt
context = "\n\n".join([doc.page_content for doc in docs])
prompt = f"""
Use the following context to answer the question. If the context doesn't contain
relevant information, say so explicitly.
Context:
{context}
Question: {query}
Answer:
"""
# Step 3: Generate response
response = self.llm.invoke(prompt)
# Step 4: Return with full metadata
return {
"answer": response.content,
"context_quality": quality,
"num_context_chars": len(context),
"source_documents": len(docs)
}
# Usage with automatic LangSmith tracing
rag = ObservableRAG("./chroma_db")
result = rag.generate_answer("What are the key principles of microservices architecture?")
print(f"Answer: {result['answer']}")
print(f"Context Quality: {result['context_quality']}")
Monitoring Dashboard Insights:
- Context Quality Trends: Track average similarity scores over time to detect knowledge base drift
- Retrieval Performance: Monitor query types that produce low-quality contexts
- Cost Per Question: Understand token usage patterns by question complexity
- Answer Quality Correlation: Compare context quality with user satisfaction ratings
LangFuse: Open-Source Alternative for Self-Hosted Observability
LangFuse provides comprehensive LLM observability without vendor lock-in, offering features comparable to LangSmith with full self-hosting capabilities.
Key Features:
- Distributed tracing for multi-step LLM workflows
- Cost tracking with granular token-level attribution
- A/B testing and evaluation workflows
- Dataset management for prompt engineering
- Team collaboration with shared dashboards
Self-Hosted Setup:
# Install LangFuse SDK
# pip install langfuse
import os
from langfuse import Langfuse
from langfuse.decorators import observe
# Initialize self-hosted LangFuse instance
langfuse = Langfuse(
host="https://your-langfuse-instance.com",
public_key="pk_your_public_key",
secret_key="sk_your_secret_key"
)
@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4") -> str:
"""LLM call with automatic LangFuse tracing."""
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.1
)
# LangFuse automatically captures:
# - Input prompt
# - Model parameters
# - Token usage
# - Response content
# - Timing information
return response.choices[0].message.content
@observe()  # defaults to a span; as_type is only needed for generations
def process_user_query(query: str) -> dict:
"""Main processing pipeline with nested tracing."""
# Step 1: Intent classification
intent_prompt = f"Classify this user query: {query}"
intent = call_llm(intent_prompt, model="gpt-3.5-turbo")
# Step 2: Generate response based on intent
if "technical" in intent.lower():
response_prompt = f"Provide a technical answer to: {query}"
response = call_llm(response_prompt, model="gpt-4")
else:
response_prompt = f"Provide a simple answer to: {query}"
response = call_llm(response_prompt, model="gpt-3.5-turbo")
return {
"intent": intent,
"response": response,
"model_used": "gpt-4" if "technical" in intent.lower() else "gpt-3.5-turbo"
}
# Usage creates nested traces in LangFuse
result = process_user_query("How does neural attention work in transformers?")
Cost Analysis Dashboard:
# Custom cost tracking with LangFuse events
@observe()
def track_session_cost(user_id: str, session_id: str, total_tokens: int, model: str):
"""Track costs at session level for budgeting."""
    # Model pricing in USD per 1K tokens
pricing = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
}
# Estimate cost (simplified)
cost_per_1k = pricing[model]["input"] # Assume mostly input tokens
estimated_cost = (total_tokens / 1000) * cost_per_1k
# Log to LangFuse for cost analytics
    langfuse.event(
        name="session_cost_tracking",
        metadata={
            "user_id": user_id,
            "session_id": session_id,
            "total_tokens": total_tokens,
            "model": model,
            "estimated_cost_usd": estimated_cost
        }
    )
return estimated_cost
LangFuse vs. LangSmith Comparison:
| Feature | LangFuse | LangSmith |
| --- | --- | --- |
| Hosting | Self-hosted or cloud | LangChain cloud only |
| Cost | Free (self-hosted) + infrastructure | $20-200/month per project |
| Data Control | Full control, on-premise | Data stored with LangChain |
| LangChain Integration | Manual instrumentation | Automatic tracing |
| Custom Metrics | Full flexibility | Predefined LLM metrics |
| Team Features | Open-source collaboration | Built-in team management |
Lessons Learned from Production LLM Monitoring
Key Insights from Real Deployments
Token Budget Monitoring is Critical: One startup burned through their entire monthly OpenAI budget in 3 days because a recursive agent got stuck in a reasoning loop. Always implement token limits per session.
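A per-session budget can be as simple as a counter that raises before the next model call, which would have stopped that runaway loop after a few iterations. A minimal sketch (the limits are illustrative):

```python
class TokenBudgetExceeded(Exception):
    pass

class SessionTokenBudget:
    """Hard cap on the tokens a single session may consume across all model calls."""

    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Call after every model response with its reported token usage;
        # the exception stops the loop before the next call is made
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"session used {self.used} tokens (limit {self.max_tokens})"
            )

budget = SessionTokenBudget(max_tokens=10_000)
budget.charge(4_000)   # within budget
try:
    budget.charge(7_000)  # pushes past the limit
    tripped = False
except TokenBudgetExceeded:
    tripped = True
```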
Context Quality Degrades Silently: Vector databases can drift as content changes, but retrieval still returns "relevant" documents with lower similarity scores. Set up automated alerts when average similarity drops below acceptable thresholds.
Temperature Matters More Than You Think: A chatbot with temperature=0.9 produced creative but factually incorrect responses. Users complained about "hallucinations" that were actually intended randomness. Monitor temperature settings per use case.
Common Pitfalls to Avoid
Over-instrumenting Low-Value Calls: Tracing every internal function call creates noise. Focus on user-facing operations, tool usage, and model calls only.
Ignoring Async Context Propagation: In Python, contextvars (which carry trace context) flow across plain awaits and asyncio tasks automatically, but they are lost when work hops to a thread or process pool. Snapshot and restore the context explicitly:
import asyncio
import contextvars

async def async_llm_call():
    # Fine: trace context propagates across normal awaits
    await some_async_operation()

    # Lost by default: blocking work pushed to a thread pool
    # Fix: snapshot the current context and run the callable inside it
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    await loop.run_in_executor(None, lambda: ctx.run(blocking_operation))
Storing PII in Traces: User data in prompts becomes searchable in observability tools. Sanitize sensitive data:
def sanitize_prompt(prompt: str) -> str:
"""Remove PII before logging."""
import re
# Remove email patterns
prompt = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', prompt)
# Remove phone patterns
prompt = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', prompt)
# Remove credit card patterns
prompt = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', prompt)
return prompt
Best Practices for Implementation
Start Simple, Scale Complex: Begin with basic request/response logging, then add token tracking, then full distributed tracing. Don't boil the ocean on day one.
Alert on Business Impact: Technical metrics like latency matter, but business metrics like hallucination rate and customer satisfaction matter more. Correlate them:
def calculate_user_satisfaction_score(session_id: str) -> float:
"""Correlate technical metrics with user feedback."""
# Get technical metrics
session_traces = langfuse.get_session_traces(session_id)
avg_latency = calculate_avg_latency(session_traces)
token_efficiency = calculate_token_efficiency(session_traces)
# Get user feedback
feedback = get_user_feedback(session_id) # thumbs up/down
# Store correlation for analysis
    langfuse.event(
        name="satisfaction_correlation",
        metadata={
            "session_id": session_id,
            "avg_latency_ms": avg_latency,
            "token_efficiency": token_efficiency,
            "user_satisfaction": feedback["score"],
            "user_feedback": feedback["comment"]
        }
    )
return feedback["score"]
Summary & Key Takeaways
- LLM observability is fundamentally different: Non-deterministic outputs, variable costs, and multi-step reasoning require specialized tracing beyond traditional APM
- Three pillars matter most: Traces (full prompt-response journeys), Metrics (token-aware performance), and Logs (structured LLM events)
- LangSmith offers the fastest path: Zero-config tracing for LangChain with LLM-specific debugging UI, but creates vendor lock-in
- OpenTelemetry provides standardization: Vendor-neutral traces work with any backend, but require manual instrumentation for complex workflows
- LangFuse enables self-hosting: Full-featured open-source alternative with cost control and data sovereignty
- Token budgets prevent cost explosions: Always implement per-session token limits and context quality monitoring
- Quality degradation is silent: Set up proactive alerts for context relevance scores and hallucination patterns
- Start simple, scale complexity: Begin with request/response logging, add token tracking, then full distributed tracing
The key insight: traditional "error-free" doesn't exist in LLM systems; monitor for quality degradation, cost drift, and silent failures instead.
Practice Quiz
What is the most critical difference between traditional APM and LLM observability?
- A) LLMs require more CPU monitoring
- B) LLMs produce non-deterministic outputs that can't be compared with simple string matching
- C) LLMs only work with cloud-based monitoring tools
Correct Answer: B
A customer support chatbot's costs jumped from $500 to $3,000 overnight. Using LangSmith traces, you discover the 95th percentile token usage went from 2,000 to 15,000 tokens per request. What's the most likely root cause?
- A) OpenAI increased their pricing
- B) More users are asking questions
- C) Context stuffing - irrelevant documents being included in prompts
Correct Answer: C
Which observability approach offers the best vendor neutrality for LLM applications?
- A) LangSmith with automatic LangChain tracing
- B) OpenTelemetry with standardized LLM semantic conventions
- C) Custom logging with print statements
Correct Answer: B
[Open-ended challenge] Your RAG system retrieves 50 documents but only uses 3 relevant ones, wasting tokens on 47 irrelevant chunks. Design a monitoring strategy to detect and alert on this context inefficiency. What key metrics would you track, and what would trigger alerts? Discuss both technical implementation and business impact considerations.
Written by
Abstract Algorithms
@abstractalgorithms