LLM Observability: Tracing, Logging, and Debugging Production AI Systems
How to Monitor Non-Deterministic AI Systems with LangSmith, OpenTelemetry, and LangFuse
TLDR: LLM observability is radically different from traditional APM: non-deterministic outputs, variable token costs, and multi-step reasoning chains require specialized tracing. LangSmith provides native LangChain integration, OpenTelemetry offers standardization, and LangFuse delivers open-source flexibility. The key: instrument every prompt, capture every token, and alert on cost spikes before they hit your budget.
Your LLM app passed all your evals. In production, 15% of users are getting hallucinated answers and your costs are 3x what you budgeted. You have no idea why because you have no visibility into what the model is actually doing.
Welcome to the unique hell of debugging non-deterministic systems. Unlike traditional web services where HTTP 500 errors are binary failures, LLM applications fail quietly with plausible-sounding nonsense. A user asks for "Python sorting algorithms" and gets a perfectly formatted response about JavaScript promises. Your metrics show 200 OK, but your users get garbage.
This is why traditional Application Performance Monitoring (APM) tools fall short with LLMs. You need specialized observability that captures prompts, tokens, reasoning chains, and the probabilistic nature of AI outputs.
The LLM Observability Challenge: Why Traditional APM Falls Short
Traditional monitoring assumes deterministic systems. Give the same input, get the same output, measure latency and error rates. LLMs shatter these assumptions:
Non-deterministic outputs mean the same prompt can produce different responses based on temperature settings, model state, or even cosmic rays. You can't just compare response strings to detect anomalies.
Variable token costs make every request different. A simple question might cost 100 tokens, while follow-up reasoning explodes to 10,000 tokens. Your cost per request varies by 100x based on user behavior.
Multi-step reasoning chains create complex execution flows. An agent might query a database, call three APIs, perform web searches, and synthesize results. Traditional request tracing captures the HTTP calls but misses the reasoning steps.
Context stuffing happens silently. Your RAG system retrieves 50 documents, but only uses 3. You're paying for 47 irrelevant chunks without knowing it.
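One way to surface this silently wasted spend is to log a context-utilization ratio per request: what fraction of retrieved chunks the final answer actually drew on. A minimal sketch — how you decide a document was "used" (e.g. inline citations) depends on your pipeline, and the IDs here are illustrative:

```python
def context_utilization(retrieved_ids: list, cited_ids: list) -> float:
    """Fraction of retrieved documents that the final answer actually used."""
    if not retrieved_ids:
        return 0.0
    used = set(retrieved_ids) & set(cited_ids)
    return len(used) / len(retrieved_ids)

# 50 chunks retrieved, only 3 cited by the answer -> 6% utilization
rate = context_utilization(list(range(50)), [3, 17, 42])
```

Emitting this ratio as a trace attribute lets you alert when you are consistently paying for context the model ignores.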
Prompt drift occurs as dynamic templates change based on user data. The same logical query generates different prompts, making it impossible to track performance over time.
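Drift becomes trackable if every trace is tagged with a fingerprint of the template rather than the rendered prompt, so the same logical query groups together across users. A minimal sketch, assuming your templates use `{placeholder}` substitution (the reverse-substitution trick here is a heuristic, not a standard API):

```python
import hashlib

def template_fingerprint(rendered_prompt: str, variables: dict) -> str:
    """Recover the template shape by undoing variable substitution, then hash it.

    Grouping traces by this fingerprint lets you compare performance of the
    same logical prompt even as user data changes the rendered text.
    """
    template = rendered_prompt
    for name, value in variables.items():
        template = template.replace(str(value), "{" + name + "}")
    return hashlib.sha256(template.encode()).hexdigest()[:12]

# Two different users, same logical prompt -> same fingerprint
a = template_fingerprint("Answer this question: What is RAG?",
                         {"question": "What is RAG?"})
b = template_fingerprint("Answer this question: How do tries work?",
                         {"question": "How do tries work?"})
```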
Here's what traditional metrics miss:
| Traditional APM | LLM Reality |
| --- | --- |
| Response time: 200ms | Time-to-first-token vs. total generation time |
| Error rate: 2% HTTP 500s | Hallucination rate: 15% plausible but wrong |
| Throughput: 1000 RPS | Token throughput: variable by 100x per request |
| Cost: predictable per request | Cost: $0.001 to $1.00+ per request |
The Three Pillars of LLM Observability
LLM observability rests on three pillars that extend traditional monitoring:
Traces: Full Prompt-to-Response Journeys
Unlike HTTP request traces, LLM traces capture:
- Prompt templates with variable substitution
- Context retrieval and document ranking
- Multi-step reasoning chains in agents
- Tool usage and API calls made by the model
- Output parsing and validation steps
Each trace shows the complete decision path, not just network hops.
Metrics: Token-Aware Performance Indicators
Key metrics specific to LLMs:
| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| Token usage per request | Cost control | 90th percentile > 5000 tokens |
| Time-to-first-token (TTFT) | User experience | > 2 seconds |
| Total latency | End-to-end experience | > 30 seconds |
| Cost per session | Budget burn rate | > $0.50 per session |
| Hallucination rate | Quality degradation | > 5% of responses |
| Context utilization | Efficiency | < 30% of retrieved docs used |
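The thresholds above can be wired into a simple alert check. A sketch — the metric names and limits mirror the table, and delivery to a real alerting system (PagerDuty, Slack) is left out:

```python
# Alert thresholds from the metrics table; "gt"/"lt" is the breach direction
THRESHOLDS = {
    "p90_tokens_per_request": ("gt", 5000),
    "ttft_seconds": ("gt", 2.0),
    "total_latency_seconds": ("gt", 30.0),
    "cost_per_session_usd": ("gt", 0.50),
    "hallucination_rate": ("gt", 0.05),
    "context_utilization": ("lt", 0.30),
}

def check_thresholds(metrics: dict) -> list:
    """Return the names of metrics that breach their alert threshold."""
    alerts = []
    for name, (direction, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (direction == "gt" and value > limit) or \
           (direction == "lt" and value < limit):
            alerts.append(name)
    return alerts

alerts = check_thresholds({
    "ttft_seconds": 3.1,          # breaches: > 2.0
    "hallucination_rate": 0.02,   # fine
    "context_utilization": 0.25,  # breaches: < 0.30
})
```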
Logs: Structured Events with LLM Context
Traditional logs capture HTTP status codes. LLM logs need:
- Prompt construction events
- Model selection and routing decisions
- Context retrieval results with relevance scores
- Tool execution outcomes
- Output validation failures
Each log entry includes the full conversational context to enable debugging.
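In practice that means emitting one structured event per pipeline step instead of free-text log lines, so observability tools can query by field. A minimal sketch using the standard `logging` module with a JSON payload (the field names are illustrative, not a schema):

```python
import json
import logging

logger = logging.getLogger("llm.events")

def log_retrieval_event(trace_id: str, query: str, docs: list) -> dict:
    """Emit a structured context-retrieval event that downstream tools can query."""
    event = {
        "event": "context_retrieval",
        "trace_id": trace_id,
        "query": query,
        "num_docs": len(docs),
        "relevance_scores": [d["score"] for d in docs],
    }
    # One JSON object per line keeps the log machine-parseable
    logger.info(json.dumps(event))
    return event

event = log_retrieval_event("tr_123", "refund policy",
                            [{"score": 0.82}, {"score": 0.44}])
```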
How LangSmith Traces Every Token and Decision in Your AI Pipeline
LangSmith is LangChain's native observability platform, designed specifically for LLM applications. It captures the full execution graph of chains, agents, and tools with zero configuration.
Automatic Instrumentation
LangSmith automatically traces all LangChain components:
import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langsmith import traceable
# Set LangSmith API key - no other config needed
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_key"
# This chain is automatically traced
llm = ChatOpenAI(temperature=0.7)
prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer this question: {question}"
)
chain = LLMChain(llm=llm, prompt=prompt)
# Every invoke() call creates a trace
response = chain.invoke({"question": "What is quantum computing?"})
LangSmith captures:
- Prompt template with variables
- Token counts (input and output)
- Model parameters (temperature, max_tokens)
- Execution time (TTFT and total)
- Response content and metadata
Manual Instrumentation with @traceable
For custom functions outside LangChain:
from langchain.chat_models import ChatOpenAI
from langsmith import traceable
import requests
@traceable
def fetch_context_documents(query: str, top_k: int = 5) -> list:
    """Retrieve relevant documents for RAG context."""
    # Custom retrieval logic: send the payload as JSON, not form data
    response = requests.post(
        "http://vector-db:8000/search",
        json={"query": query, "limit": top_k},
    )
    docs = response.json()["documents"]
    # LangSmith automatically captures inputs, outputs, and metadata
    return docs
@traceable
def synthesize_response(question: str, context_docs: list) -> str:
"""Generate response using retrieved context."""
context = "\n".join([doc["content"] for doc in context_docs])
prompt = f"""
Context: {context}
Question: {question}
Answer based only on the context provided.
"""
# This LLM call is also traced automatically
llm = ChatOpenAI(temperature=0.1)
response = llm.invoke(prompt)
return response.content
# Usage creates nested traces
def answer_question(user_question: str):
docs = fetch_context_documents(user_question)
answer = synthesize_response(user_question, docs)
return answer
RunTree for Complex Workflows
For maximum control over tracing:
from langsmith import Client
from langsmith.run_trees import RunTree
client = Client()
def process_user_query(question: str, user_id: str):
    # Create the parent run; RunTree objects are built up, ended, then posted
    parent_run = RunTree(
        name="user_query_processing",
        run_type="chain",
        inputs={"question": question, "user_id": user_id},
        client=client
    )

    # Step 1: Query classification
    classify_run = parent_run.create_child(
        name="classify_intent",
        run_type="llm",
        inputs={"question": question}
    )
    intent = classify_query_intent(question)
    classify_run.end(outputs={"intent": intent})

    # Step 2: Context retrieval
    retrieval_run = parent_run.create_child(
        name="retrieve_context",
        run_type="retriever",
        inputs={"question": question, "intent": intent}
    )
    docs = fetch_relevant_docs(question, intent)
    retrieval_run.end(outputs={
        "num_docs": len(docs),
        "relevance_scores": [d["score"] for d in docs]
    })

    # Step 3: Response generation
    generation_run = parent_run.create_child(
        name="generate_response",
        run_type="llm",
        inputs={"question": question}
    )
    response = generate_answer(question, docs)
    generation_run.end(outputs={"response": response})

    parent_run.end(outputs={"final_answer": response})
    parent_run.post(exclude_child_runs=False)  # upload the whole tree
    return response
LangSmith's dashboard shows the complete execution tree with timing, costs, and intermediate outputs for every step.
Deep Dive: OpenTelemetry for Standardized LLM Instrumentation
The Internals
OpenTelemetry (OTel) provides vendor-neutral observability for LLM applications through standardized spans and metrics. The LLM semantic conventions define specific attributes for AI workloads:
Span Structure:
operation.name: "llm.completion"
llm.vendor: "openai"
llm.model.name: "gpt-4"
llm.model.version: "2024-02-15-preview"
llm.temperature: 0.7
llm.max_tokens: 1000
llm.token_count.prompt: 156
llm.token_count.completion: 89
llm.latency.time_to_first_token: 1.2s
Memory Layout: OTel keeps active traces in memory as linked span objects, each holding attributes, events, and links to parent/child spans. Finished spans are handed to a processor that batches and exports them to backends like Jaeger or Datadog.
State Management: The tracer maintains active span context using thread-local storage or async context variables, enabling automatic parent-child relationships across async operations.
Performance Analysis
Time Complexity: O(1) for span creation and attribute setting. The tracer uses hash tables for span lookup and linked lists for span relationships.
Space Complexity: O(n) where n is the number of active spans. Each span stores ~1KB of metadata plus variable attribute data.
Bottlenecks:
- Span export becomes the limiting factor at scale (1000+ spans/second)
- Attribute serialization for complex objects (embeddings, large prompts)
- Network I/O to observability backends during export batches
Mitigation: Use async exporters, compress span data, and sample high-volume operations.
Visualizing LLM Request Flows with Distributed Tracing
LLM applications create complex execution graphs that traditional request tracing can't capture. Here's how a typical RAG agent execution looks with proper instrumentation:
graph TD
A[User Question] --> B[Intent Classification]
B --> C[Vector Search]
B --> D[Knowledge Graph Query]
C --> E[Document Ranking]
D --> F[Entity Resolution]
E --> G[Context Assembly]
F --> G
G --> H[Prompt Construction]
H --> I[LLM Generation]
I --> J[Output Validation]
J --> K{Valid Response?}
K -->|Yes| L[Response Formatting]
K -->|No| M[Fallback Generation]
M --> N[Secondary LLM Call]
N --> L
L --> O[User Response]
style B fill:#e1f5fe
style C fill:#e8f5e8
style D fill:#e8f5e8
style I fill:#fff3e0
style N fill:#fff3e0
style J fill:#fce4ec
Each node in this graph becomes a span in your distributed trace, with the following key attributes:
| Span Type | Key Attributes | Example Values |
| --- | --- | --- |
| Intent Classification | llm.prompt, llm.tokens.input, classification.confidence | "Classify: 'What is RAG?'", 23, 0.94 |
| Vector Search | search.query, search.results.count, search.similarity.threshold | "RAG retrieval", 15, 0.7 |
| LLM Generation | llm.model, llm.tokens.total, llm.cost.usd | "gpt-4-turbo", 1247, $0.031 |
| Output Validation | validation.rules, validation.passed, validation.errors | ["factual", "relevant"], true, [] |
Real-World Debugging: From Prompt Drift to Cost Explosions
Case Study 1: The Hallucinating Support Bot
Situation: A customer support chatbot started giving incorrect refund policies, causing compliance issues.
Investigation with LangSmith:
- Trace Analysis: Filtered traces by "refund" keyword, found 23% contained hallucinated policy details
- Prompt Inspection: Discovered the knowledge base retrieval was returning outdated documents
- Context Quality: Document relevance scores averaged 0.4 (below 0.7 threshold)
Root Cause: The vector database hadn't been updated with new policies, so context retrieval failed silently.
Solution: Added context quality monitoring and alerts when average relevance drops below 0.6.
@traceable
def validate_context_quality(retrieved_docs: list, threshold: float = 0.6):
    if not retrieved_docs:
        return 0.0  # guard: empty retrieval is the worst case, not a crash
    avg_score = sum(doc["relevance_score"] for doc in retrieved_docs) / len(retrieved_docs)
    if avg_score < threshold:
        # Log warning and trigger alert
        logger.warning(f"Context quality below threshold: {avg_score:.2f}")
        send_slack_alert(f"RAG context quality degraded to {avg_score:.2f}")
    return avg_score
Case Study 2: Token Cost Explosion
Situation: Daily LLM costs jumped from $500 to $3,000 overnight with no change in user volume.
Investigation with OpenTelemetry Metrics:
# Cost tracking with OTel metrics
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
cost_counter = meter.create_counter(
"llm_cost_total_usd",
description="Total LLM API costs"
)
token_histogram = meter.create_histogram(
"llm_tokens_per_request",
description="Token usage distribution"
)
def track_llm_costs(model_name: str, input_tokens: int, output_tokens: int):
# OpenAI GPT-4 pricing: $0.03/1K input, $0.06/1K output
input_cost = (input_tokens / 1000) * 0.03
output_cost = (output_tokens / 1000) * 0.06
total_cost = input_cost + output_cost
# Record metrics with attributes
cost_counter.add(total_cost, {"model": model_name, "cost_type": "api_usage"})
token_histogram.record(input_tokens + output_tokens, {"model": model_name})
Root Cause Analysis: The 95th percentile token usage jumped from 2,000 to 15,000 tokens per request. Tracing revealed that a code generation feature was including entire codebases as context.
Solution: Implemented token budget limits and smart context truncation:
def truncate_context_by_budget(documents: list, token_budget: int = 4000):
"""Keep only the most relevant docs within token budget."""
sorted_docs = sorted(documents, key=lambda x: x["relevance_score"], reverse=True)
total_tokens = 0
selected_docs = []
for doc in sorted_docs:
doc_tokens = count_tokens(doc["content"])
if total_tokens + doc_tokens <= token_budget:
selected_docs.append(doc)
total_tokens += doc_tokens
else:
break
return selected_docs, total_tokens
Trade-offs: Native Tools vs. Open Standards vs. Cost
Performance vs. Vendor Lock-in
LangSmith provides the richest LLM-specific features but locks you into the LangChain ecosystem:
Pros:
- Zero-config automatic instrumentation for LangChain
- LLM-specific UI for prompt debugging and dataset comparison
- Built-in evaluation workflows and A/B testing
Cons:
- Vendor lock-in to LangChain architecture
- Additional cost on top of LLM API fees
- Limited customization of trace data structure
OpenTelemetry offers vendor neutrality but requires more setup:
Pros:
- Send traces to any backend (DataDog, New Relic, Grafana)
- Standardized LLM semantic conventions
- No vendor lock-in, full control over data
Cons:
- Manual instrumentation for non-standard LLM workflows
- Generic observability UIs lack LLM-specific debugging features
- More complex setup and configuration
Correctness vs. Cost Trade-offs
Comprehensive tracing captures every prompt and response, but storage costs scale with volume:
| Trace Sampling Rate | Monthly Cost (100K requests) | Debug Capability |
| --- | --- | --- |
| 100% (all traces) | $200-500 | Full debugging |
| 10% (sample) | $20-50 | Statistical analysis only |
| 1% (errors + sample) | $5-15 | Error debugging only |
Failure Modes:
- Sampling bias: Critical edge cases missed in sampled data
- Storage explosion: Full prompt/response traces consume 10x more space than HTTP logs
- Privacy leaks: Traces contain user data and proprietary prompts
Mitigation Strategies:
# Adaptive sampling based on error rates and costs
def get_sampling_rate(error_rate: float, daily_cost: float) -> float:
if error_rate > 0.05: # High error rate
return 1.0 # Sample everything
elif daily_cost > 1000: # High cost
return 0.1 # Sample 10%
else:
return 0.05 # Standard 5% sampling
Decision Guide: Choosing Your LLM Observability Stack
| Situation | Recommendation |
| --- | --- |
| Use LangSmith when | Building with LangChain, need rapid prototyping, team < 10 developers, budget allows vendor tooling |
| Use OpenTelemetry when | Multi-vendor setup, existing OTel infrastructure, need custom trace backends, compliance requirements |
| Use LangFuse when | Self-hosted requirements, open-source preference, custom evaluation metrics, cost optimization priority |
| Alternative approaches | DataDog APM + custom LLM metrics, Grafana + Prometheus for metrics-only, ELK stack for log-centric debugging |
| Edge cases | High-security environments (air-gapped deployments), extreme scale (1M+ requests/day), multi-cloud architectures |
Practical Examples: Instrumenting a LangChain Agent with LangSmith
Example 1: Customer Support Agent with Tool Usage
import os
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.tools import Tool
from langsmith import traceable
import requests
# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_key"
os.environ["LANGCHAIN_PROJECT"] = "customer-support-agent"
@traceable
def search_knowledge_base(query: str) -> str:
    """Search internal knowledge base for support articles."""
    # Send the payload as JSON, not form data
    response = requests.post(
        "http://kb-api:8000/search",
        json={"query": query, "max_results": 3},
    )
    articles = response.json()["articles"]
    # LangSmith captures this function's inputs/outputs automatically
    return "\n".join([f"Article: {a['title']}\n{a['content']}" for a in articles])
@traceable
def get_order_status(order_id: str) -> str:
"""Fetch current order status from order management system."""
response = requests.get(f"http://orders-api:8000/orders/{order_id}")
order = response.json()
return f"Order {order_id}: Status = {order['status']}, Expected delivery = {order['delivery_date']}"
# Define tools for the agent
tools = [
Tool(
name="search_knowledge_base",
description="Search support articles and documentation",
func=search_knowledge_base
),
Tool(
name="get_order_status",
description="Get current status of customer orders by order ID",
func=get_order_status
)
]
# Create the agent
llm = ChatOpenAI(temperature=0.1, model="gpt-4")
from langchain.prompts import MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful customer support agent. Use the available tools to help customers.
Available tools:
- search_knowledge_base: Find relevant support articles
- get_order_status: Check order status by order ID
Always be polite and provide specific, actionable information."""),
    ("human", "{input}"),
    # The agent scratchpad must be a messages placeholder, not a plain string variable
    MessagesPlaceholder(variable_name="agent_scratchpad")
])
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# Usage - all steps automatically traced in LangSmith
# (traceable is a decorator, not a context manager; function args become trace inputs)
@traceable(name="support_request")
def handle_support_request(customer_message: str, customer_id: str):
    # customer_id is recorded as a trace input alongside the message
    response = agent_executor.invoke({"input": customer_message})
    return response["output"]
# Example usage
result = handle_support_request(
"Hi, I need help with my order #12345. It was supposed to arrive yesterday.",
customer_id="cust_789"
)
print(result)
LangSmith Dashboard Output:
- Agent Execution Trace: Shows the reasoning steps, tool calls, and final response
- Tool Usage Metrics: Number of API calls, response times, success rates
- Token Consumption: Tracks tokens used in system prompts, tool descriptions, and responses
- Cost Attribution: Breaks down costs by tool usage, model calls, and customer session
Example 2: RAG System with Context Quality Monitoring
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langsmith import traceable
import numpy as np
class ObservableRAG:
def __init__(self, vector_store_path: str):
self.embeddings = OpenAIEmbeddings()
self.vectorstore = Chroma(
persist_directory=vector_store_path,
embedding_function=self.embeddings
)
self.llm = ChatOpenAI(temperature=0.1, model="gpt-4")
@traceable
def retrieve_with_quality_check(self, query: str, k: int = 5) -> dict:
"""Retrieve documents and assess context quality."""
        # Get similarity search results with scores
        # NOTE: Chroma returns distances here (lower = closer), not similarities;
        # normalize or invert before comparing against similarity thresholds
        docs_with_scores = self.vectorstore.similarity_search_with_score(query, k=k)
# Calculate context quality metrics
scores = [score for _, score in docs_with_scores]
avg_similarity = np.mean(scores)
min_similarity = min(scores)
score_variance = np.var(scores)
docs = [doc for doc, _ in docs_with_scores]
# Log quality metrics for monitoring
context_quality = {
"avg_similarity": avg_similarity,
"min_similarity": min_similarity,
"score_variance": score_variance,
"num_docs_retrieved": len(docs),
"query_length": len(query.split())
}
# Alert on poor context quality
if avg_similarity < 0.6:
self._log_quality_alert("Low average similarity", context_quality)
if min_similarity < 0.3:
self._log_quality_alert("Very low minimum similarity", context_quality)
return {
"documents": docs,
"quality_metrics": context_quality,
"similarity_scores": scores
}
@traceable
def _log_quality_alert(self, alert_type: str, metrics: dict):
"""Log context quality alerts for monitoring."""
print(f"ALERT: {alert_type} - Metrics: {metrics}")
# In production, send to monitoring system
@traceable
def generate_answer(self, query: str) -> dict:
"""Generate answer with full observability."""
# Step 1: Retrieve and assess context
retrieval_result = self.retrieve_with_quality_check(query)
docs = retrieval_result["documents"]
quality = retrieval_result["quality_metrics"]
# Step 2: Build context-aware prompt
context = "\n\n".join([doc.page_content for doc in docs])
prompt = f"""
Use the following context to answer the question. If the context doesn't contain
relevant information, say so explicitly.
Context:
{context}
Question: {query}
Answer:
"""
# Step 3: Generate response
response = self.llm.invoke(prompt)
# Step 4: Return with full metadata
return {
"answer": response.content,
"context_quality": quality,
"num_context_chars": len(context),
"source_documents": len(docs)
}
# Usage with automatic LangSmith tracing
rag = ObservableRAG("./chroma_db")
result = rag.generate_answer("What are the key principles of microservices architecture?")
print(f"Answer: {result['answer']}")
print(f"Context Quality: {result['context_quality']}")
Monitoring Dashboard Insights:
- Context Quality Trends: Track average similarity scores over time to detect knowledge base drift
- Retrieval Performance: Monitor query types that produce low-quality contexts
- Cost Per Question: Understand token usage patterns by question complexity
- Answer Quality Correlation: Compare context quality with user satisfaction ratings
LangFuse: Open-Source Alternative for Self-Hosted Observability
LangFuse provides comprehensive LLM observability without vendor lock-in, offering features comparable to LangSmith with full self-hosting capabilities.
Key Features:
- Distributed tracing for multi-step LLM workflows
- Cost tracking with granular token-level attribution
- A/B testing and evaluation workflows
- Dataset management for prompt engineering
- Team collaboration with shared dashboards
Self-Hosted Setup:
# Install LangFuse SDK
# pip install langfuse
import os
from langfuse import Langfuse
from langfuse.decorators import observe
# Initialize self-hosted LangFuse instance
langfuse = Langfuse(
host="https://your-langfuse-instance.com",
public_key="pk_your_public_key",
secret_key="sk_your_secret_key"
)
@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4") -> str:
"""LLM call with automatic LangFuse tracing."""
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.1
)
# LangFuse automatically captures:
# - Input prompt
# - Model parameters
# - Token usage
# - Response content
# - Timing information
return response.choices[0].message.content
@observe()  # defaults to a span; as_type is only needed for generations
def process_user_query(query: str) -> dict:
"""Main processing pipeline with nested tracing."""
# Step 1: Intent classification
intent_prompt = f"Classify this user query: {query}"
intent = call_llm(intent_prompt, model="gpt-3.5-turbo")
# Step 2: Generate response based on intent
if "technical" in intent.lower():
response_prompt = f"Provide a technical answer to: {query}"
response = call_llm(response_prompt, model="gpt-4")
else:
response_prompt = f"Provide a simple answer to: {query}"
response = call_llm(response_prompt, model="gpt-3.5-turbo")
return {
"intent": intent,
"response": response,
"model_used": "gpt-4" if "technical" in intent.lower() else "gpt-3.5-turbo"
}
# Usage creates nested traces in LangFuse
result = process_user_query("How does neural attention work in transformers?")
Cost Analysis Dashboard:
# Custom cost tracking with LangFuse events
@observe()
def track_session_cost(user_id: str, session_id: str, total_tokens: int, model: str):
"""Track costs at session level for budgeting."""
    # Model pricing in USD per 1K tokens
pricing = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
}
# Estimate cost (simplified)
cost_per_1k = pricing[model]["input"] # Assume mostly input tokens
estimated_cost = (total_tokens / 1000) * cost_per_1k
# Log to LangFuse for cost analytics
    langfuse.event(
        name="session_cost_tracking",
        metadata={
            "user_id": user_id,
            "session_id": session_id,
            "total_tokens": total_tokens,
            "model": model,
            "estimated_cost_usd": estimated_cost
        }
    )
return estimated_cost
LangFuse vs. LangSmith Comparison:
| Feature | LangFuse | LangSmith |
| --- | --- | --- |
| Hosting | Self-hosted or cloud | LangChain cloud only |
| Cost | Free (self-hosted) + infrastructure | $20-200/month per project |
| Data Control | Full control, on-premise | Data stored with LangChain |
| LangChain Integration | Manual instrumentation | Automatic tracing |
| Custom Metrics | Full flexibility | Predefined LLM metrics |
| Team Features | Open-source collaboration | Built-in team management |
Lessons Learned from Production LLM Monitoring
Key Insights from Real Deployments
Token Budget Monitoring is Critical: One startup burned through their entire monthly OpenAI budget in 3 days because a recursive agent got stuck in a reasoning loop. Always implement token limits per session.
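A per-session budget can be as simple as a counter that raises before the next model call, which would have stopped that runaway loop after a few iterations. A minimal sketch (the limits are illustrative):

```python
class TokenBudgetExceeded(Exception):
    pass

class SessionTokenBudget:
    """Hard cap on the tokens a single session may consume across all model calls."""

    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Call after every model response with its reported token usage;
        # the exception stops the loop before the next call is made
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"session used {self.used} tokens (limit {self.max_tokens})"
            )

budget = SessionTokenBudget(max_tokens=10_000)
budget.charge(4_000)   # within budget
try:
    budget.charge(7_000)  # pushes past the limit
    tripped = False
except TokenBudgetExceeded:
    tripped = True
```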
Context Quality Degrades Silently: Vector databases can drift as content changes, but retrieval still returns "relevant" documents with lower similarity scores. Set up automated alerts when average similarity drops below acceptable thresholds.
Temperature Matters More Than You Think: A chatbot with temperature=0.9 produced creative but factually incorrect responses. Users complained about "hallucinations" that were actually intended randomness. Monitor temperature settings per use case.
Common Pitfalls to Avoid
Over-instrumenting Low-Value Calls: Tracing every internal function call creates noise. Focus on user-facing operations, tool usage, and model calls only.
Ignoring Async Context Propagation: In Python, contextvars (which carry trace context) flow across plain awaits and asyncio tasks automatically, but they are lost when work hops to a thread or process pool. Snapshot and restore the context explicitly:
import asyncio
import contextvars

async def async_llm_call():
    # Fine: trace context propagates across normal awaits
    await some_async_operation()

    # Lost by default: blocking work pushed to a thread pool
    # Fix: snapshot the current context and run the callable inside it
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    await loop.run_in_executor(None, lambda: ctx.run(blocking_operation))
Storing PII in Traces: User data in prompts becomes searchable in observability tools. Sanitize sensitive data:
def sanitize_prompt(prompt: str) -> str:
"""Remove PII before logging."""
import re
# Remove email patterns
prompt = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', prompt)
# Remove phone patterns
prompt = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', prompt)
# Remove credit card patterns
prompt = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', prompt)
return prompt
Best Practices for Implementation
Start Simple, Scale Complex: Begin with basic request/response logging, then add token tracking, then full distributed tracing. Don't boil the ocean on day one.
Alert on Business Impact: Technical metrics like latency matter, but business metrics like hallucination rate and customer satisfaction matter more. Correlate them:
def calculate_user_satisfaction_score(session_id: str) -> float:
"""Correlate technical metrics with user feedback."""
# Get technical metrics
session_traces = langfuse.get_session_traces(session_id)
avg_latency = calculate_avg_latency(session_traces)
token_efficiency = calculate_token_efficiency(session_traces)
# Get user feedback
feedback = get_user_feedback(session_id) # thumbs up/down
# Store correlation for analysis
    langfuse.event(
        name="satisfaction_correlation",
        metadata={
            "session_id": session_id,
            "avg_latency_ms": avg_latency,
            "token_efficiency": token_efficiency,
            "user_satisfaction": feedback["score"],
            "user_feedback": feedback["comment"]
        }
    )
return feedback["score"]
Summary & Key Takeaways
- LLM observability is fundamentally different: Non-deterministic outputs, variable costs, and multi-step reasoning require specialized tracing beyond traditional APM
- Three pillars matter most: Traces (full prompt-response journeys), Metrics (token-aware performance), and Logs (structured LLM events)
- LangSmith offers the fastest path: Zero-config tracing for LangChain with LLM-specific debugging UI, but creates vendor lock-in
- OpenTelemetry provides standardization: Vendor-neutral traces work with any backend, but require manual instrumentation for complex workflows
- LangFuse enables self-hosting: Full-featured open-source alternative with cost control and data sovereignty
- Token budgets prevent cost explosions: Always implement per-session token limits and context quality monitoring
- Quality degradation is silent: Set up proactive alerts for context relevance scores and hallucination patterns
- Start simple, scale complexity: Begin with request/response logging, add token tracking, then full distributed tracing
The key insight: traditional "error-free" doesn't exist in LLM systems; monitor for quality degradation, cost drift, and silent failures instead.
Practice Quiz
What is the most critical difference between traditional APM and LLM observability?
- A) LLMs require more CPU monitoring
- B) LLMs produce non-deterministic outputs that can't be compared with simple string matching
- C) LLMs only work with cloud-based monitoring tools
Correct Answer: B
A customer support chatbot's costs jumped from $500 to $3,000 overnight. Using LangSmith traces, you discover the 95th percentile token usage went from 2,000 to 15,000 tokens per request. What's the most likely root cause?
- A) OpenAI increased their pricing
- B) More users are asking questions
- C) Context stuffing - irrelevant documents being included in prompts
Correct Answer: C
Which observability approach offers the best vendor neutrality for LLM applications?
- A) LangSmith with automatic LangChain tracing
- B) OpenTelemetry with standardized LLM semantic conventions
- C) Custom logging with print statements
Correct Answer: B
[Open-ended challenge] Your RAG system retrieves 50 documents but only uses 3 relevant ones, wasting tokens on 47 irrelevant chunks. Design a monitoring strategy to detect and alert on this context inefficiency. What key metrics would you track, and what would trigger alerts? Discuss both technical implementation and business impact considerations.
Written by
Abstract Algorithms
@abstractalgorithms