Your agent just made three redundant API calls, retrieved stale context, and returned a confident but wrong answer — and your only visibility is 200 OK in your server logs.
This is the reality for most teams that have shipped AI agents. You’ve handled retries, structured your prompts, maybe even added evals to your CI pipeline. But in production, agents are still flying blind.
Traditional monitoring tools capture latency and error rates. They miss the decisions that actually matter: which tool was selected, why retrieval returned irrelevant chunks, how far a reasoning chain drifted before reaching its conclusion.
AI agent observability fills that gap. This guide walks through a complete, working setup using Langfuse and OpenTelemetry — covering instrumentation, multi-step trace capture, per-trace cost attribution, and production dashboards — so you can stop guessing and start debugging.
Why Your Agents Are a Black Box (and Why That’s a Production Risk)
Datadog can tell you your agent responded in 4.2 seconds. It cannot tell you that 3.8 of those seconds were spent on a redundant retrieval pass, or that the tool selected at step 2 was factually wrong for the user’s intent.
This isn’t a tooling gap — it’s an architectural mismatch. Traditional APM systems were built for deterministic services where the same input reliably produces the same output. LLM agents don’t work that way.
The same prompt can take radically different execution paths depending on context, temperature, and model state. Without observability designed for non-determinism, you’re debugging with tools built for a different class of problem.
The risk is measurable. Gartner found that 85% of AI projects fail, and organizations with structured monitoring frameworks are significantly more likely to achieve successful outcomes. Yet only 7% of companies currently use LLM observability extensively in production, while 47% are still in proof-of-concept territory (Grafana Labs Observability Survey 2025). That gap between deployment and visibility is where production incidents live.
Three failure modes are especially invisible without agent-level tracing:
- Tool selection drift — the agent calls a broadly capable tool when a precise one was available, degrading output quality gradually across thousands of sessions
- Retrieval quality collapse — your RAG pipeline starts returning irrelevant chunks after a document store update, but you don’t see it until users complain
- Cost loops — a planning agent that can’t resolve a subtask re-invokes itself repeatedly, accumulating token costs silently
You can’t fix what you can’t see.
AI Agent Observability Stack: Traces, Spans, and Generations Explained
Before touching any configuration, you need a mental model for how agent observability is structured — because it maps directly to how Langfuse organizes your data.
Traces are the top-level container for a single agent run. One user request equals one trace. Everything that happens during that run — planning, tool calls, retrieval, sub-agent invocations, final synthesis — lives inside it.
Spans are the individual steps within a trace. When your agent calls a web search tool, that’s a span. When it queries a vector database, that’s a span. Spans can be nested: a parent span for “research phase” might contain three child spans for the tool calls it made.
Generations are a special span type for LLM calls specifically. Langfuse tracks the model, input tokens, output tokens, and cost automatically at the generation level — which is what enables per-trace cost attribution later.
This three-layer model maps cleanly onto an agent’s lifecycle:
- User message arrives → new Trace created
- Agent enters planning phase → Span: plan
- Agent decides to call a tool → child Span: tool:web_search
- Agent sends prompt to GPT-4o → Generation (with token counts)
- Agent synthesizes final response → Span: synthesize
- Trace closes with final output, total cost, and duration
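To make the nesting concrete, here is the three-layer model sketched as plain data structures. The class and field names are illustrative, not Langfuse's internal schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """One step inside a trace; spans can nest via children."""
    name: str
    children: List["Span"] = field(default_factory=list)

@dataclass
class Generation(Span):
    """A special span type for LLM calls, carrying token accounting."""
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0

@dataclass
class Trace:
    """Top-level container: one user request, one trace."""
    name: str
    spans: List[Span] = field(default_factory=list)

trace = Trace("user-request", spans=[
    Span("plan"),
    Span("research", children=[Span("tool:web_search")]),
    Generation("synthesize", model="gpt-4o", input_tokens=1200, output_tokens=300),
])
```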
Once you see your agents through this lens, the Langfuse setup below will make immediate sense.
Why OpenTelemetry Is the Only Architecture Decision That Prevents Vendor Lock-In
Here’s where most teams make an expensive mistake: they instrument directly against a specific observability vendor’s SDK. It feels fast, it works — and six months later they want to route traces to a different backend and they’re looking at a full re-instrumentation project.
OpenTelemetry (OTEL) addresses this at the protocol level. It defines semantic conventions for LLM telemetry — specifically how to represent model names, token counts, prompts, completions, and tool calls in a vendor-neutral format. When you instrument to OTEL conventions, you can point your traces at Langfuse today and Arize or Datadog LLM Observability tomorrow without changing your instrumentation code.
For teams running multi-model stacks in production, this becomes critical: as you swap models or add routing layers, your trace schema stays consistent across backends.
The practical setup:
– Use opentelemetry-sdk as your instrumentation layer
– Use Langfuse as the OTEL-compatible backend (it accepts OTLP exports natively)
– Your spans, attributes, and semantic conventions stay portable
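As a sketch of that wiring: Langfuse's OTLP endpoint authenticates with HTTP Basic auth built from your project's public and secret keys. The helper below constructs that header; the endpoint path and exporter class names in the comment are assumptions based on standard OTEL packages, so verify them against the Langfuse docs for your deployment:

```python
import base64

def langfuse_otlp_headers(public_key: str, secret_key: str) -> dict:
    """Build the Basic-auth header for OTLP exports to Langfuse.

    Langfuse uses the project's public key as the username and the
    secret key as the password.
    """
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# Wiring into a standard OTLP/HTTP exporter (package and endpoint
# names assumed -- check your Langfuse deployment's docs):
#
#   from opentelemetry.sdk.trace import TracerProvider
#   from opentelemetry.sdk.trace.export import BatchSpanProcessor
#   from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
#
#   exporter = OTLPSpanExporter(
#       endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
#       headers=langfuse_otlp_headers("pk-lf-...", "sk-lf-..."),
#   )
#   provider = TracerProvider()
#   provider.add_span_processor(BatchSpanProcessor(exporter))
```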
The LLM observability market is growing at 36.3% CAGR and is projected to reach $9.26 billion by 2030 (Research and Markets, 2025). Multiple backends will emerge and consolidate. Betting on one vendor’s proprietary format today is a re-work tax you’ll pay later.
Setting Up Langfuse in Under 10 Minutes: A Step-by-Step Walkthrough
Langfuse is open-source and self-hostable, or you can use their managed cloud. Either way, the SDK setup is identical.
Step 1: Install the SDK
# Python
pip install langfuse openai
# Node/TypeScript
npm install langfuse openai
Step 2: Set environment variables
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com # or your self-hosted URL
Step 3: Wrap your OpenAI client (zero code changes)
from langfuse.openai import openai  # Drop-in replacement

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)
That’s it for basic tracing. Every call through the wrapped client automatically creates a trace with the model, prompt, completion, token counts, and latency — no manual instrumentation required.
Step 4: Enrich your traces
Raw traces are useful. Enriched traces are actionable:
from langfuse import Langfuse

langfuse = Langfuse()

trace = langfuse.trace(
    name="document-summarizer",
    user_id="user_8472",
    session_id="session_abc123",  # Critical for multi-turn correlation
    tags=["production", "summarization"],
    metadata={"document_type": "contract", "page_count": 12}
)
The session_id field deserves special attention. Without it, a multi-turn conversation appears as disconnected traces in your dashboard. With a stable session ID propagated across turns, you can reconstruct the full decision history of any conversation — which is the difference between “something went wrong” and “here’s exactly where and why it went wrong.”
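A minimal sketch of that propagation, assuming the `langfuse.trace()` call shown above: generate one session ID when the conversation starts, then pass the same value into every per-turn trace. The helper names here are hypothetical:

```python
import uuid

def new_session_id() -> str:
    """One stable session ID for the whole conversation."""
    return f"session_{uuid.uuid4().hex[:12]}"

def trace_params(session_id: str, user_id: str, turn: int) -> dict:
    """Per-turn kwargs for langfuse.trace(); session_id stays constant
    so Langfuse can stitch the turns into one conversation view."""
    return {
        "name": "chat-turn",
        "user_id": user_id,
        "session_id": session_id,
        "metadata": {"turn": turn},
    }

session = new_session_id()  # created once, reused every turn
turn1 = trace_params(session, "user_8472", 1)
turn2 = trace_params(session, "user_8472", 2)
```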
Tracing Multi-Step Agents: Capturing Tool Calls, Sub-Agents, and Retrieval as Nested Spans
Single LLM calls are easy to trace. Multi-step agents — where a planner spawns subtasks, each of which calls tools and invokes retrieval — require explicit span nesting to produce a readable trace tree.
with langfuse.start_as_current_span(name="agent-run") as agent_span:
    agent_span.update(input={"user_query": user_input})

    # Tool call
    with langfuse.start_as_current_span(name="tool:web_search") as tool_span:
        result = web_search(query)
        tool_span.update(output={"results": result})

    # Retrieval step
    with langfuse.start_as_current_span(name="retrieval:vector_db") as retrieval_span:
        chunks = vector_db.query(embedding)
        retrieval_span.update(output={"chunks_returned": len(chunks)})

    # LLM synthesis — auto-traced through wrapped client
    response = openai.chat.completions.create(...)

    agent_span.update(output={"final_response": response})
The resulting trace in Langfuse shows a waterfall of nested spans, with timing and token data at each level. If retrieval consumed 3.2 seconds of a 3.5-second trace, you know immediately the optimization target is the vector DB query — not the LLM call.
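That waterfall analysis can be automated. A minimal sketch, assuming you have extracted (span name, duration) pairs from a trace via the Langfuse API:

```python
from typing import List, Tuple

def dominant_span(spans: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the span consuming the largest share of trace time,
    as (name, fraction_of_total)."""
    total = sum(duration for _, duration in spans)
    name, duration = max(spans, key=lambda s: s[1])
    return name, duration / total

# Illustrative timings matching the example above
name, share = dominant_span([
    ("tool:web_search", 0.2),
    ("retrieval:vector_db", 3.2),
    ("generation:gpt-4o", 0.1),
])
# `name` is the optimization target; `share` is its fraction of trace time
```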
For teams running agents in parallel with isolated contexts, proper span isolation becomes essential: each agent run needs its own trace so parallel execution doesn’t produce interleaved, unreadable data in the dashboard.
LangGraph’s self-correcting agents and the OpenAI Agents SDK both have Langfuse integrations that handle span nesting automatically via callbacks. Understanding the manual approach above means you can instrument any custom agent architecture, regardless of framework.
Tracking Costs Per Trace: How to Catch Runaway Agent Loops Before They Burn Your Budget
Token costs accumulate invisibly until they don’t. A planning agent that can’t resolve a subtask might loop eight times on the same prompt before hitting a retry limit — and each iteration costs real money. Without per-trace attribution, you see an elevated monthly bill, not the specific agent pattern driving it.
Langfuse tracks cost at the Generation level automatically when you use the wrapped OpenAI client. Every trace in the dashboard shows:
– Total input and output tokens
– Model used per generation
– Dollar cost at current model pricing
– Cost breakdown across individual spans
To surface cost anomalies proactively, add a threshold check to your agent’s exit logic:
trace = langfuse.get_current_trace()
if trace.total_cost > 0.05:  # $0.05 per trace ceiling
    logger.warning(f"High-cost trace: {trace.id}, cost: ${trace.total_cost:.4f}")
    # Alert or terminate the agent loop
You can also use Langfuse’s scoring API to tag high-cost traces for batch review — which feeds directly into the alerting strategy in the final section.
The pattern to watch for: traces where token count grows linearly with turn number. In a healthy multi-turn agent, later turns should reference earlier context efficiently via summarization or compression. If token counts keep climbing, the agent is re-processing previous context instead of compressing it — a prompt engineering problem you can now detect and quantify.
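One way to detect this pattern, sketched here as a simple linear-regression slope over per-turn prompt token counts (the threshold is an assumption you would tune per agent):

```python
from typing import Sequence

def tokens_growing_linearly(per_turn_tokens: Sequence[int],
                            min_slope: float = 100.0) -> bool:
    """Flag conversations where prompt tokens climb steadily turn over
    turn, suggesting context is re-processed rather than compressed."""
    n = len(per_turn_tokens)
    if n < 3:
        return False  # too few turns to call it a trend
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(per_turn_tokens) / n
    # Least-squares slope: tokens added per additional turn
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, per_turn_tokens)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope >= min_slope
```

A healthy agent's token counts hover around a plateau; a re-processing agent's counts climb by roughly a full context's worth each turn.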
Evaluating Agent Quality in Production: Black-Box, Trajectory, and Step-Level Strategies
Most teams treat evaluation as a pre-deployment concern. Production is where it actually matters — because real user inputs expose failure modes that test suites never anticipated.
Three strategies apply at different layers of agent behavior:
Black-box scoring evaluates only the final output. Did the answer match the user’s intent? Was it factually grounded? This is the easiest to implement — typically an LLM-as-judge call — and catches the most obvious failures. Attach a quality score to each trace via Langfuse’s scoring API:
langfuse.score(
    trace_id=trace.id,
    name="output-quality",
    value=0.85,
    comment="Accurate but omitted key date context"
)
Trajectory-based evaluation asks whether the agent took a reasonable path to its answer. An agent that arrived at the correct answer via seven redundant tool calls is still broken — it just recovered. Trajectory evals score the sequence of tool selections against an expected path, which requires your span data to be complete and consistently structured.
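A minimal trajectory scorer might compare the tool names recorded in your spans against an expected path. This sketch separates two signals: did the agent hit the expected steps in order (coverage), and how much redundant work did it do (efficiency)? Both names and scoring rules are illustrative:

```python
from typing import List, Tuple

def trajectory_scores(expected: List[str],
                      actual: List[str]) -> Tuple[float, float]:
    """Score an agent's tool-call sequence against an expected path.

    coverage: fraction of expected steps matched in order.
    efficiency: 1.0 when no redundant calls were made, lower otherwise.
    """
    i = 0
    for step in actual:
        if i < len(expected) and step == expected[i]:
            i += 1  # matched the next expected step
    coverage = i / len(expected)
    efficiency = min(1.0, len(expected) / len(actual)) if actual else 0.0
    return coverage, efficiency

# The "recovered but broken" case: right answer, redundant path
cov, eff = trajectory_scores(
    expected=["search", "retrieve", "synthesize"],
    actual=["search", "search", "retrieve", "search", "synthesize"],
)
# cov is 1.0 (all expected steps happened, in order)
# eff is 0.6 (five calls where three would have done)
```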
Step-level analysis zooms into individual spans to evaluate specific decisions: Was the right tool called for this subtask? Did the retrieval step return relevant chunks? This is the most granular — and most powerful for diagnosing systematic tool selection errors across thousands of runs. When you see span-level LLM evaluation in production, the added signal compared to end-to-end evals alone is substantial.
In practice: start with black-box scoring on 100% of traces using an automated LLM judge. Apply trajectory evaluation on any trace where the quality score drops below your threshold. Reserve step-level analysis for postmortems on high-value sessions that went wrong.
Building a Production Monitoring Dashboard: The Three-Layer Model That Actually Works
A dashboard with 30 metrics tells you nothing during an incident. Structure your monitoring in three focused layers, each answering a different question:
Layer 1: Business KPIs
– Task completion rate (did the agent accomplish what the user asked?)
– User satisfaction signal (thumbs up/down or implicit signals)
– Session success rate by agent type and version
Layer 2: System Health
– P50/P95/P99 latency per agent type
– Error rate with categorization (tool failure, model timeout, context overflow)
– Hallucination rate via automated LLM judge scoring
– Tool selection accuracy for agents with defined expected execution paths
Layer 3: Cost and Resources
– Cost per trace by agent type and model
– Token utilization trend over time
– High-cost trace frequency and clustering by input pattern
The alert strategy matters as much as the metrics themselves. According to LogicMonitor’s 2026 Observability and AI Trends report, 36% of organizations are already dealing with alert fatigue. For AI agents specifically, failure modes are too nuanced for simple threshold alerts across every metric.
Page immediately on:
– Error rate exceeding 5% over any 5-minute window
– Any trace cost exceeding 10× the agent’s median (indicates a loop)
– Model timeout rate above 1%
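The 10× median cost check can be sketched as a small guard you run against recent trace costs pulled from your backend (the multiplier and window are assumptions to tune):

```python
import statistics
from typing import Sequence

def is_cost_anomaly(trace_cost: float,
                    recent_costs: Sequence[float],
                    multiplier: float = 10.0) -> bool:
    """Page when a single trace costs more than `multiplier` times the
    median of recent traces -- the signature of a runaway agent loop."""
    if not recent_costs:
        return False  # no baseline yet
    return trace_cost > multiplier * statistics.median(recent_costs)
```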
Review daily:
– Quality score trend (are scores drifting down week-over-week?)
– Tool selection accuracy by agent version after prompt or model changes
– P95 latency shifts following deployments
Langfuse scoring hooks let you trigger external alerts based on quality score thresholds — not just error rates. This is the critical architectural difference from traditional APM: you can page on “quality degraded” before an explicit error ever surfaces. Gartner predicts LLM observability investments will cover 50% of GenAI deployments by 2028, up from just 15% today — teams building this alerting layer now will be ahead of that curve, not scrambling to catch up.
Start Seeing What Your Agents Are Actually Doing
AI agent observability is the feedback loop that makes optimization possible. Without trace data, every prompt change is a guess. With it, you can see exactly which tool calls failed, which retrieval steps returned garbage, and which agent runs cost 10× more than they should.
The stack here — Langfuse for trace storage and analysis, OpenTelemetry for vendor-neutral instrumentation, structured span nesting for multi-step agents — gives you everything you need without locking into a proprietary format. The LLM observability market is moving fast; your instrumentation schema shouldn’t require a rewrite every time a better backend appears.
Start with the OpenAI drop-in wrapper for zero-effort baseline tracing. Add session IDs and user metadata immediately — the multi-turn correlation alone is worth the two extra lines. Then layer in cost tracking and quality scoring as your agents handle more production volume. The signal is there. You just need to instrument for it.
If you’re running LangGraph, check the Langfuse LangGraph integration in their docs — it handles span nesting automatically via callbacks with no manual span management required.