OpenTelemetry LLM Observability for AI Agents

Pick any LLM observability tool today and you’re making a bet — on a vendor, a data format, and an API that may not exist in two years. Langfuse, LangSmith, Helicone: each has its own SDK, its own trace format, and no path to correlate your AI agent traces with the rest of your infrastructure. When you switch providers or scale to multi-agent systems, you’re starting over.

OpenTelemetry LLM observability changes that equation. The OpenTelemetry GenAI Semantic Conventions define a vendor-neutral standard for tracing LLM calls, tracking token usage, and observing AI agent workflows — one that ships to Datadog, Grafana, Uptrace, or any OTLP-compatible backend without a single code change. This guide covers the full instrumentation strategy: which attributes to capture, where to put prompt content, how to model agent workflows as span hierarchies, and how to survive the experimental-to-stable migration.

Why Proprietary LLM Observability SDKs Are a Trap

Every major LLM observability vendor ships an SDK purpose-built to keep you in their ecosystem. The Langfuse observability SDK wraps your LLM calls in its own client. LangSmith patches your LangChain callbacks. Datadog’s ddtrace integration emits Datadog-specific attributes. None of these talk to each other, and none of them integrate naturally with your existing distributed traces.

The practical consequences stack up fast. There’s no correlation between AI agent spans and the upstream HTTP requests or database queries that triggered them. Multiple LLM providers mean duplicate instrumentation — each one needs its own wrapper. And every time you evaluate a new observability backend, you pay migration cost on instrumentation code that shouldn’t have been vendor-specific in the first place.

The deeper problem is data portability. If your prompt content, token counts, and model metadata live in a proprietary schema, you can’t query them with standard tooling or move them to a cheaper backend without an ETL job.

OpenTelemetry solves this at the specification layer. By adopting gen_ai.* span attributes as part of a vendor-agnostic AI architecture, your telemetry becomes a first-class OTLP payload that any compliant backend can ingest, index, and query — today and five years from now.

The OpenTelemetry GenAI Semantic Conventions at a Glance — Spans, Metrics, and Events

The OpenTelemetry GenAI SIG launched in April 2024 with a focused mandate: define a stable, provider-neutral telemetry model for generative AI workloads. As of early 2026, most conventions remain in experimental status — but that shouldn’t stop production adoption (more on that in the last section).

The conventions cover three telemetry signals:

  • Traces (spans): The primary signal. Each LLM call, tool invocation, embedding request, or agent creation becomes a span with structured gen_ai.* attributes. This is where latency, model metadata, and token counts live.
  • Metrics: Pre-defined metric names for token usage rates, request duration histograms, and error rates. Useful for dashboards and alerting without writing custom aggregation logic.
  • Events: The right place for prompt content and completions. Events are log records attached to a span — filterable, droppable, and not indexed by default.

The spec currently defines technology-specific conventions for four providers: Anthropic, Azure AI Inference, AWS Bedrock, and OpenAI. The base gen_ai.* attributes work across all of them. Provider-specific extensions handle things like Anthropic’s thinking tokens or Azure’s deployment names.

Core gen_ai.* Span Attributes You Must Capture on Every LLM Call

Every LLM call span should carry a consistent set of attributes. These are the ones that matter for routing, debugging, and cost attribution:

  • gen_ai.operation.name (string) — "chat", "text_completion", "embedding", "create_agent"
  • gen_ai.system (string) — the provider: "openai", "anthropic", "aws_bedrock"
  • gen_ai.request.model (string) — the model you asked for
  • gen_ai.response.model (string) — the model that actually responded (can differ)
  • gen_ai.usage.input_tokens (int) — total input tokens billed
  • gen_ai.usage.output_tokens (int) — output tokens generated
  • gen_ai.response.finish_reasons (string[]) — ["stop"], ["tool_calls"], ["max_tokens"]

gen_ai.response.finish_reasons is easy to overlook but invaluable in production. A high rate of max_tokens finishes means your context window budgeting is off. tool_calls finishes let you track how often your agent is invoking tools versus producing final responses.

For request configuration, capture gen_ai.request.max_tokens, gen_ai.request.temperature, and gen_ai.request.top_p when they differ from defaults. These matter for reproducibility — when a completion degrades in quality, you want to know whether someone changed the temperature.

The auto-instrumentation packages for OpenAI (opentelemetry-instrumentation-openai) and Anthropic handle most of these attributes automatically. For custom providers or internal model APIs, you’ll set them manually via the OTel SDK.

Prompts and Completions Belong in Events, Not Attributes — Here’s Why

This is the most commonly misunderstood part of GenAI instrumentation, and getting it wrong has real consequences.

Span attributes are indexed. Every backend — Datadog, Jaeger, Tempo — stores them as structured metadata you can filter and search on. That’s exactly what you want for model names, token counts, and finish reasons. Not for the full text of a user’s prompt.

If you store prompt content in a span attribute, you’ve just shipped PII into your indexed trace store. You’ve also hit a practical ceiling: span attributes are subject to SDK limits (OpenTelemetry’s default cap is 128 attributes per span, and SDKs and backends routinely truncate long values), and a truncated prompt is worse than no prompt at all.

The solution is span events. An event is a timestamped log record attached to a span. You emit two events per LLM call:

Event: gen_ai.content.prompt
  Attributes:
    gen_ai.prompt: "You are a helpful assistant. User: explain quantum entanglement simply."

Event: gen_ai.content.completion
  Attributes:
    gen_ai.completion: "Quantum entanglement is when two particles..."

Events can be selectively dropped at the OTel Collector before they ever reach your backend. A single Collector processor rule targeting event names gen_ai.content.* lets you strip prompt/completion events in production while keeping all span attributes for cost and latency analysis. Your security team gets PII containment; your engineering team keeps full traces in staging. The only thing that changes per environment is the Collector configuration — not your application code.
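The dropping rule itself can be sketched with the Collector's filter processor — this uses OTTL condition syntax on the span-event context; verify the exact escaping and processor options against your Collector version:

```yaml
processors:
  # Drop prompt/completion events before export; span attributes survive.
  filter/drop-genai-content:
    error_mode: ignore
    traces:
      spanevent:
        - 'IsMatch(name, "gen_ai\\.content\\..*")'
```

Include this processor in the production pipeline and omit it in staging, and the same application build produces PII-free traces in one environment and full-content traces in the other.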

Token Tracking Done Right: Input, Output, Cached, and Cache-Creation Tokens

Most instrumentation guides treat token tracking as two numbers: input and output. For providers with prompt caching — especially Anthropic — that model undercounts real costs and makes cache effectiveness invisible.

The correct attributes when using Anthropic’s prompt caching:

  • gen_ai.usage.input_tokens — total input tokens, including those served from cache
  • gen_ai.usage.cache_read.input_tokens — tokens read from an existing cache entry (billed at ~10% of base rate)
  • gen_ai.usage.cache_creation.input_tokens — tokens written to cache on this request (billed at ~125% of base rate)
  • gen_ai.usage.output_tokens — generated output tokens

The key detail: input_tokens is a superset. It should include cache_read.input_tokens — don’t subtract cached tokens from the total. This matches what Anthropic’s API returns in its usage object and what you need for accurate billing attribution across requests.

Without these attributes, a team using a 50k-token system prompt with caching enabled will see input token counts that look expensive on every call, without realizing that 90% of those tokens are being served from cache at a fraction of the cost. Accurate cost attribution requires all four fields.
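As a sanity check on why all four fields matter, here is a rough cost calculation following the superset semantics described above. The per-million-token rates are illustrative placeholders, not real pricing — substitute your provider's current rates:

```python
# Hypothetical USD per million tokens; cache reads billed at ~10% of the
# base input rate, cache writes at ~125%, per the multipliers above.
RATES = {"input": 3.00, "cache_read": 0.30, "cache_creation": 3.75, "output": 15.00}

def request_cost(usage: dict) -> float:
    """Compute request cost from the four gen_ai.usage.* attributes."""
    cache_read = usage.get("gen_ai.usage.cache_read.input_tokens", 0)
    cache_creation = usage.get("gen_ai.usage.cache_creation.input_tokens", 0)
    # input_tokens includes cache reads, so only the uncached remainder
    # is billed at the base input rate.
    uncached_input = usage["gen_ai.usage.input_tokens"] - cache_read
    output = usage["gen_ai.usage.output_tokens"]
    return (uncached_input * RATES["input"]
            + cache_read * RATES["cache_read"]
            + cache_creation * RATES["cache_creation"]
            + output * RATES["output"]) / 1_000_000

# A 50k-token system prompt where 45k tokens are served from cache
usage = {
    "gen_ai.usage.input_tokens": 50_000,
    "gen_ai.usage.cache_read.input_tokens": 45_000,
    "gen_ai.usage.cache_creation.input_tokens": 0,
    "gen_ai.usage.output_tokens": 1_000,
}
```

At these rates the cached request costs roughly a quarter of what the raw input_tokens count suggests — exactly the gap that two-field token tracking hides.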

Instrumenting AI Agent Workflows: Building a Parent-Child Span Hierarchy

Single LLM calls are straightforward. Multi-agent systems are where OpenTelemetry’s trace model genuinely earns its keep.

The mental model: every agent execution is a root span with gen_ai.operation.name = "create_agent". Every tool call, LLM invocation, retrieval step, or sub-agent call it initiates becomes a child span nested under it via the standard OTel parent-child relationship. The trace view in any backend shows you the full execution tree — which tools were called, in what order, how long each took, and where it failed.

The AI Agent Semantic Conventions define six core concepts for structuring these hierarchies: Tasks, Actions, Agents, Teams, Artifacts, and Memory — based on Google’s AI agent white paper. This vocabulary applies whether you’re running a single ReAct agent or a multi-agent team where agents spawn sub-agents. Current coverage across AI agent orchestration frameworks includes CrewAI, AutoGen, LangGraph, IBM Bee Stack, and Semantic Kernel.

For custom agent loops, the pattern looks like this:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent.run", attributes={
    "gen_ai.operation.name": "create_agent",
    "gen_ai.agent.name": "research_agent",
}) as root_span:
    # Each tool call or LLM invocation started inside this block becomes
    # a child span automatically via OTel context propagation
    result = await agent.run(task)

Context propagation handles the parent-child wiring. You don’t need to pass span references manually — the OTel context manager takes care of it.

Handling Streaming Responses Without Losing Telemetry

Streaming is where naive instrumentation breaks. If you open a span, start the stream, and close it when the first chunk arrives, you’ve captured essentially nothing useful.

The correct pattern has three phases:

  1. Open the span before the stream starts. Set gen_ai.operation.name, model attributes, and request parameters immediately. The span is now tracking latency from the first token request — giving you accurate time-to-first-token measurement.

  2. Accumulate during the stream. As chunks arrive, accumulate the full completion text and track token deltas if the API streams them. Don’t emit the completion event yet.

  3. Emit the completion event and end the span after the final chunk. The gen_ai.content.completion event gets the full accumulated text. Token usage attributes get set from the final usage object that most APIs include in their last chunk.

This pattern gives you accurate end-to-end latency, full completion content, and correct token counts — all in a single well-formed span. Missing any step typically results in either zero-duration spans or missing token data that makes cost attribution unreliable.

Shipping to Any Backend — Datadog, Grafana, Uptrace, and Beyond via OTLP

This is the payoff for the vendor-neutral approach. Once your application emits gen_ai.* spans over OTLP, you point the OTel Collector at any backend with zero application code changes.

Datadog natively supports the GenAI Semantic Conventions as of OTel semantic conventions v1.37, automatically surfacing gen_ai.* spans in its LLM Observability product without custom attribute mappings. Grafana’s Tempo + Loki stack ingests OTLP traces directly, letting you build GenAI-specific dashboards on the gen_ai.* attribute namespace with standard TraceQL and LogQL queries. Tracing overhead is negligible — typically under 1% of application performance thanks to async batch processing — and disappears entirely against multi-second LLM call latencies.

The Collector configuration is the only thing that changes per environment:

exporters:
  otlp/datadog:
    endpoint: "https://trace.agent.datadoghq.com"
  otlp/grafana:
    endpoint: "https://tempo.your-grafana-cloud.io"
  otlp/uptrace:
    endpoint: "https://otlp.uptrace.dev"

Your instrumentation code is identical in every environment. Switching backends is an infrastructure change, not a code change.
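Selecting a backend then comes down to which exporter the trace pipeline references — a sketch, with receiver and processor names assumed from a typical Collector deployment:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Swap to otlp/grafana or otlp/uptrace per environment
      exporters: [otlp/datadog]
```

Promoting a new backend is a one-line change in this file, deployed with the Collector rather than with your application.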

Surviving the Experimental-to-Stable Migration with OTEL_SEMCONV_STABILITY_OPT_IN

The GenAI conventions are experimental. That word stops some teams cold, but it shouldn’t block production adoption if you manage the transition deliberately.

The OTEL_SEMCONV_STABILITY_OPT_IN environment variable controls how auto-instrumentation packages emit attributes during the transition period. The values you need to know:

  • (unset) — Emit only legacy attribute names. Safe, but won’t produce gen_ai.* attributes in newer instrumentation packages.
  • gen_ai — Emit only the new gen_ai.* attribute names. Use this once your backend is confirmed compatible.
  • gen_ai/dup — Dual-emit both old and new attribute names. This is the recommended production migration path.

Dual-emit means your existing dashboards and alerts built on old attribute names keep working while you build new ones targeting gen_ai.*. Once you’ve migrated all your queries, flip from gen_ai/dup to gen_ai and stop the duplication overhead.

One practical callout: experimental status means attribute names can change before stabilization. Pin your instrumentation package versions and subscribe to the GenAI SIG changelog. Surprise attribute renames will silently break dashboards if you’re pulling the latest package on every deploy.


OpenTelemetry LLM observability gives you something no proprietary SDK can: a single instrumentation contract that works across every LLM provider, every agent framework, and every observability backend — now and as the ecosystem evolves. The gen_ai.* span attributes are stable enough for production use today, and the experimental-to-stable transition is manageable with OTEL_SEMCONV_STABILITY_OPT_IN.

The right starting point is the auto-instrumentation package for your primary LLM provider. Get spans flowing into your OTel Collector first. Then layer in events for prompt content, add the full token tracking attributes for cached calls, and model your agent workflows as parent-child span hierarchies.

Your observability infrastructure will thank you the day you swap backends or add a second LLM provider — and change exactly zero lines of instrumentation code. Start with the GenAI semantic conventions spec and pick an auto-instrumentation package that matches your stack.
