What is agent observability?

Agent observability is the instrumentation of an agent's run as external telemetry, the traces and spans of its tool calls, decisions, state changes, costs, and outcomes, so the run can be reconstructed and debugged after the fact rather than taken on the model's own account of what it did. A span is one recorded step; a trace is the linked set of spans that reconstructs a whole run.

How is observability different from evaluation?

Observability tells you what the agent did; evaluation tells you whether what it did was correct. A trace can look clean and still describe a wrong run, because the trace records execution, not quality. They are complementary instruments: observability is the contemporaneous record that makes a failure locatable, and evaluation is the scoring that judges the run against criteria. You need both.

What should I instrument on a production agent?

The decisions that matter: tool selection, state and memory changes, hand-offs between subagents, retries, and token or cost per step, plus the structural outcome of each step. Capture the structure by default and the full content of what the agent read and wrote only by deliberate opt-in, because content carries privacy and cost weight. Instrument selectively; a trace nobody can navigate is as unhelpful as no trace at all.

Do I have to build observability from scratch for Claude agents?

No. The Claude Code CLI has OpenTelemetry instrumentation built in, and the Claude Agent SDK exports the same traces, metrics, and log events to any backend that accepts the OpenTelemetry Protocol. The work that remains is deciding what to capture, classifying your sources, and reconciling self-report against the observed record, not standing up the plumbing.

Cornerstone Guide

Agent Observability in Production: The Trace Is the Evidence

Why a production agent's run has to be captured as external telemetry, what a trace actually records, how you read one to locate a failure, and where observability stops and evaluation, provenance, and governance begin.

Last reviewed June 16, 2026

Agent observability Agent evaluation Agentic pipeline provenance Verification loop Claude Agent SDK Agentic cost control Subagent orchestration

A production agent that tells you it finished is making a claim. The trace of what it actually did is the evidence, and the two are not the same thing. A model does not have privileged access to its own run; its account of what it did is a reconstruction after the fact, not a readout of execution. I learned to hold those apart the hard way, and I wrote up the lived version of it, the operator's view of watching agents you cannot fully see, in a companion post on instrumenting observability when self-report is not enough. This guide is the layer underneath that post: the operating model for capturing a run as external telemetry, reading it, and knowing where it stops.

The discipline has a name and a definition I keep returning to. Agent observability is the instrumentation of an agent's run as external telemetry so the run can be reconstructed and debugged rather than taken on the model's word. The point isn't the volume of data collected. The point is reconstruction: when a run goes wrong and the relevant steps were captured, the trace narrows the search to where, instead of re-running it and hoping the failure repeats.

What does an agent trace actually record?

What a trace records is what makes it evidence rather than another log dump, so it is worth being precise about its parts. A trace is built from spans. A span is one recorded step: a model request, a tool call, or a subagent dispatch. A trace is the linked set of spans that reconstructs the whole run. Those spans nest, so a delegation chain that fans out across workers reads back as one connected record rather than a scatter of disconnected calls. That nesting is what makes subagent orchestration debuggable at all: a hand-off you cannot see in the output becomes a child span you can. That visibility holds only if the spans are emitted and linked correctly, so the telemetry earns the same scrutiny I apply to the agent: dropped-span and exporter-error checks, and a synthetic run that proves the parent-child links survive the runtime path.

The vocabulary for this is standardizing. The OpenTelemetry semantic conventions for generative-AI agents name the span operations an agent emits, an agent invocation (invoke_agent) and a tool execution (execute_tool) among them, and fix the attributes that hang off each one. Those attributes include the model identifier and the input and output token counts, registered in a shared attribute registry. Fixing those names in an open standard is what gives the telemetry a path to portability across backends instead of staying bespoke to one tool.

Claude is the concrete example I reach for, because the instrumentation is already built in. The Claude Code CLI emits a span hierarchy documented in its monitoring reference. An interaction span wraps a single turn of the agent loop. A model-request span wraps each call to the API, carrying the model, the latency, and the token counts as attributes. A tool span wraps each tool invocation, with child spans for the wait on a human approval and for the execution itself. When the agent spawns a subagent, that subagent's spans nest under the parent's tool span, so the full delegation chain appears as one trace. The Claude Agent SDK exports the same data, because it runs the same CLI underneath and ships the telemetry to any backend that accepts the OpenTelemetry Protocol.

The most important design decision in that surface is what it does not record. The structure of the run is captured by default; the content is not. The Agent SDK documentation states it plainly:

The content your agent reads and writes is not recorded by default. Anthropic, "Observability with OpenTelemetry"

Every span carries its type and timing; the model name, latency, and token counts ride on the model-request spans (token counts only when the API returns usage data, so a failed or aborted request may omit them), and the tool name rides on the tool spans. The full prompts, tool arguments, and tool results are a separate, explicit opt-in. That split isn't an oversight; it's the right default, because content carries privacy and cost weight that the structural telemetry does not. Decide deliberately what to record.

How do you read a trace to locate a failure?

The reason the trace matters is that an agent fails in ways the final answer hides. It can loop, call the wrong tool, or return a confident but wrong result, and none of that surfaces from the output. Traditional application monitoring was built for a different failure shape, so it catches the infrastructure faults and misses the agent-specific layer of tool choice, interpretation, and hand-off. As the Arize team puts it, traces are the source of truth for what an agentic system actually does, as opposed to what the code says it should do, and you cannot fix these failures with standard logs because the error lives in the reasoning, not in the code (their 2026 survey of agent observability tooling lays out the categories).

The most dangerous case is the silent tool failure. The Latitude team, in a guide to debugging agents in production, names it precisely: a tool returns a valid response that the agent misinterprets, corrupting all downstream reasoning without triggering any error. Every individual step looks correct in isolation. Only the multi-turn trace reconstruction reveals that the run went wrong at step three and recovered in a way that happened to produce a plausible answer. A glance at the output would have shipped that run.

Reading a trace well means knowing which layers to read. The AgentTrace framework, in a February 2026 paper, separates an agent's record into three surfaces: the operational surface of method calls, arguments, and timing; the cognitive surface of the model interactions and reasoning; and the contextual surface of outbound calls to APIs, databases, and stores. A failure usually lives in the seam between two of them, which is why a flat log loses it and a structured trace keeps it. The diagnostic question is not "did it fail" but "why," and that distinction is the whole job. The AWS team's work on failure detection with Strands draws the line cleanly: knowing that a run failed is only the beginning, and the harder question, the one a trace lets you investigate where a score cannot, is why.

That gives the discipline a repeatable shape rather than a knack. Reading a trace to locate a failure runs in roughly this order: reconstruct the span tree for the run, find the spans that look abnormal in timing or status, compare each suspect tool call's arguments against the result it returned, inspect any state or memory the run mutated, then follow the retries and the hand-offs between workers back to the first step that should have gone differently. The output is a classification, not just a culprit: whether the fix belongs in instrumentation, in the evaluation that should have scored the run, in the reliability check that should have blocked it, or in the governance that should have bounded it. A trace gives you the evidence to run that loop; it does not, on its own, prove the cause.

This is also where observability connects to the rest of the spine. The runtime evidence trail a trace gives you is the live half of a verification loop: the record you check the agent's self-report against, turn by turn. The durable move is to carry workload-specific progress invariants in the trace, the check that a run is advancing its task rather than just emitting output: a conversation-advance check for an elicitation flow, a retrieval-grounding check for a retrieval agent, a hand-off-completion check for orchestrated workers. On this site's own assessment agent, that invariant is what flags a turn that answered the user but skipped the next question it was supposed to ask. The stall reads as fine in a single response and only shows up across the trace. And the check I run against the model's own narration, counting tool-use blocks against tool-result blocks in the session transcript and alerting on a mismatch, is what surfaced Opus 4.8 misreporting the state of its own tools when the output gave no sign of trouble.

What do you instrument, and what do you leave out?

The instinct, once you have the plumbing, is to capture everything. That instinct is the trap. A trace store can be technically populated and operationally useless, and over-capture degrades the signal in two directions at once: it buries the decisions that matter under noise, and it drives a cost that pushes teams to sample away the very failures they needed to see.

The discipline is selective. Instrument the decisions that matter, tool selection, state and memory changes, hand-offs between workers, retries, and the token or cost of each step, rather than every event the runtime can emit. Where the content goes matters too. The Uptrace team's treatment of OpenTelemetry for AI systems makes the structural point: full prompt text belongs in span events, not span attributes, because attributes are always indexed and exposing large content there carries both a size and a privacy cost. There is also a quality cost to over-capture that is easy to miss. A Stanford group's analysis of agent trace format reports, drawing on prior studies, that formatting identical content differently can swing downstream performance by up to 40 percentage points, and that 40 to 60 percent of the tokens in a typical agent trajectory are redundant or expired. A trace that is too wide isn't just expensive; it is harder to navigate, riskier to expose, and less useful to the evaluators and replay tools that read it later.

Cost is one of the signals worth instrumenting precisely, because it is also a failure mode. Claude's cost and token metrics carry the model, the query source (main or subagent), and the skill that incurred the spend, which turns a single bill into a per-workload picture. The honest caveat is in the docs themselves: those cost figures are client-side estimates, computed locally from a price table, not authoritative billing data. Reading cost as a trace signal is how you catch a workload that quietly got expensive; this is the runtime face of agentic cost control, and the broader economics are their own subject in the guide on cost-aware agentic design. Not every agent warrants the full apparatus either. A single-turn call with no tool use does not need a full trace hierarchy, though it still earns request-level telemetry: prompt and model version, latency, tokens, and a correlation id. The discipline is matching the instrumentation to the failure modes the agent can produce.

That selectivity has a default shape worth making explicit. A production telemetry contract sorts what to capture into tiers, captured by default or promoted deliberately:

Tier	What it carries	Capture policy
Structure	span type, status, timing, model, token counts, tool name, retries, hand-offs	on by default
Identity	run id, session id, parent and child span links, query source	on by default
Workload invariant	the progress check this agent can fail (conversation-advance, retrieval-grounding, hand-off-completion)	per workload
Content	prompts, tool arguments, tool results, raw bodies	opt-in, scoped, with retention and access controls

The first three tiers are cheap, structural, and safe to leave on. The fourth is where privacy and cost live, so it is promoted for a scoped reason, an incident, an eval failure, a compliance need, rather than captured by default.

Where does observability end?

Observability is necessary but not sufficient, and the honest version of this guide says so plainly. A clean, normal-looking trace can still describe a wrong run, because the trace records what happened, not whether it was good. The clearest statement of the boundary I have read is from the Fiddler team's guide to what OpenTelemetry covers and where it stops:

OpenTelemetry captures what happened. It does not assess whether what happened was good. Fiddler, "OpenTelemetry for AI Observability"

That gap is not theoretical, it is the default state of production teams. In LangChain's survey of 1,340 practitioners, 89 percent reported having observability in place, 52.4 percent ran offline evaluations, and only 37.3 percent ran online evaluations: most teams have built the record, about half score it offline against cases, and far fewer close the loop on the runs actually serving users. The fix is not more tracing. It is pairing the trace with agent evaluation, which scores the run against criteria. As LangChain frames it in a piece on how observability powers evaluation, offline evaluation is necessary but not sufficient, and to evaluate behavior you have to evaluate your observability data; the trace is the input to the eval, not a substitute for it.

Observability is also distinct from provenance, and the distinction is worth drawing because they are easily conflated. A recent survey of evidence tracing and execution provenance puts the line this way: traces record what happened, while provenance explains how execution artifacts are connected. Observability is the live record of the run; agentic pipeline provenance is the stamp on a finished artifact saying which model, which skill, and which validators produced it. The same trace data often feeds both, but they answer different questions and run at different times. The four disciplines line up like this:

Discipline	The question it answers	When it runs
Observability	What did the agent actually do?	Live, during and after the run
Reliability	Does the output hold under a check?	The verification loop, every run
Evaluation	Was the run correct, scored against criteria?	Before and after a change, on cases
Provenance	How was this finished artifact produced?	Stamped on the artifact

Governance sits above all four. The accountability layer, who owns a failing agent, what policy bounds it, and what audit trail proves compliance, depends on the trace but is not the trace; I treat it as its own subject in the guide on agentic AI governance in production. And reliability, the argument for why the record has to keep running rather than run once, is the companion guide on treating agent reliability as a loop.

One caveat belongs on the standards themselves. The OpenTelemetry generative-AI conventions are still, in the project's own status vocabulary, in development, and the project is explicit that while signals are in development, breaking changes may occur. The attribute vocabulary has shifted across recent releases and was migrated to a dedicated repository. Use the conventions where they fit, and treat instrumentation written against today's draft as provisional. The principle underneath, capture what the agent did from outside the agent, outlasts any particular attribute name.

How does this page stay current?

This cornerstone is the deep companion to the agent observability glossary entry and a peer of agent reliability in production and agentic AI governance in production on the same site. Its primary artifact is a first-party operational record, the telemetry I run on this site's own production agent and authoring pipeline, that lives next to the body and is updated as the practice evolves. The Sources roster tracks the freshness of each external anchor under the three-month AI/SaaS cap and the six-month tool-capability cap that govern this site's authority pages; a row past its cap is held only when a sourced search trail documents what was looked at and why nothing fresher qualified.

When I scope a consultation, workshop, or implementation engagement around running agents in production, the first artifact I build with a team is this layer: the trace, the telemetry contract, and the workload invariants that make their agents' behavior legible on their own stack. The discipline is not a dashboard you buy. It is the decision about what evidence you keep before you need it, made deliberately rather than discovered during the incident that needed it.

What does an agent trace actually record?

How do you read a trace to locate a failure?

What do you instrument, and what do you leave out?

Where does observability end?

How does this page stay current?

Sources

Related guides