An agent telling you it finished is not the same as knowing what it did. I learned to hold those apart the hard way, running eight to twelve Claude Code sessions at once across terminal tabs and trusting the wrong signals about which ones were actually working. The gap between "the agent says done" and "the work happened" is where an unsupervised agent quietly costs you.
Anthropic has been closing the visible-surface part of that gap fast. Agent View shipped this spring, and Claude Code's OpenTelemetry support has matured over the past year into metrics, events, and traces. Together they cover more than most write-ups have caught up to. So this is not a "nothing exists, build it yourself" post. First, the one principle that holds no matter which dashboard you use. Then an honest map of what the native surfaces now give you, the narrow slice they still leave, and what building a small instrument for that slice taught me about observing agents you cannot watch.
Self-report is not observability
Start with the part that does not change when a vendor ships a new dashboard. A model does not have privileged access to what it just did. Its account of its own tool use is a reconstruction after the fact, not a readout of execution. I watched this concretely when Opus 4.8 began misreporting the state of its own tools, confidently, in a way the previous model never did. The research community has been blunt about the general case: there is good reason to be skeptical that language models can introspect on their own internal states at all. A "done" from an agent is a claim, not evidence. The same structural blindness is why a transcript-scoped model reviewer cannot be your gate: it shares the generator's blind spots.
The distinction worth keeping is narrow and exact. The unreliable thing is post-hoc model narration. It is not all agent-emitted signal: a structured usage block, a tool-call log, a stop reason, a process-liveness check are caller-side measurements, and those are exactly the external evidence worth trusting. The exception proves the rule. Feed an execution ledger back into the model's context and its summary can be sound, because it is reading telemetry rather than introspecting. It is the unaudited natural-language "done" you distrust, not every agent-produced summary.
What the agent's self-report tells you
- A natural-language summary of what it thinks it did
- A 'done' that may or may not match execution
- Reconstructed, not observed; confidently wrong is possible
- No independent check on tool calls or cost
What you can only see from outside
- Token usage and cost, per turn
- Tool calls made vs tool results returned
- Stop reason, latency, which fallback fired
- Whether the process is even still alive
So the floor is this. Observability is what you observe from outside the agent. Self-report is what the agent tells you. When they agree, fine. When they diverge, prefer the independently collected record, then reconcile its own failure modes; an external signal is higher-confidence evidence, not magic ground truth. Everything below is about getting that external view, and it starts by being honest about how much of it you no longer have to build.
What the Claude stack actually gives you now
If you have not looked in a few weeks, the native surface is wider than the busy/idle status the raw session registry on disk exposes. There are four real surfaces today, and a fair reading of them is the fastest way to find the slice that is genuinely still yours to build.
The honest read: the dashboards are good and getting better, and for backgrounded work or a team that already runs a telemetry backend, configuring Agent View and CLI OpenTelemetry is the right first move, not building anything. The remaining gap is a specific operator UX, not a void. Naming it precisely is what keeps the rest of this post from overclaiming.
The slice that is left, and the glance I built for it
So I built a small instrument for exactly that slice, and the lesson is in how little it took. I wanted one thing: a glance, from my Mac menu bar, that told me the state of every attached session without my switching to it or detaching it into Agent View. The result shows one indicator per session, colored by three operator states, with a sound when a session changes to needing me, and a click that jumps straight to that session's terminal tab.
The mechanism is the interesting part, and it is where I had to be careful, because the obvious signal is the wrong one. There is no API that hands you working, blocked, and responded for an attached session. What there is, instead, is the Claude Code hook event stream, a set of lifecycle callbacks. The trap: PreToolUse fires before every tool call, including auto-approved ones, so it is not by itself a "waiting on you" signal. The mapping is small: PermissionRequest or an idle-prompt Notification means blocked and waiting on you, Stop means the turn finished, and active tool work means working. Fold those into a small state machine and you reconstruct the three states from the outside. Liveness comes from a separate channel, a process check, because the reported status will lie about a session that died.
Call it what it is: a heuristic, not a guarantee. Hooks can be missed, reordered, or renamed across Claude Code updates; a session can hang inside a long tool call with no event at all. So the machine needs an unknown state, a heartbeat timeout that flips a silent session to stale, and a regression test against real transcripts every time the tool updates. That maintenance is not a footnote. A bespoke instrument you do not keep current is worse than the native view you skipped, which is the honest case for configuring Agent View or CLI telemetry first and building only the thin adapter the native surfaces do not cover.
The transferable move is smaller than a menu-bar app. Find the event stream your runtime already emits, decide which combination of events means the state you care about, compute that state outside the agent, and treat the mapping as a heuristic you maintain.
When you can't watch a screen, own the telemetry contract
The glance works because I am at the machine, and to be exact about my own definition it is an attention layer, not observability by itself: it routes me to the session that needs me, while the durable, queryable record lives elsewhere. The harder case is the agent I cannot watch: a production Claude agent serving users, with no tab to look at and no human in the loop by design. There, the record has to carry the whole weight.
I run one. The AI Readiness Assessment on this site is a roughly twenty-turn agent that routes across models on a Cloudflare Worker. It reports on itself from the outside: every turn writes a telemetry row with token counts, duration, the effort level used, and which routing path produced the response (primary model, fallback model, or error path), including whether a fallback fired and why. An operator kill switch for the most expensive model propagates in about a minute. None of that is the agent telling me it is fine. It is the loop writing down what happened, turn by turn.
That telemetry earned its place by catching a behavior I would not have seen from the output alone: the agent answering a turn without asking the next question, quietly stalling a conversation that is supposed to be eliciting input. The check that catches it is a cheap heuristic, not a model judge: during an elicitation phase, a turn whose response carries no question back to the user gets flagged. From a single response it looks fine. Across the per-turn record it is what it is, a turn that did not advance the task.
Here is the part the "build it yourself" framing gets wrong if you stop at the menu-bar app. The deliverable is not a tool, it is a telemetry contract: the set of signals every agent must emit regardless of where they land. When you control the orchestration loop, own the contract and the backend is a choice: a third-party platform like Langfuse, Datadog, or Arize is a perfectly good sink, as long as it receives your orchestration signals and not just model traces. For a managed or black-box agent you do not control, the move is different. Check what telemetry the platform exports, its retention, and its alert hooks before you treat it as production-ready.
| Layer | Capture | Notes |
|---|---|---|
| Baseline (you own the loop) | Per-turn tokens, latency, model and effort, fallback reason | Claude agents where you own the orchestration loop |
| Baseline (you own the loop) | Tool calls vs tool results, stop reason, errors, escalations | The execution record you check self-report against |
| Baseline (you own the loop) | Run id, session id, state transitions, heartbeat | What lets you correlate and detect a stuck run |
| Workload-specific | Conversation-advance check | Elicitation and chat flows, not batch or retrieval agents |
The vocabulary for some of this is standardizing. The OpenTelemetry GenAI semantic conventions define agent spans and a conversation id, and name a provider value for Anthropic. They are also still marked as in development, and they do not yet cover the operator states this post cares about, like which local session is blocked or waiting on a human. Use them where they fit. The discipline underneath is the runtime half of a verification loop: keep an evidence trail for each turn, and that trail is what agentic pipeline provenance and agentic governance in production depend on later.
Build the right amount, own the signals
The pattern under all of this is one sentence: capture what the agent did, from outside the agent, before you need to debug what it did. Self-report does not give you that. A green transport-level dashboard that treats a looping agent and a working one as the same "busy" does not give you that either. The native Claude surfaces now cover much of it, and the rest is a contract you own and a thin adapter you maintain.
A build-vs-buy rule of thumb, in three steps. First, turn on Agent View and CLI OpenTelemetry. Second, route a telemetry contract into the backend your team already governs. Third, build a bespoke instrument only for the operator UX the native surfaces miss, and give it an owner, a heartbeat, and a regression test against Claude Code updates. A record you can query, alert on, correlate with outcomes, and retain safely is observability. A record you only write is the telemetry substrate it is built on.
This is scoped to the operator-state and per-agent-telemetry layer. A team rollout needs more on top: ownership routing, audit logs, access controls, retention, and alert triage, the kind of governance a personal status light does not pretend to solve. But the starting move is the same whether you run a dozen local sessions or one Worker agent in production: find the signals you are missing, decide where they land, and stop trusting the agent's word for what it did. The reliability case for why that record has to keep running, not run once, is its own argument, and I made it in the guide on treating agent reliability as a loop, not a one-time test.
When I scope an engagement around running Claude in production, building this layer for a team's own agents, on their own stack, is part of what I ship. The advisory and implementation work I do often starts with one question: do you know what your agents are actually doing right now? If the answer is a shrug, that is the finding. A focused fifteen-minute conversation is usually enough to tell whether your green dashboard is telling you the truth.
FAQ
Is an agent's own "done" enough to know what it did? No. A model does not have privileged access to what it just did; its account of its own tool use is a reconstruction, not a readout. The reliable signal is what you observe from outside the agent: token usage, tool-call traces, stop reasons, process state. When self-report and the observed record diverge, only the observed record is data.
What should I instrument for a production AI agent? A universal minimum: per-turn token counts, latency, which model and effort ran, which fallback path fired and why, a tool-call trace that counts tool_use blocks against tool_result blocks, and error and escalation events. Then add workload-specific invariants, like a conversation-advance check for a multi-turn elicitation agent. Capture what the agent did, not what it said it did.
What is the difference between agent observability and agent evaluation? Evaluation measures quality against a set of cases, usually before or after a change. Observability is runtime visibility into what an agent is doing while it runs. Evaluation answers 'is it good?' Observability answers 'what is happening right now, and what just happened?' You need both, and they are not the same instrument.
Don't Agent View and the Agent SDK already give me observability? They cover a lot now. Agent View (research preview) shows backgrounded Claude Code sessions with rich states; the CLI exports OpenTelemetry telemetry; the Agent SDK adds traces; Managed Agents adds a hosted console. The slice they leave is an always-on glance over the attached terminal sessions you are actively working in, without backgrounding them or running a collector. And the principle holds regardless of surface: self-report is not observability.