What is subagent orchestration?

Decomposing a development task across multiple Claude Code subagents that run independently under a coordinating orchestrator, with explicit context isolation and result aggregation. Each worker runs in its own context window with its own permission scope, so what one worker reads, hallucinates, or fails on stays inside its lane.

Why is isolation the load-bearing property, not parallelism?

Parallelism saves wall time. Isolation prevents the failure modes a single-context agent compounds at scale: runaway exploration flooding the context window, hallucination leaking forward into later reasoning, and over-broad permission scope persisting across unrelated steps. The speedup is a side effect; the trustworthiness is the reason to pay the dispatch cost.

Cornerstone Guide

Subagent Orchestration in Production: Trade-offs and Failure Modes

When should I not orchestrate?

When the task has fewer than two independent dimensions, when each dimension is short and bounded, or when the dimensions are tightly coupled in a way that loses meaning once decomposed across workers. Orchestration is an isolation guarantee with a real dispatch cost; it is worth paying when it prevents a failure mode that would otherwise compound, and not worth paying as a default.

How does the orchestration layer itself fail?

Two named ways. A worker can blow its own context on a broad-scope dispatch, and the orchestrator's parallel lanes can return conflicting findings that the orchestrator must reconcile at the gate. Both have structural mitigations: scope decomposition before dispatch, and human-in-the-loop quality gates at reconciliation.

Why isolation per worker is the load-bearing property of subagent orchestration, where the pattern stops paying, and how the orchestration layer itself fails.

Last reviewed May 23, 2026

Subagent orchestration · Agentic pipeline provenance · Production agentic delivery · Tool use

Why isolation, not parallelism?

Most write-ups treat subagent orchestration as a parallelism trick: fan out work, wait for results, save wall time. That framing buries the operational point. I run the pattern on this codebase across three production skills, and the lesson that keeps repeating is that the speedup is the cheap benefit. The benefit that justifies the dispatch overhead is something quieter. Each worker holds its own context window, and what happens inside that window cannot leak to any other worker. Parallelism saves minutes. Isolation makes the work trustworthy.

The subagent orchestration glossary entry this cornerstone extends states the isolation point directly. The official Claude Code subagents documentation is even terser: each subagent runs in its own context window with independent permissions, and the lead receives only the digested result rather than the full intermediate token stream. The shape is orchestrator-worker. The Anthropic engineering team catalogues the same pattern under that name in a recent piece on multi-agent coordination, where the architectural framing is explicit:

"Each subagent operates in its own context window and returns distilled findings." (Anthropic, "Multi-agent coordination patterns")

What makes that sentence operationally important is the phrase "its own context window". That is the property the pattern is built around. Parallelism is what falls out when isolated workers happen to be runnable concurrently; it is a downstream consequence of the isolation, not the reason the isolation matters.

This is the same diagnostic line I drew on the AI coding spectrum: the trustworthy version of agentic development is the one where structural guarantees do the work, not the one where the agent is asked to be careful. Isolation per worker is one of those guarantees. Parallelism is not.

What single-context failure modes does orchestration contain?

Three, and they are the same three the glossary entry names. Each bites at a different scale, and each is the reason a serious team eventually moves to the pattern.

First: runaway exploration flooding the context window. A single-context agent on a real engineering task drifts. It reads a file, follows a reference, finds another file, opens a search, reads three more files, summarizes, and opens another search. Around twenty tool calls in, the context is full and the agent hasn't produced anything yet. The final response either gets truncated or fails outright with "Prompt is too long". I watched this happen in this very PR's session: an internal-xref dispatch I sent in parallel with the topic-research lanes exited with that exact error after twenty-one tool uses in roughly two and a half minutes, before its final response reached disk. The mitigation is to move the exploration into a worker whose only job is one dimension. The worker either succeeds or it fails inside its own lane; the lead receives a short summary instead of the full transcript, so the lead's context stays clean for the next decision.

A second class of failure surfaces when one step's output silently feeds the next: hallucination contamination across steps. A single-context agent that hallucinates a fact in step three of a task carries that fact forward into steps four, five, and six. The hallucination compounds because the agent's later reasoning sees its own earlier wrong output as ground truth. With orchestrated workers, a hallucination is bounded. A worker that goes off the rails in its own lane can't reach into another worker. The lead reconciles distilled findings without ever holding the intermediate token stream, so corruption stays inside the lane that produced it. This is the failure mode that makes orchestration worth the dispatch cost on any task with multiple independent dimensions.
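A minimal Python sketch of that containment, under stated assumptions: `run_worker` is a hypothetical stand-in for a subagent dispatch (it is not Claude Code's API), and `Digest` is an illustrative shape for the distilled finding. The point it shows is structural: the transcript is local to the worker, so the lead has nothing but the digest to reason over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Digest:
    """What the lead is allowed to see from one lane (hypothetical shape)."""
    lane: str
    ok: bool
    summary: str


def run_worker(lane: str, steps: list[str], max_summary_chars: int = 200) -> Digest:
    """Stand-in for a subagent dispatch; not a real Claude Code call.

    The transcript is a local variable. It is discarded when the worker
    returns, so nothing in it can reach the lead or another lane.
    """
    transcript: list[str] = []
    for step in steps:
        transcript.append(f"[{lane}] {step}")
    if steps:
        summary = f"{len(transcript)} steps completed; last: {steps[-1]}"
    else:
        summary = "no steps run"
    return Digest(lane=lane, ok=bool(steps), summary=summary[:max_summary_chars])


def lead(lanes: dict[str, list[str]]) -> list[Digest]:
    """The lead reconciles digests only; it never holds a transcript."""
    return [run_worker(name, steps) for name, steps in lanes.items()]


digests = lead({
    "research": ["read a.md", "read b.md", "summarize"],
    "xref": [],
})
```

A wrong intermediate step in the `research` lane could still produce a wrong summary, but it cannot show up as apparent ground truth inside the `xref` lane, which is the bounded-corruption property the paragraph above describes.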

The third bites at a layer most write-ups skip: permission scope drift. A single-context agent granted the union of permissions every step needs holds all of them for the entire run. If step one needs read-only file access but step five needs network calls, the agent carries both for the whole task, including during step three, which needs neither. Per-worker permission scoping shrinks the surface where a misstep matters: a worker that does something unexpected can only reach what its scope allowed, not what the union allowed. You don't feel this failure mode until a worker does something you didn't expect and you're glad it couldn't reach the network. The Cloud Security Alliance's 2026 survey of AI-agent scope violations found that fewer than one in ten organizations report their AI agents never exceeding intended permissions, and more than half see exceedance at least occasionally. The cost of letting the surface drift is empirical, not hypothetical.
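The difference between per-worker scope and the union can be sketched in a few lines of Python. Everything named here is an assumption for illustration: `WorkerScope`, the tool names `read_file` and `http_get`, and the `check` method are hypothetical and do not mirror any real permission API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerScope:
    """Hypothetical per-worker permission scope; tool names are illustrative."""
    name: str
    allowed_tools: frozenset

    def check(self, tool: str) -> None:
        """Raise if this worker tries a tool outside its own scope."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")


reader = WorkerScope("file-reader", frozenset({"read_file"}))
fetcher = WorkerScope("fetcher", frozenset({"http_get"}))

# What a single-context agent would hold for the entire run.
union_scope = reader.allowed_tools | fetcher.allowed_tools


def reader_can_reach_network() -> bool:
    """True only if the read-only worker is allowed a network call."""
    try:
        reader.check("http_get")
    except PermissionError:
        return False
    return True
```

Under the union, `http_get` is reachable during every step; under per-worker scoping, the read-only worker fails closed when it tries the same call.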

The Anthropic engineering team's deeper write-up on their multi-agent research system, the longer-form companion to the orchestrator-worker framing, makes the same operational point: the per-worker context windows are the load-bearing element of the architecture, not the parallelism. The DORA team's 2024 State of DevOps report found AI adoption associated with higher individual productivity but estimated decreases in software delivery throughput and stability. DORA does not study orchestration directly, but that productivity-up, stability-down split is exactly the gap that isolation offers one structural answer to.

When does orchestration stop paying?

The dispatch and reconciliation overhead isn't free. Every worker dispatch costs setup tokens, the worker's own context, and the orchestrator's time reconciling distilled findings. On a small task with one dimension and no contamination risk, that overhead can exceed the isolation value.

A rough sizing rule:

If the task has fewer than two independent dimensions, the overhead is wasted; run the work in a single context.
If the task has two or more independent dimensions and any one of them might drift on its own (a research dimension with many sources, a validation lane with cross-document state, a code surface where a hallucinated identifier would compound), orchestrate.
If the task has two or more independent dimensions but each one is short and bounded, the lead's reconciliation cost dominates; keep the work in-thread.
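The three rules above collapse into a small decision function. This is a sketch, not a calibrated policy: the function name and both parameters are hypothetical, and the inputs (how many dimensions are truly independent, whether any might drift) remain judgment calls made before dispatch.

```python
def should_orchestrate(independent_dimensions: int, any_may_drift: bool) -> bool:
    """Return True when the sizing rule says to dispatch workers.

    - fewer than two independent dimensions: run in a single context
    - two or more, and at least one might drift: orchestrate
    - two or more, but all short and bounded: keep the work in-thread
    """
    if independent_dimensions < 2:
        return False
    return any_may_drift
```

Note that drift risk alone never triggers orchestration here: a single drifting dimension has nothing to be isolated from, so the first rule wins.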

The honest version of the trade-off is that orchestration is not a productivity multiplier. It is an isolation guarantee with a measurable dispatch cost. That distinction matters because most write-ups frame the pattern as a speedup, and a buyer or an engineering lead who adopts it expecting a speedup ends up disappointed when small tasks slow down. The pattern is worth paying when the isolation prevents a failure mode that would otherwise compound. It's not worth paying as a default.

Cognition's engineering team came to a sharper version of this rule from the other direction. After an initial "Don't Build Multi-Agents" position, their April 2026 follow-up, "Multi-Agents: What's Actually Working," operationalized the trade-off: multi-agent systems work best, in their experience, when writes stay single-threaded and the additional agents contribute intelligence rather than actions. The mechanism they name for the failure case, "Context Rot," is the same dynamic the single-context failure modes section above describes from the other side: model decisions degrade as context length grows, and a single agent holding every dimension hits that wall before an orchestrated set of workers does. Their operational rule and the rule I run on this codebase converge on the same boundary.

The academic framing of the same trade-off is in Bhatt et al., When Should We Orchestrate Multiple Agents?, which states the decision criterion in one line: orchestration is only effective if there are performance or cost differentials between agents. Without measurable capability or cost differences between the orchestrator and its workers, the orchestration overhead produces net-negative economics. That criterion holds whether the workers are scoped by topic, by tool access, or by model size; what it rules out is orchestration as a default.

The governance side of the same question is well-documented. The Deloitte 2026 State of AI in the Enterprise survey found that only about one in five organizations pursuing agentic AI report mature governance for it. Mature governance is largely the discipline of knowing when to orchestrate and when not to: the same survey's efficiency-gain figures are widely reported, but the value gap (most orgs see efficiency, very few see revenue impact) tracks the gap between adoption and structural use.

How does the orchestration layer itself fail?

The pattern has its own failure modes, and skipping over them is what makes most treatments feel thin. Two have bitten the orchestration on this codebase enough to be tracked structurally.

First: the orchestrator's own workers can blow their own context. A topic-researcher dispatched with a broad multi-dimension prompt (four vendor delta dimensions, around twenty WebFetch calls projected) overran its own context window roughly five and a half minutes in, exited with "Prompt is too long", and left no findings on disk. A continuation message against the same agent crashed in single-digit milliseconds with the same error: zero salvage. The mitigation is structural. Any broad-scope dispatch is split into narrow parallel lanes emitted in a single message, and each lane is scoped tightly enough that its worker's context stays inside its budget. The cornerstone you're reading was authored under that discipline. Three narrow research lanes plus an internal-xref dispatch were sent in one parallel message. All of them stayed inside their workers' context except the internal-xref lane (designed for a smaller corpus), which overran its context in exactly the same way, in real time. This is the same blowup I named above as evidence that orchestration contains failure: the isolation did hold (no other lane was contaminated), and the orchestration layer still paid the cost (the worker's output was lost and the orchestrator had to absorb the gap). Both framings are true at once. That is what an honest failure mode looks like.
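The split described above can be sketched as a pre-dispatch step. This is an illustration under loud assumptions: `Lane` and `split_into_lanes` are hypothetical names, and the number of sources per lane is used as a crude proxy for a worker's context budget, which in practice is measured in tokens and depends on how large each fetched source turns out to be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    """One narrow dispatch: a single dimension and a bounded set of sources."""
    dimension: str
    sources: tuple


def split_into_lanes(dispatch: dict, max_sources_per_lane: int) -> list:
    """Split a broad dispatch into lanes that never mix dimensions.

    `dispatch` maps a dimension name to its list of sources. The cap is a
    stand-in for a context budget (an assumption, not a measured limit).
    """
    if max_sources_per_lane < 1:
        raise ValueError("max_sources_per_lane must be at least 1")
    lanes = []
    for dimension, sources in dispatch.items():
        for start in range(0, len(sources), max_sources_per_lane):
            chunk = sources[start:start + max_sources_per_lane]
            lanes.append(Lane(dimension, tuple(chunk)))
    return lanes


# Four dimensions with five sources each: twenty fetches if sent as one dispatch.
broad = {f"vendor-{i}": [f"vendor-{i}/page-{j}" for j in range(5)] for i in range(4)}
lanes = split_into_lanes(broad, max_sources_per_lane=3)
```

With a cap of three, the twenty-source dispatch becomes eight lanes, none holding more than three sources. A source-count cap would not have saved a lane whose corpus was larger than planned, which is why the cap should be read as a floor on discipline rather than a guarantee.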

The second failure mode shows up at the reconciliation seam: conflicting findings cost the orchestrator's own attention. When five reviewer subagents fire in parallel against a single draft, their findings can conflict. A brand-prose reviewer demands an edit that a coherence reviewer would forbid; a fact-checker flags a claim that a skeptic-reader treats as load-bearing. The reconciliation falls on the orchestrator at a quality gate, not on any individual reviewer. The cost is small per occurrence but is real, and it grows with the number of parallel lanes. The pattern absorbs the cost because the alternative, a single reviewer holding every dimension, would lose the isolation guarantee that motivated the orchestration in the first place. The METR randomized study of AI's impact on experienced OSS developer productivity is a useful sanity check here: experienced developers predicted AI would speed them up by around a quarter, still believed they had been sped up by roughly a fifth after the tasks were done, and were actually measured slower by about nineteen percent. The perception gap survived direct experience in the same direction. Multi-agent reconciliation has the same shape: a reviewer's self-grading of its own output is not the gate. Trust the gate, not the agent.
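One way to make the reconciliation seam explicit is to separate findings the reviewers agree on from findings that must go to the gate. The sketch below assumes a deliberately simple model: `Finding`, its `target` ids, and the two verdicts `"change"` and `"keep"` are all hypothetical, and real reviewer output is rarely this clean.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    reviewer: str
    target: str   # hypothetical id for a paragraph or claim
    verdict: str  # "change" or "keep" in this simplified model


def reconcile(findings):
    """Split findings into agreed verdicts and conflicts for the gate.

    Returns (agreed, escalate): `agreed` maps target -> verdict, and
    `escalate` maps target -> sorted reviewer names that disagree.
    """
    by_target = defaultdict(list)
    for finding in findings:
        by_target[finding.target].append(finding)

    agreed, escalate = {}, {}
    for target, group in by_target.items():
        verdicts = {finding.verdict for finding in group}
        if len(verdicts) == 1:
            agreed[target] = group[0].verdict
        else:
            escalate[target] = sorted(finding.reviewer for finding in group)
    return agreed, escalate


findings = [
    Finding("brand-prose", "para-3", "change"),
    Finding("coherence", "para-3", "keep"),
    Finding("fact-checker", "claim-7", "change"),
]
agreed, escalate = reconcile(findings)
```

Nothing in `reconcile` picks a winner. Conflicts are only surfaced, because the decision belongs to the quality gate and not to any reviewer or to the code that collected the findings.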

There's a third class of failure modes I'm not enumerating individually because each is a specific instance of one of the two above. A worker that crashes for token-budget reasons rather than scope-breadth reasons is still a worker blow-up; a downstream pipeline that consumed a worker's distilled finding without checking the worker's status is still a reconciliation conflict the orchestrator didn't absorb. The structural cure is the same in both: the orchestrator owns the gate, and a worker that can't return a clean digest is treated as not having returned at all.
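A fail-closed gate of that kind can be sketched as follows. The `WorkerResult` shape and the status strings are assumptions for illustration; the rule the code encodes is the one stated above: a result is accepted only when the worker reports success and returns a non-empty digest, and anything else counts as not returned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkerResult:
    lane: str
    status: str             # hypothetical values: "ok", "context_overflow", "crashed"
    digest: Optional[str]   # distilled finding, if any reached the orchestrator


def gate(results):
    """Accept only clean digests; treat everything else as missing.

    Returns (accepted, missing): `accepted` maps lane -> digest, and
    `missing` lists lanes the orchestrator must re-dispatch or absorb.
    """
    accepted, missing = {}, []
    for result in results:
        has_digest = result.digest is not None and result.digest.strip() != ""
        if result.status == "ok" and has_digest:
            accepted[result.lane] = result.digest
        else:
            # A partial digest from a failed worker is discarded, not salvaged.
            missing.append(result.lane)
    return accepted, missing


accepted, missing = gate([
    WorkerResult("research-a", "ok", "Three sources agree on the rollout date."),
    WorkerResult("xref", "context_overflow", "partial notes before the overflow"),
    WorkerResult("research-b", "ok", "   "),
])
```

The `xref` lane's partial text is dropped on purpose: consuming output from a worker whose status is not clean is the unabsorbed reconciliation conflict described above.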

The academic literature has converged on the same shape of failure surface. Cemri et al., Why Do Multi-Agent LLM Systems Fail?, classify more than 1,600 execution traces across seven multi-agent frameworks into fourteen named failure modes grouped under three structural causes (system design, inter-agent misalignment, task verification) and conclude that performance gains on popular benchmarks are often minimal without those structural cures. The two failure modes named in this section are the operational form of that finding. The cornerstone runs the pattern anyway because the isolation it buys, on the tasks it is designed for, is structurally unobtainable in a single context. The paper's MAST taxonomy catalogues the cost of running the pattern badly; the disciplines named here are how it is run well.

There is a sharper counter to address before this section closes. Tran and Kiela's April 2026 Stanford paper, Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, shows that on multi-hop reasoning a single-agent system scores 0.418-0.427 accuracy versus sequential multi-agent at 0.379-0.389 when total compute is held constant, and argues that many reported multi-agent advantages are computation-and-context-effect artifacts rather than architectural wins. The paper is right on the case it tests. The qualifier the paper itself surfaces is the load-bearing one: the single-agent advantage holds only under low context degradation. Under heavy context corruption, the multi-agent system becomes competitive again. The condition that lets the single agent win, in other words, is exactly the condition that long-running single-context agents progressively destroy in production. The cornerstone's thesis is unchanged: isolation is the load-bearing property because production single-context agents don't stay in the clean-context regime where the steel-man holds.

This is also the gate discipline I documented as the AI authoring trust chain: a chain of deny-by-default gates between an agent and a shipped artifact. The orchestrator owns the gate. The worker's job is to fail inside its lane when the inputs are bad; the gate's job is to fail closed when the worker's output is.

When should I not orchestrate?

This is where the thesis becomes operational. The hardest cell in the matrix below is "tightly coupled reasoning," and a useful tell for it: whether you can describe each lane's output in one sentence without referencing another lane's output. If you can't, the lanes aren't independent and orchestration will fight the reasoning instead of containing it.

| Situation | What I do |
| --- | --- |
| Single short task, one dimension, low contamination risk | Run in-thread. Orchestration overhead wasted. |
| Multiple short tasks, independent, low contamination risk | Run in-thread serially. Overhead still wasted unless any one task might drift. |
| Multiple tasks with independent state where any one could hallucinate or drift | Orchestrate. Isolation is the win. |
| One large task with many sources to read | Decompose into independent lanes per source set. Orchestrate. |
| One large task with many sources but tightly coupled reasoning | Do not orchestrate. The coupling crosses workers and reconciliation loses the thread. |
| Validation wave with several specialist reviewers per artifact | Orchestrate as a parallel wave. Independence per reviewer is the guarantee. |
| The task writes files | Keep all writes on the orchestrator. Workers stay read-only. |

The rule that sits behind every row is the same. Orchestrate when the isolation prevents a failure mode that would otherwise compound. Do not orchestrate when the coordination cost is higher than the failure mode you are protecting against.

How does this page stay current?

This cornerstone is the deep companion to the subagent orchestration glossary entry and a peer of Running Claude Code as a Production Engineering Practice, the parent cornerstone on the same site. The anchor is its primary artifact, a first-party operational record that lives next to the body and is updated when a new failure mode is observed or when an existing mitigation evolves. The Sources roster tracks the freshness of each external anchor under the 3-month AI/SaaS cap and the 6-month tool-capability cap that govern this site's authority pages; a row past its cap is held only when a sourced search trail documents what was looked at and why nothing fresher qualified.

Three adjacent practices sit underneath the pattern this page describes. Production agentic delivery is the wider operational mode that orchestration sits inside: agentic work that ships, gated by deterministic validators rather than ad-hoc review. Agentic pipeline provenance is the per-artifact record that makes orchestrated work auditable: which model, which skill, which knowledge-base state, which validators passed. Tool use is the per-worker capability that makes each isolated lane actually do something. The orchestration shape sits at the intersection of all three, which is also why a treatment focused only on parallelism misses what the pattern is for.
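The per-artifact provenance record mentioned above can be pictured as a small serializable structure. This is only a sketch of the four fields the paragraph names: `ProvenanceRecord`, the field names, and every value below are hypothetical placeholders, not this site's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-artifact record; field names are illustrative."""
    artifact: str
    model: str
    skill: str
    knowledge_base_state: str   # e.g. a snapshot or commit identifier
    validators_passed: tuple


record = ProvenanceRecord(
    artifact="example-artifact.md",
    model="example-model",
    skill="example-skill",
    knowledge_base_state="kb-snapshot-0001",
    validators_passed=("links", "frontmatter"),
)

# One JSON line per artifact keeps the record easy to diff and audit.
line = json.dumps(asdict(record), sort_keys=True)
```

Because the record is frozen and serialized with sorted keys, two runs that produced the same artifact under the same conditions yield byte-identical lines, which makes an audit a plain text comparison.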

When I scope a consultation, workshop, or implementation engagement around agentic development, the orchestration pattern, and the structural mitigations for its named failure modes, are part of what I ship.