The Four Sub-Agent Orchestration Patterns That Cover 90% of Production Claude Workloads
What this post assumes
- Familiarity with Claude Code or the Claude API
- Experience running at least one production agent loop
- Comfort reading Anthropic documentation as primary source material
A multi-agent Claude system uses roughly 15 times more tokens than a chat conversation. Anthropic published that number themselves. The same post also reports a 90.2% performance gain on research queries. Both numbers are accurate. Neither tells you whether to build one.
That is the strange place Claude teams find themselves in as of April 2026. The primitives are public. Sub-agent files under .claude/agents/. Agent Teams with shared task lists. Managed Agents with callable_agents IDs. Parallel tool_use blocks at the Messages API level. Five named workflow patterns shipped in the December 2024 research post. Opus 4.7 ships defaults tuned specifically for these loops.
And yet there is no canonical reference architecture. No decision framework that says "for workloads like X, reach for pattern Y." No coordination-failure taxonomy. No cross-workload benchmark data beyond the 90.2% research number.
Teams I talk to get to this point and stall. They have moved past one-engineer-one-Claude usage. Coordinating multiple agents feels like the next step. The Anthropic docs give them the pieces but not the picture.
This post is the picture. Four patterns that cover most production workloads. Decision criteria. Cost math. The cases where the patterns fail. Every claim is tied to an Anthropic primary source or a named practitioner I trust.
The Four Patterns, Up Front
If you are going to close this tab in ninety seconds, read this section and go.
- Parallel fan-out. Scatter independent work across N peer sub-agents and gather results at the parent. Highest payoff when tasks are truly independent, coarse-grained enough to amortize context overhead, and fit a shared deadline.
- Sequential review chains. Sub-agent A produces, sub-agent B reviews or critiques, the parent integrates. Useful as a structured quality gate when the producer and reviewer need different framings or different tool access.
- Adversarial dual-analysis. Two sub-agents with explicitly opposing prescribed framings produce independent analyses, and the parent synthesizes or picks the stronger argument. Useful for diagnostic questions where evidence is ambiguous and you want the disagreement to surface.
- Hierarchical planner-executor. A planner sub-agent decomposes work into tasks, executor sub-agents run them, a synthesizer integrates. This is Anthropic's "orchestrator-workers" pattern under a name that clarifies the two distinct roles.
The decision matrix. I'm going to earn every row in the sections that follow, but a technical leader scanning for "which one do I pick on Monday" should have the answer without scrolling through 3,000 words first.
| Pattern | When to reach for it | Cost tier | Latency profile | Primary failure mode |
|---|---|---|---|---|
| Parallel fan-out | Independent subtasks, shared deadline, cost of N-way token spend justifies walltime gain | N × single-agent | Bounded by slowest peer | Uniform bottleneck if tasks are not decomposed finely enough |
| Sequential review chains | Quality gates with different reviewer framing; producer and reviewer need different tools | 2× to 3× single-agent | Additive across agents | Reviewer rubber-stamps the producer instead of disagreeing |
| Adversarial dual-analysis | Ambiguous evidence, architectural judgement calls, debugging competing hypotheses | 2× to 3× single-agent | Bounded by slowest analyst + synthesis | Prescribed framings collapse into agreement, surfacing nothing |
| Hierarchical planner-executor | Genuine decomposition benefit, heterogeneous work, long-running work that benefits from a plan | 1 planner + N executors + 1 synthesizer | Planner time, then parallel executors, then synthesis | Plan rigidity: planner commits before the work surfaces what the tasks really need |
Every row has a failure mode column for a reason. The steel-man at the end of this post is built almost entirely from Anthropic's own cautionary quotes, and the failure modes are where those quotes hit hardest.
Why Anthropic Hasn't Published This (and What It Costs You)
Precision matters here. The gap between what exists and what does not is the whole reason this post needs to exist.
What Anthropic publishes today:
- Architectural primitives. The sub-agent file format (.md under .claude/agents/ with YAML frontmatter). The Agent tool (renamed from Task in Claude Code v2.1.63). Agent Teams with a shared task list and peer mailbox. Managed Agents with callable_agents session threads. Every one of these is documented.
- Five named workflow patterns. Prompt Chaining, Routing, Parallelization, Orchestrator-Workers, and Evaluator-Optimizer. Published in the December 2024 post Building Effective Agents. The patterns still hold up. The post predates Agent Teams and Managed Agents.
- Two production case studies. The multi-agent research system behind Claude's Research feature. The C compiler built with parallel Claude sessions. The first reports 90.2% improvement over single-agent Opus 4 and the 15x token figure. The second spent roughly $20,000 in tokens across 2,000 Claude Code sessions to build a 100,000-line Rust-based C compiler.
- Version-specific notes. Opus 4.7 spawns fewer subagents by default. It also exposes task_budget caps across agentic loops. Sonnet 4.6 and prior are more aggressive delegators.
What Anthropic does not publish:
- No consolidated decision framework. Guidance on when to use which of the five patterns is scattered across four posts and two docs pages. No single document says "here is how to decide."
- No named adversarial-dual-analysis pattern. The closest reference is the Agent Teams "investigate with competing hypotheses" use case. Structurally it is the pattern. It just has no name in Anthropic's material.
- No coordination failure-mode taxonomy. The effective harnesses post catalogs four single-agent failure modes. There is no equivalent for the coordination failures that happen once you are running multiple agents at once.
- No cross-workload benchmark data. The 90.2% number applies only to internal research tasks. Anthropic notes that coding workloads have "fewer truly parallelizable tasks than research." No comparable figures exist for code review, data analysis, content production, or customer support.
The cost of this gap is predictable. Teams reach the point where a single Claude loop stops scaling and grab the nearest scaffolding they can wire together. Six to twelve months later they're still rediscovering patterns that already have Anthropic-published primitives. Or worse, they decide the complexity isn't worth it and quietly give up on the workload.
The rest of this post is the framework that closes the gap.
Pattern 1: Parallel Fan-Out
Parallel fan-out scatters independent work across N peer sub-agents. The parent gathers the results and synthesizes. It is the sectioning variant of what Anthropic calls Parallelization in the 2024 research post. Engineers from a cloud-architecture background will recognize it as fan-out / fan-in or scatter-gather. I use either name depending on the audience.
The Anthropic primitives. Claude Code dispatches sub-agents through the Agent tool. Definitions live as .md files in .claude/agents/. Each sub-agent runs in its own context window with its own system prompt and its own tool allowlist. Agent Teams is an experimental Claude Code feature as of April 2026. It adds shared task lists with file locking plus a peer mailbox. The mailbox matters when teammates coordinate rather than just report back. At the Messages API level, parallel tool_use blocks inside a single assistant turn execute concurrently. That is the fan-out primitive one level below sub-agents.
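Concretely, a sub-agent definition is a markdown file whose YAML frontmatter names the agent and scopes its tools, with the body serving as its system prompt. A minimal sketch of that format follows; the perf-auditor name, its description, and its tool list are illustrative placeholders I made up, not an example from Anthropic's docs:

```
---
name: perf-auditor
description: Reviews a module for performance issues. Use after large refactors.
tools: Read, Grep, Glob
---

You are a performance reviewer. Examine the files you are given for
algorithmic hot spots and unnecessary allocations. Report findings as a
prioritized list. Do not edit files.
```

The tool allowlist in the frontmatter is what gives each peer its own constrained capabilities, which matters again in the review-chain and adversarial patterns below.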
There is a load-bearing implementation detail here. If you dispatch N Agent tool calls across separate assistant messages, they serialize. If you emit all N in a single message, they run concurrently. I probed this on 2026-04-17 with three native Claude Code sub-agents. The parallel case ran in 14.87 seconds. The serial case took 43.8 seconds. The three-times speedup is structural at the harness layer. You do not get it by asking nicely.
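The gather side can be sketched at the harness layer. This is a minimal asyncio sketch under stated assumptions, not Anthropic SDK code: run_tool is a hypothetical stand-in for whatever your loop does when Claude emits a tool_use block. The point is structural: all blocks from one assistant message are awaited together, which is where the concurrency comes from.

```python
import asyncio

# Hypothetical handler standing in for real sub-agent work. In a production
# harness this is whatever your loop runs when Claude emits a tool_use block.
async def run_tool(name: str, args: dict) -> dict:
    await asyncio.sleep(0.1)  # simulate I/O-bound agent work
    return {"tool": name, "result": f"done: {args['task']}"}

async def fan_out(tool_calls: list[dict]) -> list[dict]:
    # Execute every tool_use block from a SINGLE assistant message
    # concurrently, then gather results for the next user turn. Dispatching
    # the same calls across separate messages would serialize them.
    return await asyncio.gather(
        *(run_tool(c["name"], c["input"]) for c in tool_calls)
    )

calls = [{"name": "agent", "input": {"task": t}} for t in ("auth", "db", "api")]
results = asyncio.run(fan_out(calls))
```

With three 0.1-second tasks, the gathered version finishes in roughly one task's time instead of three, which is the same structural speedup the harness measurement above shows.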
When to reach for it. Research queries with independent sub-questions. Multi-dimension audits where each dimension has its own rubric. Content generation across separable topics. Anthropic's research system pairs a Lead Researcher Agent (Opus 4) with 3 to 5 parallel Sonnet 4 subagents, a configuration from the June 2025 post that may have evolved with newer models. The post attributes "up to 90% for complex queries" in research-time reduction to parallel tool calling across those subagents. That gain is the payoff you are buying with the token multiplier.
When to avoid it. Tasks with shared state. Sequential dependencies. Strict determinism requirements. Agent Teams' own docs call out sequential tasks, same-file edits, and heavy inter-agent dependencies as the anti-cases.
How it fails. Fan-out without fine-grained decomposition. The Anthropic C compiler post documents the clearest example. Their first attempt gave all 16 parallel agents the same monolithic task: compile the Linux kernel. Every agent converged on the same bottleneck. Parallelism added zero value. The fix was not more agents. It was finer decomposition using a GCC oracle as the test harness. The lesson generalizes. Fan-out is a quality-of-decomposition test, not a quality-of-coordination test.
If you want to see this pattern running today, the five foundational API patterns post covers the tool-use and structured-output primitives that make fan-out reliable. Without typed tool contracts, sub-agents drift.
Pattern 2: Sequential Review Chains
Sequential review chains pass a produced artifact from one sub-agent to a second. The reviewer has a different framing and different tool access. The reviewed output returns to the parent for integration. The simplest shape is a single-pass review: produce, review once, ship. The iterative shape loops review and revise until a quality signal clears. Anthropic calls the iterative shape the Evaluator-Optimizer pattern. Microsoft Azure calls it the maker-checker loop. The two shapes look similar and fail differently, so name both before using either.
The Anthropic primitives. The sub-agents documentation uses this pattern as a canonical example: "Use the code-reviewer subagent to find performance issues, then use the optimizer subagent to fix them." That is the single-pass shape. The iterative shape is the Evaluator-Optimizer pattern from the 2024 research post. One LLM generates, another evaluates, and the loop runs until the evaluator's criteria clear. Both shapes need asymmetry. Give the reviewer a different system prompt, a different tool allowlist, or both. A reviewer with the same prompt and the same tools as the producer is asking the producer to grade its own homework.
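The iterative shape reduces to a small loop. A toy sketch, with produce and review as stand-ins for calls to two sub-agents with asymmetric prompts or tool allowlists; the function names and the bounded-rounds default are my choices, not Anthropic's:

```python
from typing import Callable

def review_chain(
    produce: Callable[[str], str],
    review: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Iterative producer/reviewer loop (the Evaluator-Optimizer shape).

    review returns (passed, feedback). The loop is bounded so an
    over-strict reviewer cannot burn tokens forever.
    """
    draft = produce(task)
    for _ in range(max_rounds):
        passed, feedback = review(draft)
        if passed:
            break
        # Feed the reviewer's critique back into the producer.
        draft = produce(f"{task}\n\nRevise to address: {feedback}")
    return draft

# Toy stand-ins: this "reviewer" demands the word 'tested' in the draft.
drafts = iter(["v1", "v1 tested"])
final = review_chain(
    produce=lambda t: next(drafts),
    review=lambda d: ("tested" in d, "mention tests"),
    task="write release notes",
)
```

The max_rounds cap is the governance lever: without it, an Evaluator-Optimizer loop has no upper bound on spend.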
When to reach for it. Quality gates where the review criteria are stable. Security review on code generated without security context. Accessibility checks on interface code generated without accessibility context. Fact-check passes on content generated without the reviewer's source material. A content-publishing pipeline can wire this in as three reviewer sub-agents that fan out after a draft lands, each running against a different rubric. One checks claims against sources. A second flags detectable AI-writing patterns. A third checks brand voice. The parent integrates.
When to avoid it. Short outputs where the reviewer's token cost exceeds the defect cost it catches. Tight latency budgets. Cases where the review criteria are not yet stable. A reviewer with vague criteria will either rubber-stamp or hallucinate problems.
How it fails. The reviewer rubber-stamps the producer. This happens most often when the reviewer shares the producer's context or prompt. The fix is asymmetry. Give the reviewer different source material, a different rubric, or a narrower tool allowlist. The engineering manager's guide to agentic governance covers the hook-based enforcement layer that makes review gates stick beyond prototypes.
Pattern 3: Adversarial Dual-Analysis
Adversarial dual-analysis spawns two sub-agents with opposing prescribed framings. They run in parallel against the same input. One argues the artifact is solid. The other argues it is broken. The parent reads both and either reconciles them or picks the stronger one.
This is the only pattern in the four where Anthropic's docs do not give it a name. The closest thing they publish is the Agent Teams "competing hypotheses" use case: "Spawn 5 agent teammates to investigate different hypotheses. Have them talk to each other to try to disprove each other's theories, like a scientific debate." That is structurally the pattern. It is just a use case, not a named entry in a taxonomy. Microsoft Azure names the broader shape "multi-agent debate" under group-chat orchestration. I prefer "adversarial dual-analysis" for two reasons. It is exactly two agents, not a free-form debate with N participants. And the parent is the synthesizer, not a third debater.
The Anthropic primitives. Agent Teams lets you spawn teammates from the same .md definitions with different system prompts or different tool allowlists. The shared mailbox lets them exchange analyses if you want reconciliation between teammates rather than at the parent. At the Messages API level, the voting variant of Parallelization runs multiple copies with different prompts and aggregates outputs. That is the shape you get building this directly against the API without Claude Code sub-agent infrastructure.
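The shape is two parallel one-sided analyses plus a parent synthesis step. A sketch under stated assumptions: analyst is a hypothetical stand-in for a sub-agent call that takes its prescribed framing as a system prompt, and the framing strings are illustrative, not Anthropic wording:

```python
import concurrent.futures

ADVOCATE = "Argue the design is sound. Cite only supporting evidence."
SKEPTIC = "Argue the design is flawed. Cite only opposing evidence."

def analyst(framing: str, artifact: str) -> str:
    # Stand-in for a sub-agent run with `framing` as its system prompt.
    return f"[{framing.split('.')[0]}] analysis of {artifact}"

def dual_analysis(artifact: str) -> dict:
    # Both analysts run in parallel against the same input. The parent
    # receives two deliberately one-sided outputs and does the synthesis;
    # it is a judge, not a third debater.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        pro, con = pool.map(lambda f: analyst(f, artifact), (ADVOCATE, SKEPTIC))
    return {
        "advocate": pro,
        "skeptic": con,
        "synthesis": "parent reconciles the two framings here",
    }

report = dual_analysis("cache invalidation design")
```

Note that the asymmetry lives entirely in the prescribed framings; the fix for framing collapse discussed below is to push the asymmetry deeper, into the evidence each analyst is allowed to see.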
When to reach for it. Architectural decisions where the trade-offs are real and evidence supports more than one answer. Debugging sessions with multiple plausible causes where you want the hypotheses surfaced, not blended. Risk assessments where you want the best case and the worst case to make their strongest arguments before you commit. The pattern is not about being adversarial for its own sake. It is prescribed asymmetry forcing one-sided outputs that the parent reconciles.
When to avoid it. Well-defined tasks with unambiguous correct answers. Cost-sensitive paths where 2x to 3x single-agent spend is not justified by decision quality. Situations where you already know the answer and want validation. Adversarial dual-analysis is the wrong tool for confirmation.
How it fails. The prescribed framings collapse into agreement. This happens when the framings are too similar. When the input material strongly supports one side. When both sub-agents inherit the same hidden assumptions from the parent. The fix is prescribed friction. Give the advocate access to supporting evidence and the skeptic access to opposing evidence. Or give them different rubrics that force them to evaluate on different dimensions.
Pattern 4: Hierarchical Planner-Executor
Hierarchical planner-executor separates the planning role from the execution role. A planner sub-agent reads the input and decomposes it into a task list. N executor sub-agents run the tasks. A synthesizer integrates the results. This is Anthropic's "Orchestrator-Workers" pattern from the 2024 research post. The name I use here makes the two distinct roles more obvious. Addy Osmani's Code Agent Orchestra discusses the same structure under the framing of hierarchical subagents and teams-of-teams. Multiple labels, one shape.
The Anthropic primitives. Managed Agents is in beta, with the multi-agent coordination sub-feature in research preview as of April 2026. It expresses this pattern directly through callable_agents. A coordinator agent declares which other agents it can invoke. Each called agent runs in its own session thread with isolated context and tools. Note the sharp constraint: only one level of delegation. A coordinator can call agents, but those agents cannot call further agents. The Anthropic effective-harnesses post shows a simpler two-agent shape. An Initializer Agent runs once to establish the environment. A Coding Agent runs in subsequent sessions focused on incremental work. And the April 2026 CodeRabbit webinar shows the planning-before-execution shape deployed against production code review.
When to reach for it. Work that benefits from planning-before-execution. Heterogeneous tasks that need different tools or different specialists. Long-running work where the plan itself is part of the deliverable and can be reviewed by a human first. Opus 4.7's long-trace coherence and task_budget parameter make the planner role more reliable than it was on Sonnet 4.6. That is why I default to Opus for the planner and Sonnet for the executors.
When to avoid it. When a flat peer-loop will do the job. Anthropic's own C compiler project is the cleanest example. Nearly 2,000 Claude Code sessions. 2 billion input tokens. 140 million output tokens. Roughly $20,000 total. And the winning architecture was not hierarchical. Each instance claimed tasks via git lock files, merged upstream, pushed back. No planner. No synthesizer. Flat-loop beat hierarchical because the tasks were granular and independent, and because a GCC oracle test harness provided the quality signal. If your work looks like that, don't reach for hierarchy first.
How it fails. Plan rigidity. The planner commits to a decomposition before the work surfaces what the tasks really need. The executors run the wrong plan. The fix is feedback loops. Let executors return "this task cannot be done as planned, here is what I found," and let the planner replan. Managed Agents' session.thread_idle and agent.thread_message_received events are the primitives you wire this into. The harness post's Initializer-plus-Coding-Agent shape avoids the problem by making the plan minimal and deferring most decisions to the Coding Agent.
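The replan feedback loop can be sketched as follows. This is a toy under stated assumptions: plan and execute are hypothetical stand-ins for the planner and executor sub-agents, and the toy failure rule (step 1 fails until replanned with extra context) exists only to exercise the loop:

```python
def plan(goal: str) -> list[str]:
    # Stand-in for the planner sub-agent (Opus, per the routing default above).
    return [f"{goal}: step {i}" for i in range(3)]

def execute(task: str) -> dict:
    # Stand-in executor. Crucially, it is ALLOWED to report failure with a
    # note, instead of silently running a plan that no longer fits.
    ok = "replanned" in task or "step 1" not in task
    return {"task": task, "ok": ok,
            "note": "" if ok else "needs different tooling"}

def run(goal: str, max_replans: int = 1) -> list[dict]:
    tasks = plan(goal)
    done: list[dict] = []
    for _ in range(max_replans + 1):
        results = [execute(t) for t in tasks]
        done += [r for r in results if r["ok"]]
        failed = [r for r in results if not r["ok"]]
        if not failed:
            break
        # Feedback loop: replan only the failed tasks, carrying the
        # executor's note back into the planner's next decomposition.
        tasks = [f"{r['task']} (replanned: {r['note']})" for r in failed]
    return done

results = run("build feature")
```

The bounded max_replans keeps the loop from oscillating; in a real deployment the replan step is where you would wire the session.thread_idle and agent.thread_message_received events.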
Cost and Latency Across the Four Patterns
Two Anthropic numbers anchor every cost decision here. From the multi-agent research system post: agents use roughly four times more tokens than chat conversations. Multi-agent systems use roughly fifteen times more tokens than chats. Do the division. That is a multi-agent token multiplier of about 3.75x over a single-agent loop running the same workload. It is the unit-cost premium you pay for coordination. It is real. It is published. It does not go away with better prompting.
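The division above, as a back-of-envelope sketch you can drop into a cost model. Only the two published ratios are Anthropic's; the helper function and its name are mine:

```python
SINGLE_AGENT_VS_CHAT = 4.0   # Anthropic: agents use ~4x chat tokens
MULTI_AGENT_VS_CHAT = 15.0   # Anthropic: multi-agent uses ~15x chat tokens

# Coordination premium over a single-agent loop on the same workload.
coordination_multiplier = MULTI_AGENT_VS_CHAT / SINGLE_AGENT_VS_CHAT  # 3.75

def projected_cost(single_agent_usd: float,
                   multiplier: float = coordination_multiplier) -> float:
    """What a workload's token bill becomes if it moves from a
    single-agent loop to a coordinated multi-agent system."""
    return single_agent_usd * multiplier
```

A workload spending $1,000 a month in a single-agent loop should be budgeted at roughly $3,750 as a multi-agent system before any routing savings.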
The multiplier applied to each pattern, using a single-agent loop as the 1x baseline:
| Pattern | Token multiplier vs single-agent | Wall-clock shape | Model-routing default |
|---|---|---|---|
| Parallel fan-out | ~N (one per peer) | Bounded by slowest peer + synthesis | Sonnet for peers; Opus for parent synthesis |
| Sequential review chains | 2x to 3x | Additive (producer + reviewer) | Sonnet for producer; Opus for reviewer when stakes justify |
| Adversarial dual-analysis | 2x to 3x | Bounded by slowest analyst + synthesis | Sonnet for analysts; Opus for synthesis |
| Hierarchical planner-executor | 1 planner + N executors + 1 synthesizer | Planner serial, executors parallel, synthesizer serial | Opus for planner + synthesizer; Sonnet for executors |
The latency shape matters as much as cost. Fan-out and adversarial dual-analysis are the only patterns that buy you wall-clock speed. Sequential review chains always run slower than a single agent doing both jobs. Hierarchical planner-executor only runs faster when executor parallelism outweighs the serial planner and synthesizer phases.
Model routing is where you claw back some of the cost multiplier. The Opus 4.7 routing analysis lays out the signals that decide between Opus and Sonnet: trace length, tool-call density, supervision. For orchestration the defaults I use are straightforward. Put Opus on the roles that need long-trace coherence: planners, synthesizers, reviewers with stakes. Put Sonnet on the roles that run in parallel and finish quickly: executors, peer fan-out workers, individual analysts. Opus 4.7 spawns fewer subagents by default. That matters when the planner in a hierarchical pattern is the one making the dispatch call. Expect to prompt it explicitly if you want the delegation behavior Sonnet 4.6 or Opus 4.6 gave you for free.
One more note on Opus 4.7. The task_budget parameter is an advisory token budget across an entire agentic loop, not a hard per-call cap. The docs are explicit that it is a suggestion the model targets, not a ceiling it is forced to respect. That still makes it the right governance lever for hierarchical planner-executor. Set the budget at the coordinator level, let the model internalize it, and let it decide where to spend.
The Case Against Reaching for Sub-Agents
I want to take the counter-argument seriously, because the steel-man is strong and it comes mostly from Anthropic's own engineers.
The effective harnesses post reads, in Anthropic's own words: "it's still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture." That is Anthropic hedging on their own thesis. The same post frames specialized sub-agents as a reasonable hypothesis rather than a proven outcome, saying "it seems reasonable that specialized agents like a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at sub-tasks." The engineering team that built the multi-agent research system is on record saying the multi-agent-vs-single-agent question is an open empirical one for their own use cases.
The multi-agent research system post itself is careful to scope the 90.2% improvement metric: it applies only to breadth-first research queries. Anthropic carves out coding as a category with "fewer truly parallelizable tasks than research." Most production workloads are closer to coding than research. The headline number does not generalize.
Then there's security. Simon Willison's "lethal trifecta" argument compounds with agent count: private data access plus untrusted content plus external communication equals a prompt-injection exfiltration risk. Every sub-agent that ingests untrusted content is an independent exfiltration vector. Multi-agent systems multiply the attack surface.
Microsoft Azure's AI Agent Orchestration Patterns arrives at the same place from a different angle. Multi-agent orchestration is the highest-complexity option in their taxonomy. Their explicit recommendation is to start with a direct model call or a single agent with tools. Escalate only when "a single agent can't reliably handle the task due to prompt complexity, tool overload, or security requirements."
Taken together, the steel-man is clear. Most production workloads don't meet the escalation threshold. The 15x token multiplier makes the unit economics brutal for non-research-shaped work. The attack surface expands with agent count. Even Anthropic's own team calls the question unsettled.
I believe the steel-man is correct for most workloads. A flat single-agent loop with good prompting and stable tool use is the right default. The four patterns exist for the minority of workloads where flat-loop is measurably failing. They aren't an upgrade for the majority case. My decision process for whether a workload has crossed the threshold:
| Escalation signal | Meaning | Pattern it points at |
|---|---|---|
| Tasks are independent and share a deadline | Wall-clock is the binding constraint, not unit cost | Parallel fan-out |
| Review rubric is stable but producer and reviewer need different tools | Quality gate with asymmetric framing | Sequential review chains |
| Evidence genuinely supports more than one interpretation | Diagnostic or architectural judgement call | Adversarial dual-analysis |
| Work requires planning before execution and tasks are heterogeneous | Plan itself has value; executors can specialize | Hierarchical planner-executor |
| None of the above | Stay flat-loop | No pattern |
If your workload does not light up at least one of the four signals above, the 3.75x token multiplier is buying you nothing. Use a single-agent loop and move on.
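The escalation table above can be encoded as a first-match decision function. A sketch of my decision process, not an Anthropic-published rule; the signal names paraphrase the table's rows:

```python
def pick_pattern(
    independent_with_deadline: bool,
    stable_rubric_asymmetric_tools: bool,
    ambiguous_evidence: bool,
    plan_first_heterogeneous: bool,
) -> str:
    # Signals checked in the order the table lists them; first match wins.
    if independent_with_deadline:
        return "parallel fan-out"
    if stable_rubric_asymmetric_tools:
        return "sequential review chain"
    if ambiguous_evidence:
        return "adversarial dual-analysis"
    if plan_first_heterogeneous:
        return "hierarchical planner-executor"
    return "stay flat-loop"
```

The important branch is the last one: when no signal lights up, the function refuses to pick a pattern at all.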
Closing
Anthropic has given us the primitives. Sub-agent files. Agent Teams. Managed Agents with callable_agents. Parallel tool_use blocks. Five named workflow patterns from December 2024. What they have not given us is a consolidated picture that tells a technical leader which pattern to reach for on Monday morning.
The four patterns in this post are that picture. Parallel fan-out for independent work under a shared deadline. Sequential review chains for quality gates with asymmetric framing. Adversarial dual-analysis for judgement calls with ambiguous evidence. Hierarchical planner-executor for work that benefits from planning-before-execution. Each pattern has Anthropic primitives behind it, quantified cost and latency behavior, and a failure mode you can diagnose before it wastes a quarter.
The steel-man stays standing. Flat-loop is the right default. Sub-agents are for the minority of workloads where a quality ceiling or latency floor is being hit, and where one of the four escalation signals is clearly present. If that describes a workload you are struggling with, the agentic workflow development work I do focuses on exactly this decision and the engineering that follows from it.
The diagnostic is the hardest part. Book a thirty-minute working session with me at calendly.com/hello-readysolutions/30min. I'll walk through your current Claude workload with you, we'll identify whether one of these four patterns raises your quality ceiling, and you'll hang up with a prioritized next step in writing.
Want to talk about how this applies to your team?
Book a Free Intro Call
Not ready for a call? Take the free AI Readiness Assessment instead.