The Four Sub-Agent Orchestration Patterns That Cover 90% of Production Claude Workloads
What this post assumes
- Familiarity with Claude Code or the Claude API
- Experience running at least one production agent loop
- Comfort reading Anthropic documentation as primary source material
A multi-agent Claude system uses roughly 15 times more tokens than a chat conversation. Anthropic published that number themselves. The same post also reports a 90.2% performance gain on research queries. Both numbers are accurate. Neither tells you whether to build one.
That is the strange place Claude teams find themselves in as of April 2026. The primitives are public. Sub-agent files under .claude/agents/. Agent Teams with shared task lists. Managed Agents with callable_agents IDs. Parallel tool_use blocks at the Messages API level. Five named workflow patterns shipped in the December 2024 research post. Opus 4.7 ships defaults tuned specifically for these loops.
And yet there is no canonical reference architecture. No decision framework that says "for workloads like X, reach for pattern Y." No coordination-failure taxonomy. No cross-workload benchmark data beyond the 90.2% research number.
Teams I talk to get to this point and stall. They have moved past one-engineer-one-Claude usage. Coordinating multiple agents feels like the next step. The Anthropic docs give them the pieces but not the picture.
This post is the picture. Four patterns that cover most production workloads. Decision criteria. Cost math. The cases where the patterns fail. Every claim is tied to an Anthropic primary source or a named practitioner I trust.
The Four Patterns, Up Front
If you are going to close this tab in ninety seconds, read this section and go.
- Parallel fan-out. Scatter independent work across N peer sub-agents and gather results at the parent. Highest payoff when tasks are truly independent, coarse-grained enough to amortize context overhead, and fit a shared deadline.
- Sequential review chains. Sub-agent A produces, sub-agent B reviews or critiques, the parent integrates. Useful as a structured quality gate when the producer and reviewer need different framings or different tool access.
- Adversarial dual-analysis. Two sub-agents with explicitly opposing prescribed framings produce independent analyses, and the parent synthesizes or picks the stronger argument. Useful for diagnostic questions where evidence is ambiguous and you want the disagreement to surface.
- Hierarchical planner-executor. A planner sub-agent decomposes work into tasks, executor sub-agents run them, a synthesizer integrates. This is Anthropic's "orchestrator-workers" pattern under a name that clarifies the two distinct roles.
The decision matrix. I'm going to earn every row in the sections that follow, but a technical leader scanning for "which one do I pick on Monday" should have the answer without scrolling through 3,000 words first.
| Pattern | When to reach for it | Cost tier | Latency profile | Primary failure mode |
|---|---|---|---|---|
| Parallel fan-out | Independent subtasks, shared deadline, cost of N-way token spend justifies walltime gain | N × single-agent | Bounded by slowest peer | Uniform bottleneck if tasks are not decomposed finely enough |
| Sequential review chains | Quality gates with different reviewer framing; producer and reviewer need different tools | 2× to 3× single-agent | Additive across agents | Reviewer rubber-stamps the producer instead of disagreeing |
| Adversarial dual-analysis | Ambiguous evidence, architectural judgement calls, debugging competing hypotheses | 2× to 3× single-agent | Bounded by slowest analyst + synthesis | Prescribed framings collapse into agreement, surfacing nothing |
| Hierarchical planner-executor | Genuine decomposition benefit, heterogeneous work, long-running work that benefits from a plan | 1 planner + N executors + 1 synthesizer | Planner time, then parallel executors, then synthesis | Plan rigidity: planner commits before the work surfaces what the tasks really need |
Every row has a failure mode column for a reason. The steel-man at the end of this post is built almost entirely from Anthropic's own cautionary quotes, and the failure modes are where those quotes hit hardest.
Why Anthropic Hasn't Published This (and What It Costs You)
Precision matters here. The gap between what exists and what does not is the whole reason this post needs to exist.
What Anthropic publishes today:
- Architectural primitives. The sub-agent file format (.md under .claude/agents/ with YAML frontmatter). The Agent tool (renamed from Task in Claude Code v2.1.63). Agent Teams with a shared task list and peer mailbox. Managed Agents with callable_agents session threads. Every one of these is documented.
- Five named workflow patterns. Prompt Chaining, Routing, Parallelization, Orchestrator-Workers, and Evaluator-Optimizer. Published in the December 2024 post Building Effective Agents. The patterns still hold up. The post predates Agent Teams and Managed Agents.
- Two production case studies. The multi-agent research system behind Claude's Research feature. The C compiler built with parallel Claude sessions. The first reports 90.2% improvement over single-agent Opus 4 and the 15x token figure. The second spent roughly $20,000 in tokens across 2,000 Claude Code sessions to build a 100,000-line Rust-based C compiler.
- Version-specific notes. Opus 4.7 spawns fewer subagents by default. It also exposes task_budget caps across agentic loops. Sonnet 4.6 and prior are more aggressive delegators.
What Anthropic does not publish:
- No consolidated decision framework. Guidance on when to use which of the five patterns is scattered across four posts and two docs pages. No single document says "here is how to decide."
- No named adversarial-dual-analysis pattern. The closest reference is the Agent Teams "investigate with competing hypotheses" use case. Structurally it is the pattern. It just has no name in Anthropic's material.
- No coordination failure-mode taxonomy. The effective harnesses post catalogs four single-agent failure modes. There is no equivalent for the coordination failures that happen once you are running multiple agents at once.
- No cross-workload benchmark data. The 90.2% number applies only to internal research tasks. Anthropic notes that coding workloads have "fewer truly parallelizable tasks than research." No comparable figures exist for code review, data analysis, content production, or customer support.
The cost of this gap is predictable. Teams reach the point where a single Claude loop stops scaling and grab the nearest scaffolding they can wire together. Six to twelve months later they're still rediscovering patterns that already have Anthropic-published primitives. Or worse, they decide the complexity isn't worth it and quietly give up on the workload.
The rest of this post is the framework that closes the gap.
Pattern 1: Parallel Fan-Out
Parallel fan-out scatters independent work across N peer sub-agents. The parent gathers the results and synthesizes. It is the sectioning variant of what Anthropic calls Parallelization in the 2024 research post. Engineers from a cloud-architecture background will recognize it as fan-out / fan-in or scatter-gather. I use either name depending on the audience.
The Anthropic primitives. Claude Code dispatches sub-agents through the Agent tool. Definitions live as .md files in .claude/agents/. Each sub-agent runs in its own context window with its own system prompt and its own tool allowlist. Agent Teams is an experimental Claude Code feature as of April 2026. It adds shared task lists with file locking plus a peer mailbox. The mailbox matters when teammates coordinate rather than just report back. At the Messages API level, parallel tool_use blocks inside a single assistant turn execute concurrently. That is the fan-out primitive one level below sub-agents.
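Concretely, a sub-agent definition is a markdown file whose YAML frontmatter names the agent and scopes its tools, with the body serving as its system prompt. A minimal sketch of that format follows; the perf-auditor name, its description, and its tool list are illustrative placeholders I made up, not an example from Anthropic's docs:

```
---
name: perf-auditor
description: Reviews a module for performance issues. Use after large refactors.
tools: Read, Grep, Glob
---

You are a performance reviewer. Examine the files you are given for
algorithmic hot spots and unnecessary allocations. Report findings as a
prioritized list. Do not edit files.
```

The tool allowlist in the frontmatter is what gives each peer its own constrained capabilities, which matters again in the review-chain and adversarial patterns below.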
There is a load-bearing implementation detail here. If you dispatch N Agent tool calls across separate assistant messages, they serialize. If you emit all N in a single message, they run concurrently. I probed this on 2026-04-17 with three native Claude Code sub-agents. The parallel case ran in 14.87 seconds. The serial case took 43.8 seconds. The three-times speedup is structural at the harness layer. You do not get it by asking nicely.
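The gather side can be sketched at the harness layer. This is a minimal asyncio sketch under stated assumptions, not Anthropic SDK code: run_tool is a hypothetical stand-in for whatever your loop does when Claude emits a tool_use block. The point is structural: all blocks from one assistant message are awaited together, which is where the concurrency comes from.

```python
import asyncio

# Hypothetical handler standing in for real sub-agent work. In a production
# harness this is whatever your loop runs when Claude emits a tool_use block.
async def run_tool(name: str, args: dict) -> dict:
    await asyncio.sleep(0.1)  # simulate I/O-bound agent work
    return {"tool": name, "result": f"done: {args['task']}"}

async def fan_out(tool_calls: list[dict]) -> list[dict]:
    # Execute every tool_use block from a SINGLE assistant message
    # concurrently, then gather results for the next user turn. Dispatching
    # the same calls across separate messages would serialize them.
    return await asyncio.gather(
        *(run_tool(c["name"], c["input"]) for c in tool_calls)
    )

calls = [{"name": "agent", "input": {"task": t}} for t in ("auth", "db", "api")]
results = asyncio.run(fan_out(calls))
```

With three 0.1-second tasks, the gathered version finishes in roughly one task's time instead of three, which is the same structural speedup the harness measurement above shows.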
When to reach for it. Research queries with independent sub-questions. Multi-dimension audits where each dimension has its own rubric. Content generation across separable topics. Anthropic's research system pairs a Lead Researcher Agent (Opus 4) with 3 to 5 parallel Sonnet 4 subagents, a configuration from the June 2025 post that may have evolved with newer models. The post attributes "up to 90% for complex queries" in research-time reduction to parallel tool calling across those subagents. That gain is the payoff you are buying with the token multiplier.
When to avoid it. Tasks with shared state. Sequential dependencies. Strict determinism requirements. Agent Teams' own docs call out sequential tasks, same-file edits, and heavy inter-agent dependencies as the anti-cases.
How it fails. Fan-out without fine-grained decomposition. The Anthropic C compiler post documents the clearest example. Their first attempt gave all 16 parallel agents the same monolithic task: compile the Linux kernel. Every agent converged on the same bottleneck. Parallelism added zero value. The fix was not more agents. It was finer decomposition using a GCC oracle as the test harness. The lesson generalizes. Fan-out is a quality-of-decomposition test, not a quality-of-coordination test.
If you want to see this pattern running today, the five foundational API patterns post covers the tool-use and structured-output primitives that make fan-out reliable. Without typed tool contracts, sub-agents drift.
Pattern 2: Sequential Review Chains
Sequential review chains pass a produced artifact from one sub-agent to a second. The reviewer has a different framing and different tool access. The reviewed output returns to the parent for integration. The simplest shape is a single-pass review: produce, review once, ship. The iterative shape loops review and revise until a quality signal clears. Anthropic calls the iterative shape the Evaluator-Optimizer pattern. Microsoft Azure calls it the maker-checker loop. The two shapes look similar and fail differently, so name both before using either.
The Anthropic primitives. The sub-agents documentation uses this pattern as a canonical example: "Use the code-reviewer subagent to find performance issues, then use the optimizer subagent to fix them." That is the single-pass shape. The iterative shape is the Evaluator-Optimizer pattern from the 2024 research post. One LLM generates, another evaluates, and the loop runs until the evaluator's criteria clear. Both shapes need asymmetry. Give the reviewer a different system prompt, a different tool allowlist, or both. A reviewer with the same prompt and the same tools as the producer is asking the producer to grade its own homework.
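The iterative shape reduces to a small loop. A toy sketch, with produce and review as stand-ins for calls to two sub-agents with asymmetric prompts or tool allowlists; the function names and the bounded-rounds default are my choices, not Anthropic's:

```python
from typing import Callable

def review_chain(
    produce: Callable[[str], str],
    review: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Iterative producer/reviewer loop (the Evaluator-Optimizer shape).

    review returns (passed, feedback). The loop is bounded so an
    over-strict reviewer cannot burn tokens forever.
    """
    draft = produce(task)
    for _ in range(max_rounds):
        passed, feedback = review(draft)
        if passed:
            break
        # Feed the reviewer's critique back into the producer.
        draft = produce(f"{task}\n\nRevise to address: {feedback}")
    return draft

# Toy stand-ins: this "reviewer" demands the word 'tested' in the draft.
drafts = iter(["v1", "v1 tested"])
final = review_chain(
    produce=lambda t: next(drafts),
    review=lambda d: ("tested" in d, "mention tests"),
    task="write release notes",
)
```

The max_rounds cap is the governance lever: without it, an Evaluator-Optimizer loop has no upper bound on spend.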
When to reach for it. Quality gates where the review criteria are stable. Security review on code generated without security context. Accessibility checks on interface code generated without accessibility context. Fact-check passes on content generated without the reviewer's source material. A content-publishing pipeline can wire this in as three reviewer sub-agents that fan out after a draft lands, each running against a different rubric. One checks claims against sources. A second flags detectable AI-writing patterns. A third checks brand voice. The parent integrates.
When to avoid it. Short outputs where the reviewer's token cost exceeds the defect cost it catches. Tight latency budgets. Cases where the review criteria are not yet stable. A reviewer with vague criteria will either rubber-stamp or hallucinate problems.
How it fails. The reviewer rubber-stamps the producer. This happens most often when the reviewer shares the producer's context or prompt. The fix is asymmetry. Give the reviewer different source material, a different rubric, or a narrower tool allowlist. The engineering manager's guide to agentic governance covers the hook-based enforcement layer that makes review gates stick beyond prototypes.
Pattern 3: Adversarial Dual-Analysis
Adversarial dual-analysis spawns two sub-agents with opposing prescribed framings. They run in parallel against the same input. One argues the artifact is solid. The other argues it is broken. The parent reads both and either reconciles them or picks the stronger one.
This is the only pattern in the four where Anthropic's docs do not give it a name. The closest thing they publish is the Agent Teams "competing hypotheses" use case: "Spawn 5 agent teammates to investigate different hypotheses. Have them talk to each other to try to disprove each other's theories, like a scientific debate." That is structurally the pattern. It is just a use case, not a named entry in a taxonomy. Microsoft Azure names the broader shape "multi-agent debate" under group-chat orchestration. I prefer "adversarial dual-analysis" for two reasons. It is exactly two agents, not a free-form debate with N participants. And the parent is the synthesizer, not a third debater.
The Anthropic primitives. Agent Teams lets you spawn teammates from the same .md definitions with different system prompts or different tool allowlists. The shared mailbox lets them exchange analyses if you want reconciliation between teammates rather than at the parent. At the Messages API level, the voting variant of Parallelization runs multiple copies with different prompts and aggregates outputs. That is the shape you get building this directly against the API without Claude Code sub-agent infrastructure.
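The shape is two parallel one-sided analyses plus a parent synthesis step. A sketch under stated assumptions: analyst is a hypothetical stand-in for a sub-agent call that takes its prescribed framing as a system prompt, and the framing strings are illustrative, not Anthropic wording:

```python
import concurrent.futures

ADVOCATE = "Argue the design is sound. Cite only supporting evidence."
SKEPTIC = "Argue the design is flawed. Cite only opposing evidence."

def analyst(framing: str, artifact: str) -> str:
    # Stand-in for a sub-agent run with `framing` as its system prompt.
    return f"[{framing.split('.')[0]}] analysis of {artifact}"

def dual_analysis(artifact: str) -> dict:
    # Both analysts run in parallel against the same input. The parent
    # receives two deliberately one-sided outputs and does the synthesis;
    # it is a judge, not a third debater.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        pro, con = pool.map(lambda f: analyst(f, artifact), (ADVOCATE, SKEPTIC))
    return {
        "advocate": pro,
        "skeptic": con,
        "synthesis": "parent reconciles the two framings here",
    }

report = dual_analysis("cache invalidation design")
```

Note that the asymmetry lives entirely in the prescribed framings; the fix for framing collapse discussed below is to push the asymmetry deeper, into the evidence each analyst is allowed to see.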
When to reach for it. Architectural decisions where the trade-offs are real and evidence supports more than one answer. Debugging sessions with multiple plausible causes where you want the hypotheses surfaced, not blended. Risk assessments where you want the best case and the worst case to make their strongest arguments before you commit. The pattern is not about being adversarial for its own sake. It is prescribed asymmetry forcing one-sided outputs that the parent reconciles.
When to avoid it. Well-defined tasks with unambiguous correct answers. Cost-sensitive paths where 2x to 3x single-agent spend is not justified by decision quality. Situations where you already know the answer and want validation. Adversarial dual-analysis is the wrong tool for confirmation.
How it fails. The prescribed framings collapse into agreement. This happens when the framings are too similar. When the input material strongly supports one side. When both sub-agents inherit the same hidden assumptions from the parent. The fix is prescribed friction. Give the advocate access to supporting evidence and the skeptic access to opposing evidence. Or give them different rubrics that force them to evaluate on different dimensions.
Pattern 4: Hierarchical Planner-Executor
Hierarchical planner-executor separates the planning role from the execution role. A planner sub-agent reads the input and decomposes it into a task list. N executor sub-agents run the tasks. A synthesizer integrates the results. This is Anthropic's "Orchestrator-Workers" pattern from the 2024 research post. The name I use here makes the two distinct roles more obvious. Addy Osmani's Code Agent Orchestra discusses the same structure under the framing of hierarchical subagents and teams-of-teams. Multiple labels, one shape.
The Anthropic primitives. Managed Agents is in beta, with the multi-agent coordination sub-feature in research preview as of April 2026. It expresses this pattern directly through callable_agents. A coordinator agent declares which other agents it can invoke. Each called agent runs in its own session thread with isolated context and tools. Note the sharp constraint: only one level of delegation. A coordinator can call agents, but those agents cannot call further agents. The Anthropic effective-harnesses post shows a simpler two-agent shape. An Initializer Agent runs once to establish the environment. A Coding Agent runs in subsequent sessions focused on incremental work. And the April 2026 CodeRabbit webinar shows the planning-before-execution shape deployed against production code review.
When to reach for it. Work that benefits from planning-before-execution. Heterogeneous tasks that need different tools or different specialists. Long-running work where the plan itself is part of the deliverable and can be reviewed by a human first. Opus 4.7's long-trace coherence and task_budget parameter make the planner role more reliable than it was on Sonnet 4.6. That is why I default to Opus for the planner and Sonnet for the executors.
When to avoid it. When a flat peer-loop will do the job. Anthropic's own C compiler project is the cleanest example. Nearly 2,000 Claude Code sessions. 2 billion input tokens. 140 million output tokens. Roughly $20,000 total. And the winning architecture was not hierarchical. Each instance claimed tasks via git lock files, merged upstream, pushed back. No planner. No synthesizer. Flat-loop beat hierarchical because the tasks were granular and independent, and because a GCC oracle test harness provided the quality signal. If your work looks like that, don't reach for hierarchy first.
How it fails. Plan rigidity. The planner commits to a decomposition before the work surfaces what the tasks really need. The executors run the wrong plan. The fix is feedback loops. Let executors return "this task cannot be done as planned, here is what I found," and let the planner replan. Managed Agents' session.thread_idle and agent.thread_message_received events are the primitives you wire this into. The harness post's Initializer-plus-Coding-Agent shape avoids the problem by making the plan minimal and deferring most decisions to the Coding Agent.
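The replan feedback loop can be sketched as follows. This is a toy under stated assumptions: plan and execute are hypothetical stand-ins for the planner and executor sub-agents, and the toy failure rule (step 1 fails until replanned with extra context) exists only to exercise the loop:

```python
def plan(goal: str) -> list[str]:
    # Stand-in for the planner sub-agent (Opus, per the routing default above).
    return [f"{goal}: step {i}" for i in range(3)]

def execute(task: str) -> dict:
    # Stand-in executor. Crucially, it is ALLOWED to report failure with a
    # note, instead of silently running a plan that no longer fits.
    ok = "replanned" in task or "step 1" not in task
    return {"task": task, "ok": ok,
            "note": "" if ok else "needs different tooling"}

def run(goal: str, max_replans: int = 1) -> list[dict]:
    tasks = plan(goal)
    done: list[dict] = []
    for _ in range(max_replans + 1):
        results = [execute(t) for t in tasks]
        done += [r for r in results if r["ok"]]
        failed = [r for r in results if not r["ok"]]
        if not failed:
            break
        # Feedback loop: replan only the failed tasks, carrying the
        # executor's note back into the planner's next decomposition.
        tasks = [f"{r['task']} (replanned: {r['note']})" for r in failed]
    return done

results = run("build feature")
```

The bounded max_replans keeps the loop from oscillating; in a real deployment the replan step is where you would wire the session.thread_idle and agent.thread_message_received events.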
Cost and Latency Across the Four Patterns
Two Anthropic numbers anchor every cost decision here. From the multi-agent research system post: agents use roughly four times more tokens than chat conversations. Multi-agent systems use roughly fifteen times more tokens than chats. Do the division. That is a multi-agent token multiplier of about 3.75x over a single-agent loop running the same workload. It is the unit-cost premium you pay for coordination. It is real. It is published. It does not go away with better prompting.
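The division above, as a back-of-envelope sketch you can drop into a cost model. Only the two published ratios are Anthropic's; the helper function and its name are mine:

```python
SINGLE_AGENT_VS_CHAT = 4.0   # Anthropic: agents use ~4x chat tokens
MULTI_AGENT_VS_CHAT = 15.0   # Anthropic: multi-agent uses ~15x chat tokens

# Coordination premium over a single-agent loop on the same workload.
coordination_multiplier = MULTI_AGENT_VS_CHAT / SINGLE_AGENT_VS_CHAT  # 3.75

def projected_cost(single_agent_usd: float,
                   multiplier: float = coordination_multiplier) -> float:
    """What a workload's token bill becomes if it moves from a
    single-agent loop to a coordinated multi-agent system."""
    return single_agent_usd * multiplier
```

A workload spending $1,000 a month in a single-agent loop should be budgeted at roughly $3,750 as a multi-agent system before any routing savings.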
The multiplier applied to each pattern, using a single-agent loop as the 1x baseline:
| Pattern | Token multiplier vs single-agent | Wall-clock shape | Model-routing default |
|---|---|---|---|
| Parallel fan-out | ~N (one per peer) | Bounded by slowest peer + synthesis | Sonnet for peers; Opus for parent synthesis |
| Sequential review chains | 2x to 3x | Additive (producer + reviewer) | Sonnet for producer; Opus for reviewer when stakes justify |
| Adversarial dual-analysis | 2x to 3x | Bounded by slowest analyst + synthesis | Sonnet for analysts; Opus for synthesis |
| Hierarchical planner-executor | 1 planner + N executors + 1 synthesizer | Planner serial, executors parallel, synthesizer serial | Opus for planner + synthesizer; Sonnet for executors |
The latency shape matters as much as cost. Fan-out and adversarial dual-analysis are the only patterns that buy you wall-clock speed. Sequential review chains always run slower than a single agent doing both jobs. Hierarchical planner-executor only runs faster when executor parallelism outweighs the serial planner and synthesizer phases.
Model routing is where you claw back some of the cost multiplier. The Opus 4.7 routing analysis lays out the signals that decide between Opus and Sonnet: trace length, tool-call density, supervision. For orchestration the defaults I use are straightforward. Put Opus on the roles that need long-trace coherence: planners, synthesizers, reviewers with stakes. Put Sonnet on the roles that run in parallel and finish quickly: executors, peer fan-out workers, individual analysts. Opus 4.7 spawns fewer subagents by default. That matters when the planner in a hierarchical pattern is the one making the dispatch call. Expect to prompt it explicitly if you want the delegation behavior Sonnet 4.6 or Opus 4.6 gave you for free.
One more note on Opus 4.7. The task_budget parameter is an advisory token budget across an entire agentic loop, not a hard per-call cap. The docs are explicit that it is a suggestion the model targets, not a ceiling it is forced to respect. That still makes it the right governance lever for hierarchical planner-executor. Set the budget at the coordinator level, let the model internalize it, and let it decide where to spend.
The Case Against Reaching for Sub-Agents
I want to take the counter-argument seriously, because the steel-man is strong and it comes mostly from Anthropic's own engineers.
The effective harnesses post reads, in Anthropic's own words: "it's still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture." That is Anthropic hedging on their own thesis. The same post frames specialized sub-agents as a reasonable hypothesis rather than a proven outcome, saying "it seems reasonable that specialized agents like a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at sub-tasks." The engineering team that built the multi-agent research system is on record saying the multi-agent-vs-single-agent question is an open empirical one for their own use cases.
The multi-agent research system post itself is careful to scope the 90.2% improvement metric: it applies only to breadth-first research queries. Anthropic carves out coding as a category with "fewer truly parallelizable tasks than research." Most production workloads are closer to coding than research. The headline number does not generalize.
Then there's security. Simon Willison's "lethal trifecta" argument compounds with agent count: private data access plus untrusted content plus external communication equals a prompt-injection exfiltration risk. Every sub-agent that ingests untrusted content is an independent exfiltration vector. Multi-agent systems multiply the attack surface.
Microsoft Azure's AI Agent Orchestration Patterns arrives at the same place from a different angle. Multi-agent orchestration is the highest-complexity option in their taxonomy. Their explicit recommendation is to start with a direct model call or a single agent with tools. Escalate only when "a single agent can't reliably handle the task due to prompt complexity, tool overload, or security requirements."
Taken together, the steel-man is clear. Most production workloads don't meet the escalation threshold. The 15x token multiplier makes the unit economics brutal for non-research-shaped work. The attack surface expands with agent count. Even Anthropic's own team calls the question unsettled.
I believe the steel-man is correct for most workloads. A flat single-agent loop with good prompting and stable tool use is the right default. The four patterns exist for the minority of workloads where flat-loop is measurably failing. They aren't an upgrade for the majority case. My decision process for whether a workload has crossed the threshold:
| Escalation signal | Meaning | Pattern it points at |
|---|---|---|
| Tasks are independent and share a deadline | Wall-clock is the binding constraint, not unit cost | Parallel fan-out |
| Review rubric is stable but producer and reviewer need different tools | Quality gate with asymmetric framing | Sequential review chains |
| Evidence genuinely supports more than one interpretation | Diagnostic or architectural judgement call | Adversarial dual-analysis |
| Work requires planning before execution and tasks are heterogeneous | Plan itself has value; executors can specialize | Hierarchical planner-executor |
| None of the above | Stay flat-loop | No pattern |
If your workload does not light up at least one of the four signals above, the 3.75x token multiplier is buying you nothing. Use a single-agent loop and move on.
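The escalation table above can be encoded as a first-match decision function. A sketch of my decision process, not an Anthropic-published rule; the signal names paraphrase the table's rows:

```python
def pick_pattern(
    independent_with_deadline: bool,
    stable_rubric_asymmetric_tools: bool,
    ambiguous_evidence: bool,
    plan_first_heterogeneous: bool,
) -> str:
    # Signals checked in the order the table lists them; first match wins.
    if independent_with_deadline:
        return "parallel fan-out"
    if stable_rubric_asymmetric_tools:
        return "sequential review chain"
    if ambiguous_evidence:
        return "adversarial dual-analysis"
    if plan_first_heterogeneous:
        return "hierarchical planner-executor"
    return "stay flat-loop"
```

The important branch is the last one: when no signal lights up, the function refuses to pick a pattern at all.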
Closing
Anthropic has given us the primitives. Sub-agent files. Agent Teams. Managed Agents with callable_agents. Parallel tool_use blocks. Five named workflow patterns from December 2024. What they have not given us is a consolidated picture that tells a technical leader which pattern to reach for on Monday morning.
The four patterns in this post are that picture. Parallel fan-out for independent work under a shared deadline. Sequential review chains for quality gates with asymmetric framing. Adversarial dual-analysis for judgement calls with ambiguous evidence. Hierarchical planner-executor for work that benefits from planning-before-execution. Each pattern has Anthropic primitives behind it, quantified cost and latency behavior, and a failure mode you can diagnose before it wastes a quarter.
The steel-man stays standing. Flat-loop is the right default. Sub-agents are for the minority of workloads where a quality ceiling or latency floor is being hit, and where one of the four escalation signals is clearly present. If that describes a workload you are struggling with, the agentic workflow development work I do focuses on exactly this decision and the engineering that follows from it.
The diagnostic is the hardest part. Book a thirty-minute working session with me at calendly.com/hello-readysolutions/30min. I'll walk through your current Claude workload with you, we'll identify whether one of these four patterns raises your quality ceiling, and you'll hang up with a prioritized next step in writing.
Want to talk about how this applies to your team?
Book a Free Intro Call
Not ready for a call? Take the free AI Readiness Assessment instead.