The Changelog Says 5 Levels. My Probe Went 9 Deep. Inside Claude Code's Nested Subagents.

Nine levels deep, zero errors. That is how far a chain of subagents spawning subagents ran on my machine the morning after the feature shipped, and the changelog had told me it would stop at five.

The line itself is eleven words. It landed in Claude Code v2.1.172 on June 10, 2026: "Sub-agents can now spawn their own sub-agents (up to 5 levels deep)." The reaction I watched was almost unanimous, and almost all of it was the same word: finally. Deeper agent trees, at last. I read it the other way, so I wrote a subagent whose only job was to spawn a copy of itself and pointed it at the ceiling. It blew past five without complaint.

That gap between the documented limit and what actually happened is the whole post in miniature. The capability is real and genuinely useful for a narrow shape of work. But the number in the changelog is the least interesting thing about it, and the framing that nesting gives you "more room" has the economics backwards. Every level you add is a context window that gets the dispatch prompt and nothing else on the way down, and hands back a summary and nothing else on the way up. Depth is a budget you spend, not headroom you fill.

So here is the claim I will defend, as a companion to the subagents-versus-skills piece I published the same morning: nesting earns its keep mainly when the shape of the work is itself recursive, and otherwise only when an extra level does a job a flat wave cannot. For everything else, an added level is pure overhead, and the misuse I expect is teams mirroring their org chart into an agent tree.

What actually shipped (and what the docs still deny)

Start with the facts, because the documentation does not agree with itself yet.

The v2.1.172 release notes carry the nesting line verbatim, alongside a telling bug fix in the same release: "Fixed a background sub-agent staying stuck as 'active' in the agent panel after a nested agent it spawned was stopped." Nesting was live enough before general availability to generate a UI bug. That is a real feature, not a doc typo.

Now the part that will trip you up. As of the morning after release, the official Create custom subagents page still says the opposite, in three separate places: that subagents cannot spawn other subagents, that this "prevents infinite nesting," and that the Agent tool "has no effect in subagent definitions." The Agent SDK subagents page agrees with the old story too: "Subagents cannot spawn their own subagents. Don't include Agent in a subagent's tools array." The changelog and the docs are pointing in opposite directions, and the changelog is newer. When the two disagree, the release notes win and the docs are lagging. Cite the release notes, anchor the claim to the version, and re-check before you depend on it.

One more distinction the docs do not draw clearly, and no competitor I found draws at all. Nesting is scoped to file-based subagents in the Claude Code CLI. Three adjacent surfaces did not change:

Surface	Can it nest?	Hard limit
File subagents (`.claude/agents/`)	Yes, as of v2.1.172	Changelog says 5 levels
Agent SDK subagents	No (per current SDK docs)	`Agent` tool withheld
Agent Teams	No nested teams	Only the lead manages
Managed Agents API	One level only	`depth > 1` ignored, 20 agents max

If you read "nesting shipped" and assume it means everywhere, you will design against the wrong surface. The API enforces a flat coordinator-worker shape on purpose. The CLI is the one that grew depth, which makes the live question not whether you can nest, but whether the work in front of you should, which is where the budget comes in.

This capability was also wanted, specifically and for a while. A July 2025 GitHub issue asked for exactly it, with a concrete use case: a pull-request review agent that drowns in CI logs and needs to hand the log analysis to its own child rather than rot its own context. The issue even proposed depth limits "to prevent infinite recursion." A community project, gruckion/nested-subagent, shipped a workaround that spawned isolated claude -p processes because the old harness stripped the spawning tool out of subagents entirely. So the demand was real. The question was never whether anyone wanted depth. It was what depth costs, and whether the work in front of you is the shape that pays for it.

I probed the depth cap. It didn't bind where the changelog says.

The changelog says five. I wanted to know what "five" meant in practice, because the docs never define how a level is counted, so the day after release I ran two probes on Claude Code v2.1.173.

The design was deliberately dumb. One general-purpose subagent, dispatched with a self-replicating prompt: run one shell command, attempt to spawn a single child carrying the same prompt with the level number bumped by one, wait, report what the child said. The first probe was report-only and reached level seven with every spawn succeeding. Report-only has an obvious hole, though: a single agent could fabricate a tower of nested reports without ever spawning anything. So the second probe added a tamper check. Each level appended one line to a trace file on disk, /tmp/nest-probe-trace.txt, an artifact my top-level session could read independently after the dust settled.

Down: the prompt. Up: a summary. Sideways: a trace line that proves it ran.

The chain ran to level nine and stopped there because that was the cap I wrote into the prompt, not because the harness pushed back. Every spawn from level one through level eight returned cleanly. The trace file held exactly nine lines, LEVEL 1 through LEVEL 9, timestamped in strict order from 09:57:19 to 10:01:08, each twenty-four to thirty-four seconds after the last. That rules out the cheapest fabrication, a text-only report with nothing behind it, though a determined single agent with shell access could still stage timed writes; what makes me confident the chain really executed is the trace plus the nested return structure I watched come back through the Agent tool plus per-level latency that matches real spawn overhead. Strong evidence, not airtight proof. [VERIFIED -- probe 2026-06-11] The documented five-level cap did not bind at nine.

I want to be precise about what that does and does not mean, because the honest claim is narrower than the exciting one. This was one machine, one install, one interactive session, on a feature that is days old and that Boris Cherny announced as experimental, with the depth of five framed as a starting point he wanted feedback on. The defensible statement is "the cap did not bind in my probe," not "there is no cap." It could tighten in the next patch. Two facts did hold at every level and are worth keeping: the spawning tool propagated all the way down, so a depth-eight agent could still spawn, and every nested child inherited my session's working directory unchanged. The second one matters if you run worktrees, and I will come back to it.

One more thing the second probe surfaced that I did not expect. When it finished, the harness attached an automated security warning to the result, flagging the recursive self-replicating dispatch as an uncontrolled agent loop. The probe was bounded and intentional, so the warning was a false positive in my case. But it is a genuine finding: nested self-spawning is a pattern that trips automated safety review, which is worth knowing before you wire one into a workflow that runs unattended.

Here is the load-bearing takeaway. If the platform will not reliably stop you at depth five, then the budget is yours to enforce. The discipline that keeps a deep tree from becoming a money fire is not a config value. It is a design decision you make before you write the first agent file.

The recursive-shape test: what nesting is actually for

Cherny's own framing for the feature is context management, not hierarchy. That is the right frame, and it points straight at the one question that should gate every decision to add a level.

Does decomposing this task produce subtasks that themselves require decomposition you could not have enumerated from the top?

If yes, the work is recursive, and depth is the natural fit. The pull-request reviewer from that GitHub issue is the clean example: it cannot know in advance how many CI failures it will find or how deep any one of them goes, so handing each failure to a child that may itself spawn a log-analysis grandchild matches the shape of the problem. If no, if you can list the subtasks up front, you do not have recursive work. You have a flat fan-out wearing a tree costume, and a single orchestrator dispatching one level of workers is both cheaper and clearer.

The research backs the distinction more sharply than I expected. The ReDel study of recursive multi-agent systems found recursion's payoff swung hard on task shape: it beat a flat baseline on some multi-hop and web-navigation metrics and trailed it on others, and its two failure modes mapped straight onto the mismatch. One was overcommitment, an agent grinding on a task it should have split into children. The other was undercommitment, an agent delegating work it should have just done itself. Same machinery, opposite results, and task shape was the variable that decided which. A separate paper on when divide-and-conquer works for long context names the preconditions almost as a checklist: decomposition wins when subtasks have low cross-dependence, when the model is sensitive to context length, and when you have a robust way to recombine the pieces. Read those three conditions as the spec for "recursive shape." If your subtasks talk to each other constantly, depth will hurt.

This is the same gate from a different altitude than the four-gate decision map I wrote for flat subagent orchestration, with one addition that nesting forces: the gate has to pass at every level, not just the root. A task can clear the bar at level zero and fail it at level three, where a child is delegating out of habit rather than necessity. Each level is its own decision, and that is the whole budget: you spend a level only where it earns its place.

One caution, because a test like this can quietly go circular: call every success "recursive" and every failure "a flat fan-out that should have known better," and you have a slogan, not a rule. So make the call before you dispatch, on evidence you can write down. A level earns its depth when you can name four things up front: the unknown frontier the child has to go inspect, the logs or the subtree or the dependency graph the parent cannot cheaply see for itself; the local signal that will tell the child whether to spawn its own children; the recomposition artifact the child returns that the parent can actually verify; and the stop condition that ends the recursion. Cannot fill in all four? Then you are guessing, and a flat wave you inspect and re-dispatch is the safer tool. Unknown subtask count alone does not earn nesting, because a root can dispatch one wave, read what comes back, and dispatch another. The thing that earns a child its own children is local knowledge the root cannot cheaply get.

Two honest amendments to "only when recursive" follow from that. First, a non-recursive level can still earn its place when it does a job a flat wave cannot: a specialist that compresses long, noisy evidence into a verdict before the root reads it, a permission boundary a read-only child enforces structurally, a cold verification pass that has to start without the author's context. What does not earn a level is a reporting line. Second, a static hierarchy, your module tree or your folder layout, can map to a real agent tree, but only where each node is a genuine low-coupling context boundary with its own spawn criterion, not just because that is how the codebase happens to be filed.

The bill: what every delegation boundary costs

Every signal that says "use depth" is buying something with tokens, accuracy, and visibility. You should know the three line items before you sign. One honesty note before the numbers: the studies below measure flat fan-out and two-tier orchestration, not deep CLI subagent nesting specifically. I read them as risk priors, the direction the costs run, not as proof of how your particular tree will behave.

Start with tokens, because the number is public and large. Anthropic's own writeup of their multi-agent research system reports that it burned roughly fifteen times the tokens of a chat session, with a single agent around four times; that post is about a year old now, so treat the multiple as directional rather than current, but the structural reason does not age. In the same writeup, token usage alone explained about 80% of the variance in their evaluation scores on that system, which is the cleanest way I know to say the spend is doing real work when there is real work to do. The vendor's cost docs put agent teams at about seven times a standard session when teammates run in plan mode. Depth compounds this: every level is a fresh context window that has to be briefed on the way in and produces output on the way out. A financial-document benchmark found the sweet spot was a supervisor-worker pattern at one level deep, holding the best cost-accuracy position at 1.4x baseline cost. One level of supervision, not five.

Then there is accuracy, which does not merely fail to improve with depth. It actively degrades under the wrong topology. The largest study I found, a Google, MIT, and DeepMind collaboration across 180 configurations and four benchmarks, measured error amplification by coordination structure. Read its four labels as one coordinator managing workers (centralized), peers coordinating directly (decentralized), a mix of the two (hybrid), and agents running with no coordination at all (independent).

Even the best-behaved topology amplifies a single agent's error rate over fourfold. Source: arXiv 2512.08296, 180 configurations across four benchmarks.

Source: Towards a Science of Scaling Agent Systems, arXiv 2512.08296 (2025-12)

The same study found two things that should make you cautious about depth specifically. Multi-agent setups degraded performance by 39% to 70% on sequential planning tasks regardless of topology, and adding agents produced negative returns precisely on the tasks a single agent already handled well, above roughly 45% baseline accuracy; coordination only earned its keep where the single agent was still failing. For the work most readers point Claude Code at, where a single agent already clears that bar, that finding cuts toward less delegation, not more.

The tree-specific risk is error cascade. A separate paper on error cascades in multi-agent collaboration found that a single seeded error reached a 100% infection rate in five of six frameworks, and injection at a hub node was far more contagious than at a leaf. In a nested tree, every intermediate node is an upstream dependency for its whole subtree, and the wider its fan-out and the weaker its verification, the more contagious a mistake at that node becomes. An error at level two can poison everything below it, and the root never sees the reasoning that went wrong.

Which is the third cost, and the one that worries me most: visibility. A subagent returns its final message and nothing else. Its intermediate tool calls, its dead ends, the moment it misread the brief, all of it stays inside that level's context and never surfaces. Anthropic describes subagents condensing tens of thousands of explored tokens into 1,000-to-2,000-token summaries, which is the feature working as designed. But stack that compression five deep and the root orchestrator is reading a summary of a summary of a summary. There is even a formal version of the worry: a 2026 paper argued that under equal token budgets, single agents match or beat multi-agent systems on multi-hop reasoning, because every handoff between agents can only lose information, never add it. That result has scope limits (it excludes tool use and assumes the single agent uses its context perfectly), but the intuition is exactly right for nesting. Each boundary is a one-way valve for context. The deeper you go, the less the top knows about the bottom. As I argued in the dynamic-workflows piece, more agents do not add verification; they multiply whatever your contract already does, including nothing.

The strongest case against the budget framing

I have been arguing depth is a cost. The honest version of this post has to admit there is good evidence it can be a bargain, and I have to engage the strongest form of it.

The sharpest counter is a 2026 paper on recursive multi-agent systems that reported moderate recursion improving accuracy by an average of 8.3% while cutting token spend by 34.6% to 75.6% against the strongest baselines, on tasks including math, code, and scientific reasoning. If depth can buy both better answers and lower cost at the same time, then "budget, not headroom" collapses, because the budget came back positive. And it is not only academics. Addy Osmani, on the Chrome team, advises practitioners to not stop at one level of delegation and to "spawn feature leads that spawn their own specialists" for deeper decomposition without blowing up any single context window. Anthropic's framing cuts the same way: in its own guidance on building agents, isolation is sold as the benefit, the noisy work kept out of the parent's view on purpose.

The objection is real, and I think it sharpens the thesis rather than breaking it. Two things hold the line.

First, the comparison that matters is token-equal: give the single agent the same total thinking budget the multi-agent system spends across all its members. That control is the one the optimistic studies often lack, and when you apply it, much of the multi-agent advantage narrows or disappears. So a chunk of the recursive win is buying accuracy with tokens, which is allowed, but it means depth pays when you genuinely have the budget and the task can use it, not as a free lunch. That is the budget framing, confirmed, not refuted.

Second, look closely at what Osmani recommends. His trees decompose by the structure of the work: a feature lead owns a feature, and spawns specialists for the parts of that feature. That is recursive shape. It is the precise opposite of the failure mode I am warning about, which is decomposing by org chart, where a level exists because a human reporting line exists, not because the work underneath it needs its own decomposition. The good pattern and the anti-pattern can look identical in a diagram. They are distinguished entirely by why the level is there. So the thesis lands where the evidence does: depth is a budget, and it is a budget well spent exactly when the recursive-shape test passes. The danger is spending it because the changelog said you could.

Running depth in production: the discipline that makes it safe

If the platform will not enforce a depth budget for you, here is the discipline I would put around nested subagents before trusting one with real work. It is calibrated from running a twelve-subagent pipeline in production, not from the changelog.

Default the depth budget to one

One level of specialists is the baseline. Every additional level has to be justified in the dispatching prompt by a recursive-shape argument, not a vibe. If you cannot write the sentence explaining why this child needs its own children, it does not.

Make specialists read-only; the orchestrator owns writes

In my pipeline every research and review subagent carries Read, Grep, and web tools and nothing that mutates disk. It is an allowlist the harness enforces at dispatch, not a convention an agent is asked to honor, and the boundary travels with the agent definition down every level. A depth-three agent cannot write files it was never given the tools to write.

Use files as the interface, not return values

Have each level write its full findings to disk and return a short summary. This keeps the upward summaries small and auditable, and it gives you the intermediate state the compression would otherwise destroy. The trace file in my probe is the same idea: the artifact on disk is the ground truth, not the agent's self-report.

Design the verification contract before the tree

Decide how you will know each level did its job before you let it spawn the next. A fleet pointed at a weak gate just produces wrong answers faster and deeper. The contract is the part you design first, the agents second.

Instrument for depth, because self-report thins as you descend

The deeper the node, the less its work surfaces to the top, so the more you need disk artifacts and logs over the agent telling you it finished. Telling you it is done and being done are different claims, and that gap widens with every level.

Guard shared state, because deep writers are invisible

Nested children inherited my session's working directory all the way down. A depth-three agent writing to the same tree the root is editing is a collision the orchestrator cannot see coming. Worktree isolation and write-path guards matter more with depth, not less.

None of this is exotic. It is the same posture that keeps flat fan-out from spiraling, applied to a structure that punishes its absence harder. The broader industry guidance agrees on the starting point: Microsoft's architecture center tells teams that if a single agent can reliably solve the scenario, the coordination overhead of multiple agents often exceeds the benefit. Start at the lowest complexity that works. Add a level only when the work in front of you has a shape that demands it. And if a single-level fan-out is what you actually need, Claude Code's /batch or a flat orchestrator is the cheaper, more legible tool than a tree you have to spelunk to debug. The error-cascade study offers one encouraging number for teams that do go deep: a dedicated governance layer prevented 89% of error infections without changing the architecture at all. The discipline is the product.

FAQ

Can Claude Code subagents spawn other subagents?

Yes, as of Claude Code v2.1.172 (released June 10, 2026). File-based subagents defined in .claude/agents/ can now spawn their own subagents, which the changelog caps at five levels deep. The capability is CLI-and-file-scoped: the Agent SDK's programmatic subagents are still documented as non-nesting, and the main subagents docs page had not been updated to reflect the change as of the day after release. Verify the current state before you build on it.

How deep can nested subagents actually go in Claude Code?

The v2.1.172 changelog says up to five levels. When I probed it the day after release on v2.1.173, a recursive chain ran nine levels deep with zero spawn failures, verified against a disk trace, so the documented cap did not bind on my machine. Treat that as "the cap did not bind in my probe," not "there is no cap." The feature is experimental and the depth behavior can change in any patch release. Do not design a workflow that relies on a specific depth.

Do nested subagents inherit context from their parent?

Each level starts a fresh context window. It receives the dispatch prompt its parent writes, plus your standing project context (CLAUDE.md and git status load into every custom subagent), and nothing else: not the parent's conversation history, not the grandparent's. Coming back, only the final message returns to the level above. Every delegation boundary is a context reset in both directions.

When should I use nested subagents instead of flat fan-out?

Use depth mainly when the work is itself recursive: when decomposing a task produces subtasks that themselves require decomposition you could not have enumerated from the top. One non-recursive case also earns a level, a job a flat wave cannot do: compressing noisy evidence into a verdict before the root reads it, enforcing a permission boundary with a read-only child, or running a cold verification pass. If you can list all the subtasks up front and none of those apply, that is flat fan-out from a single orchestrator, and one level is correct. Mirroring an org chart or a folder tree into an agent hierarchy, with no such job at each node, adds delegation boundaries that cost tokens, accuracy, and visibility without buying anything.

Spend the budget where the work is recursive

The changelog gave you a deeper tree. It did not give you a reason to grow one. Strip away the version number and the depth count, and nested subagents pose the same question every delegation poses, just more of it: does this work need to happen in a separate room, briefed cold and reporting back a summary? When the answer is yes all the way down, when the task genuinely fractals into subtasks you could not enumerate from the top, depth is the right tool and the budget is well spent. When the answer is no, an extra level is a context reset, a token multiplier, and a blind spot, dressed up as structure.

My probe ran nine levels deep without complaint, which tells me the platform will not save you from over-nesting. The recursive-shape test will. Default to one level, make every deeper one argue for itself, and keep the artifacts on disk so you can see what the summaries hid. That is the difference between a tree that earns its depth and an org chart you accidentally compiled into agents.

If you are mapping out where subagents, skills, and hooks each belong in your setup, the subagents-versus-skills companion is the place to start, and the four-gate decision map walks the orchestration call in depth. And if you would rather have a second set of eyes on an agent architecture before it ships, my Agentic Workflow and Automation Development work covers exactly this; a fifteen-minute call is enough to figure out whether your tree should be a tree at all.

Glossary terms used

Subagent orchestration

The Changelog Says 5 Levels. My Probe Went 9 Deep. Inside Claude Code's Nested Subagents.

What actually shipped (and what the docs still deny)

I probed the depth cap. It didn't bind where the changelog says.

The recursive-shape test: what nesting is actually for

The bill: what every delegation boundary costs

The strongest case against the budget framing

Running depth in production: the discipline that makes it safe

Default the depth budget to one

Make specialists read-only; the orchestrator owns writes

Use files as the interface, not return values

Design the verification contract before the tree

Instrument for depth, because self-report thins as you descend

Guard shared state, because deep writers are invisible

FAQ

Can Claude Code subagents spawn other subagents?

How deep can nested subagents actually go in Claude Code?

Do nested subagents inherit context from their parent?

When should I use nested subagents instead of flat fan-out?

Spend the budget where the work is recursive

Subagent Orchestration in Production: Trade-offs and Failure Modes

Agent Reliability in Production: A Verification Loop, Not a One-Time Test

Running Claude Code as a Production Engineering Practice

Continue reading: more in Build with Claude

When to Trust an LLM Judge (and When to Pull It Off the Gate)

Claude Code Can Run a Thousand Subagents. The Verification Contract Is the Part You Design First.

When to Orchestrate Claude Code Subagents: A Four-Gate Decision Map

Sources

What actually shipped (and what the docs still deny)

I probed the depth cap. It didn't bind where the changelog says.

The recursive-shape test: what nesting is actually for

The bill: what every delegation boundary costs

The strongest case against the budget framing

Running depth in production: the discipline that makes it safe

Default the depth budget to one

Make specialists read-only; the orchestrator owns writes

Use files as the interface, not return values

Design the verification contract before the tree

Instrument for depth, because self-report thins as you descend

Guard shared state, because deep writers are invisible

FAQ

Can Claude Code subagents spawn other subagents?

How deep can nested subagents actually go in Claude Code?

Do nested subagents inherit context from their parent?

When should I use nested subagents instead of flat fan-out?

Spend the budget where the work is recursive

Reference guides for this topic

Subagent Orchestration in Production: Trade-offs and Failure Modes

Agent Reliability in Production: A Verification Loop, Not a One-Time Test

Running Claude Code as a Production Engineering Practice

Continue reading: more in Build with Claude→

When to Trust an LLM Judge (and When to Pull It Off the Gate)

Claude Code Can Run a Thousand Subagents. The Verification Contract Is the Part You Design First.

When to Orchestrate Claude Code Subagents: A Four-Gate Decision Map

Sources

Continue reading: more in Build with Claude