Seven hundred and fifty thousand lines of Rust. Ninety-nine point eight percent of the existing test suite passing. Eleven days, first commit to merge.

That is the Bun runtime, ported from Zig to Rust, and Anthropic put it at the center of the dynamic workflows launch on May 28. The migration happened and the numbers check out. Then a Rust developer with twenty years on the language read the actual pull request and flagged its tests as a textbook case of a test that repeats the implementation: it validates the code against itself instead of against behavior. So the 99.8% is exactly as trustworthy as what the tests assert, and no more.

Here is the shape of the problem that creates. The agent reports done. The suite is green. You still do not know whether the code is correct, because the thing that decided "correct" was generated alongside the thing it was supposed to check. That gap is survivable when one agent writes one change and you read the diff. Dynamic workflows let one run fan out across hundreds of background subagents. At that scale you cannot read the diffs, and the question of what counts as a passing result stops being something you check after the fact. It becomes the first thing you design. Not the prompt, not the agent. The contract.

Strip the jargon and it is a simple picture. A hundred interns are editing your codebase at once, faster than you can read. You cannot check every change by hand. What protects the codebase is the checklist you wrote before you let them start: what counts as correct, what gets sent back, when they stop. That checklist is the contract, and at a hundred interns you write it first or you do not write it at all.

What dynamic workflows actually changed

Claude Code shipped dynamic workflows in v2.1.154 on May 28, 2026. The mechanic is small and the consequence is not. Instead of holding the orchestration plan in Claude's context, you (or Claude, on request) write the plan as a JavaScript script. The official docs describe the limits directly: up to 16 agents run concurrently, capped at 1,000 agents total per run to prevent runaway loops. The workflow script itself has no direct filesystem or shell access. Only the agents it spawns touch the codebase. Turn on ultracode and Claude plans a workflow for each substantive task instead of waiting for you to ask.
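To make those two limits concrete, here is a minimal sketch of a bounded pool. It is a generic illustration, not the Claude Code workflow API: `runBounded` and its option names are hypothetical, and refusing an oversized batch is my assumption about how a cap might behave, not something the docs state. Only the numbers 16 and 1,000 come from the limits above.

```javascript
// Illustrative only: a generic bounded pool, NOT the Claude Code workflow API.
// The defaults mirror the documented caps: 16 concurrent, 1,000 total per run.
async function runBounded(tasks, { concurrency = 16, maxTotal = 1000 } = {}) {
  if (tasks.length > maxTotal) {
    throw new Error(`refusing ${tasks.length} tasks: cap is ${maxTotal}`);
  }
  const results = new Array(tasks.length);
  let next = 0;   // index of the next task to start
  let active = 0; // tasks currently in flight
  let peak = 0;   // highest concurrency observed

  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      active++;
      peak = Math.max(peak, active);
      try {
        results[i] = await tasks[i]();
      } finally {
        active--;
      }
    }
  }

  const workerCount = Math.min(concurrency, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return { results, peak };
}

// Forty stand-in "agents" that each resolve after a short delay.
const fakeAgents = Array.from({ length: 40 }, (_, i) => () =>
  new Promise((resolve) => setTimeout(() => resolve(i), 5))
);
runBounded(fakeAgents).then(({ results, peak }) =>
  console.log(results.length, peak) // 40 16
);
```

The pool bounds how much runs at once and how much runs in total. It says nothing about whether any result is right, which is the gap the rest of this post is about.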

What ships is an orchestration engine plus a library of quality patterns. What does not ship is enforced verification. Read the docs closely and the adversarial review everyone quotes ("independent agents adversarially review each other's findings") is a pattern the workflow script can compose, not a default the platform applies. The bundled /deep-research workflow votes on each claim before it surfaces. A workflow you write has exactly the verification you put in it, and nothing you do not. The launch post's phrase, "results are checked before they're folded in," describes something you build. It is not something you inherit on install.

That distinction is the whole post, so it is worth seeing the three primitives side by side, with the column nobody else fills in:

| Primitive | What it is | Typical scale | Who owns verification |
| --- | --- | --- | --- |
| Subagent | A delegated agent with its own context | 1 to a handful | You, reading the result |
| Skill | Reusable instructions loaded into a turn | Inline | Whatever the skill encodes |
| Dynamic workflow | A script orchestrating agents in code | Tens to 1,000 | You, in the script, before it runs |

The model side makes the fan-out viable. Opus 4.8, released the same day, defaults to a 1M-token context window and 128k max output, and accepts mid-conversation system messages so you can change a long-running agent's instructions without breaking the prompt cache. Bigger context and longer runs are what turn "a few parallel agents" into a fleet. None of it decides whether the fleet's output is right, and on 4.8 the fleet can even misreport whether its own tools ran. Deciding what counts as right takes a separate artifact, and it is the one teams skip.

The unit of risk moved from the diff to the fleet

At three to ten parallel agents, you can still read every diff. Simon Willison's rule from the parallel-agents era holds at that size: do not file pull requests with code you have not reviewed yourself. I leaned on the same fan-out mechanic in the four sub-agent orchestration patterns: emit the Agent calls in a single message and they run concurrently, 14.87 seconds against 43.8 serial. For the durable mechanics behind that fan-out, see the subagent orchestration in production guide. Useful at ten. At a thousand, "read every diff" is not discipline. It is arithmetic that does not close.
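Here is that arithmetic as a back-of-envelope sketch. The five minutes per diff, the six focused review hours per day, and the one-diff-per-agent simplification are illustrative assumptions of mine, not measurements. Only the agent counts come from the text.

```javascript
// Back-of-envelope review load, assuming one diff per agent.
// minutesPerDiff and reviewHoursPerDay are illustrative assumptions.
function reviewLoad(agentCount, minutesPerDiff = 5, reviewHoursPerDay = 6) {
  const hours = (agentCount * minutesPerDiff) / 60;
  const days = hours / reviewHoursPerDay;
  return { hours, days };
}

for (const n of [10, 1000]) {
  const { hours, days } = reviewLoad(n);
  console.log(`${n} diffs: ${hours.toFixed(1)} review hours, ${days.toFixed(1)} reviewer-days`);
}
// 10 diffs: 0.8 review hours, 0.1 reviewer-days
// 1000 diffs: 83.3 review hours, 13.9 reviewer-days
```

Change the assumptions however you like. The thousand-agent row stays in the range of weeks for a single reviewer, against a run that finishes far sooner.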

What replaces the reviewer is, at first, nothing, and then a queue. I made the tool-selection version of this point in when to reach for /batch: the bottleneck in parallel agent work is not the agents, it is the human reviewing what comes back. The measurements have since caught up. Faros AI's 2026 engineering report, drawn from 22,000 developers' telemetry, found median time in code review up 441.5% at AI-adopting organizations, with pull requests merged without any review up 31.3%. Harness, surveying 700 practitioners, reports 69% of frequent AI-coding users hitting deployment problems. Its 2025 predecessor report found only 6% of teams with fully automated continuous delivery, which leaves the pipeline far behind the speed at which AI now writes the code. The verification gap is already in the pipeline. A fleet does not create it. A fleet floods it.

1,000 subagents one dynamic-workflow run can orchestrate (Claude Code docs)
+441.5% median time in code review at AI-adopting orgs (Faros AI)
+31.3% more PRs now merge with no review at all (Faros AI)
69% of frequent AI-coding users hit deployment problems (Harness)

I wrote about the single-agent version of this when Claude Code learned to auto-fix pull requests unattended: the merge gate did not disappear, its meaning moved, from "is the code right" to "did the agent satisfy the checks." A fleet inherits that shift a hundred times over in one run. If you never decided what the checks should certify, you have automated your way to a green light that means less than it used to.

Why "more agents, same review" is the wrong model

The reflex is to treat dynamic workflows as a throughput upgrade. Point the fleet at the work, review what comes back like a normal PR, ship faster. It is a reasonable instinct and it is the one that gets teams hurt, because it assumes verification is invariant to scale. It is not.

The fair objection runs like this: isn't this what CI/CD is for? You already do not read every line. The pipeline does. Mostly right, and worth taking seriously, because a team with strong automated gates faces a smaller change here than the hype implies. But three things break the equivalence.

First, volume. CI gates were sized for commits arriving at human pace. A fleet emits at machine pace and the queue saturates, which is the Faros finding restated as cause rather than symptom. Second, multi-agent systems can degrade quality, not only improve it. A Google and DeepMind study measured independent multi-agent systems ranging from +80.8% on decomposable reasoning to -70.0% on sequential planning, where architecture-task mismatch was the driver. Anthropic's own engineering write-up on multi-agent research notes these systems burn roughly 15 times the tokens of chat and warns that domains with many dependencies between agents "are not a good fit." Coding is dependency-heavy. Third, and simplest, a nominal pass is not correctness. Bun's test-repeats-implementation case is the whole argument in one PR: a green suite that asserts the wrong thing is a confident lie that scales as fast as everything else.

Note

More agents do not add verification. They multiply whatever your contract already does, including nothing. A fleet pointed at a weak gate produces wrong answers faster, with more conviction, across more of the codebase.

What a verification contract is

A verification contract is the set of acceptance criteria, adversarial checks, deterministic gates, and stop conditions that you define before the fleet runs and that any subagent's output must satisfy before it lands. It is the executable answer to one question: what does passing mean here? Four parts carry it.

None of the four is new. Acceptance criteria are test strategy, adversarial verification is code review with the politeness stripped out, deterministic gates are CI, and stop conditions are release policy. The skeptic is right that you already own all four. What changes at fleet scale is that you assemble them into one artifact and design it before the run, instead of bolting each piece on after the agents have already shipped.

Acceptance criteria. What "correct" asserts, stated in terms of observable behavior rather than the code that produced it. The Bun test failed precisely here. It asserted the implementation back to itself. Write the criterion against what the system should do, so that a subagent which rewrites the implementation cannot also rewrite the definition of success.
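A small sketch of the difference, using a made-up example. `slugify` and `normalize` are hypothetical and have nothing to do with the Bun pull request. The implementation carries a deliberate bug so you can see which kind of check notices.

```javascript
// Hypothetical implementation with a deliberate bug: it never strips the
// trailing dash left behind by punctuation at the end of the title.
const normalize = (s) => s.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-");
const slugify = (title) => normalize(title);

// Implementation-echoing check: the expected value is derived from the same
// helper the code uses, so the bug appears on both sides and cancels out.
const echoPasses = slugify("Hello, World!") === normalize("Hello, World!");

// Behavioral check: the expected value is a literal describing what the
// system should do, independent of how the code gets there.
const behaviorPasses = slugify("Hello, World!") === "hello-world";

console.log(echoPasses);     // true  (green, and wrong)
console.log(behaviorPasses); // false (actual output is "hello-world-")
```

An agent that rewrites `normalize` changes both sides of the echo check at once. It cannot change the literal in the behavioral check without editing the criterion itself, and that edit is one you can gate.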

Adversarial verification. Independent agents prompted to refute a result, not to bless it, with a majority rule. The research points the same way. A study of agentic overconfidence found agents that succeed 22% of the time predicting 77% success, a 55-point gap, with adversarial bug-finding prompts calibrating best. Anthropic's guidance on evaluating agents makes the related point: no single grader catches everything, and you have to evaluate the transcript, not just the endpoint.
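The majority rule can be sketched in a few lines. In a real workflow each refuter would be an independent agent prompted to break the claim. Here they are plain async functions, and `survivesMajority` plus the `{ broke, reason }` verdict shape are assumptions for illustration, not part of any platform API.

```javascript
// Each refuter tries to break the claim and resolves to { broke, reason? }.
// Hypothetical shape: in practice these would be independent agents.
async function survivesMajority(claim, refuters) {
  // The async wrapper turns a thrown error into a rejection we can count.
  const settled = await Promise.allSettled(
    refuters.map(async (refute) => refute(claim))
  );
  // Fail closed: a refuter that crashes counts against the claim.
  const held = settled.filter(
    (s) => s.status === "fulfilled" && !s.value.broke
  ).length;
  return { survives: held > refuters.length / 2, held, total: refuters.length };
}

// Stub refuters standing in for agents.
const claim = { text: "parser handles empty input", evidence: "" };
const stubRefuters = [
  async (c) => ({ broke: c.evidence === "", reason: "no evidence attached" }),
  async () => ({ broke: false }),
  async () => ({ broke: false }),
];
survivesMajority(claim, stubRefuters).then((r) =>
  console.log(r.survives, `${r.held}/${r.total}`) // true 2/3
);
```

Counting a crashed refuter as a vote against the claim is a design choice. The alternative lets a flaky verifier quietly lower the bar.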

Deterministic gates. The checks a model should not be the judge of: tests, types, lint, schema invariants, the build. Encode them as code that returns pass or fail, not as a prose instruction an orchestrator can reason its way around. I drew this exact line writing about Claude Code's /advisor: a discretionary check is a quality lift, a structural gate is enforcement. At fleet scale you need the gate, because there is no human in the loop to catch the run where the model talked itself past the rule. The agent reliability in production guide expands that loop across the checks that keep re-running as the system changes.
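A minimal sketch of a gate runner, assuming your checks are ordinary commands that signal failure with a non-zero exit code. `runGates` and the gate object shape are hypothetical names. The demo gates are stand-ins that call Node itself so the sketch runs anywhere, and you would swap in your real test, type-check, and lint commands.

```javascript
// Runs each gate as a child process. A gate passes only on exit code 0;
// a missing command or a killed process also counts as a failure.
async function runGates(gates) {
  // Dynamic import keeps the sketch loadable as an ES module or CommonJS.
  const { spawnSync } = await import("node:child_process");
  const results = gates.map(({ name, cmd, args = [] }) => {
    const { status } = spawnSync(cmd, args, { encoding: "utf8" });
    return { name, ok: status === 0 };
  });
  return { ok: results.every((r) => r.ok), results };
}

// Stand-in gates: replace with your real commands.
const demoGates = [
  { name: "always-passes", cmd: process.execPath, args: ["-e", "process.exit(0)"] },
  { name: "always-fails", cmd: process.execPath, args: ["-e", "process.exit(1)"] },
];
runGates(demoGates).then(({ ok, results }) => {
  console.log(ok); // false: one failed gate blocks the whole result
  for (const r of results) console.log(r.name, r.ok);
});
```

The verdict is a boolean computed from exit codes. There is no prompt in that path for a model to argue with.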

Stop conditions and budget. The 1,000-agent cap is a backstop, not a plan. Decide what halts the fleet: loop until the gates pass, stop after k consecutive rounds find nothing new, cut off at a token ceiling. The CISA and Five Eyes guidance on agent systems says it plainly, that deciding which actions need a human is a job for the system designer, not the agent.
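Those three halts fit in one loop. This is a sketch under stated assumptions: `runUntilStopped` is a hypothetical name, the caller supplies `runWave`, and the `{ gatesPass, newFindings, tokensUsed }` shape it returns is mine, not a platform API. I added a `maxWaves` backstop so the loop always terminates.

```javascript
// Runs waves until a stop condition fires and reports which one.
// runWave(wave) is caller-supplied; its return shape is an assumption.
function runUntilStopped(
  runWave,
  { maxEmptyRounds = 2, tokenCeiling = 1_000_000, maxWaves = 50 } = {}
) {
  let emptyRounds = 0;
  let tokens = 0;
  for (let wave = 1; wave <= maxWaves; wave++) {
    const { gatesPass, newFindings, tokensUsed } = runWave(wave);
    tokens += tokensUsed;
    if (gatesPass) return { reason: "gates-pass", wave, tokens };
    // Count consecutive waves that surfaced nothing new.
    emptyRounds = newFindings === 0 ? emptyRounds + 1 : 0;
    if (emptyRounds >= maxEmptyRounds) return { reason: "k-empty-rounds", wave, tokens };
    if (tokens >= tokenCeiling) return { reason: "token-ceiling", wave, tokens };
  }
  return { reason: "max-waves", wave: maxWaves, tokens };
}

// Stub: the first wave finds three issues, later waves find nothing.
const outcome = runUntilStopped((wave) => ({
  gatesPass: false,
  newFindings: wave === 1 ? 3 : 0,
  tokensUsed: 100,
}));
console.log(outcome.reason, outcome.wave); // k-empty-rounds 3
```

Returning the reason matters as much as stopping. A run that halted on the token ceiling is not the same result as one that halted because the gates passed.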

Figure: The four parts of a verification contract. A vertical flow: fleet output passes through acceptance criteria, adversarial verification, deterministic gates, and stop or budget caps, into an accept-or-reject verdict that either lands on main or loops back to the criteria.
Output runs the gauntlet before it lands. Reject loops it back. You author all of this before the fleet starts.

In code, the sketch is short, which is the point. The contract is small enough to read and strict enough to trust:

verify-contract.mjs (sketch)
// Sketch only: refuters, run, open, budget, and FLOOR are placeholders you
// supply. None of them is a Claude Code API.
// Acceptance: behavioral, not implementation-echoing
const accepts = (out) =>
  out.testsAssertBehavior && out.publicApiUnchanged;
// Adversarial: three refuters attack the claim; it must survive a majority
const survives = (claim) =>
  refuters(3, claim).filter((r) => !r.broke).length >= 2;
// Deterministic gates: a model gets no vote here
const gated = () =>
  run(["npm test", "tsc --noEmit", "lint"]).every((r) => r.ok);
// Stop condition: do not let the fleet run forever
while (open.length && budget.left() > FLOOR) {
  // dispatch the next wave and remove settled items from open
}

This is the schema-first argument I made for diligence pipelines, generalized to code. The schema is the decision you are authorizing the agents to make. Move the orchestration plan into a script and the script stops being plumbing. It becomes the verification spec in a form you can version, test, and enforce, which is the same thing I argued engineering managers should encode into hooks and managed settings: enforcement by design, not by policy.

The honest counter: you cannot design evals for failures you have not seen

The sharpest argument against designing the contract first comes from the people who run evals for a living, and it deserves a straight answer rather than a strawman. Hamel Husain and Shreya Shankar advise against writing evaluators before you build the feature: "Unlike traditional software where failure modes are predictable, LLMs have infinite surface area for potential failures. You can't anticipate what will break." Their prescription is to start from manual error analysis of production traces and build evaluators for the failures you find.

They are right, and they are describing a different artifact. The exhaustive eval suite does emerge from traces. You cannot draw it on day one and you should not pretend to. But the contract is not that suite. The contract is the envelope around it: the shape acceptance has to take, the requirement that something adversarial runs at all, the deterministic gates you already trust today, the condition under which the fleet stops. That envelope is designable before the run. The specific criteria fill in as you watch the fleet fail.

And the envelope earns its keep faster at fleet scale than at single-agent scale, for two reasons. Failure modes repeat across hundreds of subagents, so one run surfaces the distribution that a single agent would reveal one incident at a time. And the leaderboards will not tell you which failures matter: TerminalWorld measured frontier agents topping out at 62.5% on authentic terminal tasks while correlating only r=0.20 with the expert-curated benchmark it was measured against. Even your own pass rate is noisy. SWE-bench-Verified pass@1 swings 2.2 to 6.0 points across identical runs at temperature zero, which means a single green run proves less than it looks like. Evals from your own traces become part of the contract over time, and the contract is what tells you which traces are worth reading.
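One way to act on that noise is to compare repeated runs instead of single ones. The sketch below uses a deliberately conservative rule, where the worst new run must beat the best old run. That rule is my assumption for illustration, not a standard statistical test, and the pass rates are made-up numbers.

```javascript
// Pass rates are percentages from repeated, identical runs.
function spread(rates) {
  return Math.max(...rates) - Math.min(...rates);
}

// Conservative rule (an assumption of this sketch): call it an improvement
// only if every new run beats every old run.
function clearlyBetter(before, after) {
  return Math.min(...after) > Math.max(...before);
}

// Hypothetical numbers, for illustration only.
const baseline = [64.0, 66.5, 68.0];
const candidate = [67.0, 69.5, 70.0];
console.log(spread(baseline));                   // 4
console.log(clearlyBetter(baseline, candidate)); // false: 67.0 does not beat 68.0
```

A single run of the candidate at 70.0 against a single baseline run at 64.0 would have looked like a six-point win. The repeated runs show most of that gap sits inside the noise.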

Design the contract before the fleet runs

The post you are reading went through a pipeline I built, and the pipeline is the argument. Twelve specialist subagents. Five research lanes and five validation lanes that run in parallel. A skeptic-reader whose only assignment is to attack the draft before it ships. Sixteen deterministic validator scripts that block the publish on a single failed check. I do not read every intermediate output those agents produce. I read the verdict, because I decided in advance what the verdict had to certify. That is a verification contract running over subagent orchestration, and it is the only reason I trust a multi-agent pipeline to touch published work.

Before that pipeline, I built a multi-agent persona profiler. Ten or more Opus instances per run. Two analysts read the same transcripts with deliberately opposed framings, a third reconciled the divergences, and twelve validation gates scored the output against a rubric. It landed at 59 out of 60 on voice fidelity. Not because the agents were brilliant, though they were good. It hit the bar because I wrote down what a passing result looked like before any of them ran. The orchestration was downstream of the contract, every time.

That move is what dynamic workflows make mainstream, and the model gives you a little help with it: Anthropic reports Opus 4.8 is around four times less likely than its predecessor to let a flaw in its own code pass unremarked. Take the help. Do not confuse it with the contract. A model that is more honest about its own bugs still does not know what your team means by correct on this codebase. You do, and the fleet will only enforce it if you wrote it down first.

If you are pointing a workflow at real work this week, the starting contract is four moves:

1. State acceptance as behavior. Write what a correct result must do, in observable terms, before you describe the task. A subagent that rewrites the code must not be able to rewrite the definition of done.

2. Add one adversarial pass. Spawn at least one agent whose job is to refute the result, not approve it, and require a majority to let a claim survive.

3. Move judgment into deterministic gates. Tests, types, lint, schema, build. Anything a model should not get a vote on becomes code that returns pass or fail, not a prose instruction.

4. Set the stop condition and budget. Decide what halts the fleet: gates-pass, k-empty-rounds, or a token ceiling. The 1,000-agent cap is a backstop, not a plan.

What you are designing

At fleet scale you are not reviewing code anymore. You are designing the thing that reviews code. The contract is the work, and it is the load-bearing piece of production agentic delivery: not the model, not the prompt, the deterministic distance between a change and main.

A contract is not free once you have it. It rots, it encodes assumptions that go stale, and a contract nobody owns will start blocking changes that should ship. Treat it like any other piece of infrastructure: give it an owner, a review cadence, and a way to change it on purpose. The failure mode there is a different kind of false confidence, this time in the checker instead of the code, and it is the next problem you sign up for once you take this one seriously.

If your team is about to point a fleet at the codebase and you are not certain what catches a bad result, that is the conversation I have most often. Designing the verification layer before the agents run is the engagement that pays for itself, because the alternative is finding out what your contract was missing one production incident at a time. Book a 15-minute call and we will sketch the first contract for your workflow together. If you would rather start with the orchestration mechanics, read the four sub-agent patterns first, then come back and design the gate.