claude-api · claude-code · ai-strategy

Claude Opus 4.7 Is a Split, Not an Upgrade

April 16, 2026 · 6 min read · Mitchel Lairscey

For the last year, I've been telling engineering leaders to pick a Claude model by benchmark delta. Run the eval, pick the top number, move on. Claude Opus 4.7 is making me rewrite that advice.

The release that shipped this morning isn't a smarter Opus. It's a narrower one, tuned for a role the rest of the lineup doesn't fill. And if you're routing Opus and Sonnet as if they're interchangeable tiers of the same product, one of them is about to start costing you more than it should.

The line in the changelog that matters isn't the SWE-bench Pro score. Eight paragraphs deep, the notes describe a behavior change: Opus 4.7 now makes fewer tool calls, relies more on reasoning, and spawns fewer subagents by default. Read that as product strategy, not as tuning.

What shipped, and what the recaps missed

The headline facts are the ones every outlet ran by lunchtime. claude-opus-4-7 is now generally available at the same $5 / $25 per million tokens as Opus 4.6, with a 1M-token context window, 128k max output, and adaptive thinking. GitHub shipped it in Copilot same-day. AWS shipped it in Bedrock same-day. Fastest ecosystem rollout for an Opus release yet.

The benchmark numbers are concrete but concentrated. SWE-bench Verified at 87.6%. Terminal-Bench 2.0 at 69.4%. SWE-bench Pro jumped to 64.3%, up eleven points from Opus 4.6. On Anthropic's internal 93-task coding benchmark, 4.7 resolved 13% more tasks than 4.6, including four that neither 4.6 nor Sonnet 4.6 could solve at all. Rakuten-SWE-Bench: three times more production tasks resolved. The gap is widest where the workload runs long and tool-heavy, and narrowest where it doesn't.

Where it gets interesting is the feature set you can't buy on Sonnet at any price:

| Capability | Opus 4.7 | Sonnet 4.6 |
| --- | --- | --- |
| `task_budget` (advisory token cap across loop) | Yes | No |
| `xhigh` reasoning effort level | Yes | No |
| 2576px 1:1 vision (computer use) | Yes | No |
| File-system memory across turns | Yes | No |
| Max output tokens | 128k | 64k |
| Latency profile | Moderate | Fast |

None of these are incremental. task_budget is a new primitive: the model sees a running token countdown across the entire agentic loop, including its own thinking and tool calls, and paces itself. xhigh is an effort level above anything Sonnet can invoke. The 1:1-pixel vision mode is tuned specifically for computer-use workflows where coordinate math matters. File-system memory persists scratch state across turns. Add the behavioral defaults, fewer tool calls and fewer subagents, and the picture is consistent: Anthropic isn't iterating on a generalist. They're specializing an operator.
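To make the new primitives concrete, here is a sketch of what a request using them might look like. The field names `task_budget` and `effort`, the budget value, and the prompt are all assumptions read off the changelog description, not a verified API schema; check Anthropic's API reference before relying on any of them.

```python
# Hypothetical request payload exercising the Opus-only primitives.
# "task_budget" and "effort" are assumed field names, inferred from the
# release notes -- treat this as a sketch, not a documented schema.
request = {
    "model": "claude-opus-4-7",
    "max_tokens": 8192,
    # Advisory token cap across the whole agentic loop, including
    # thinking and tool calls; the model paces itself against it.
    "task_budget": 200_000,
    # Effort level above anything Sonnet 4.6 can invoke.
    "effort": "xhigh",
    "messages": [
        {"role": "user", "content": "Refactor the billing module."}
    ],
}
```

The interesting design point is that `task_budget` is advisory rather than a hard cutoff: the model sees the countdown and plans around it, instead of being truncated mid-action.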

The routing decision got harder, not easier

The old routing rule was a cost ladder. Opus for hard things, Sonnet for everything else, Haiku if latency matters. Pick the tier, pay the tier, done.

4.7 makes that rule wrong in both directions. If you read Anthropic's models overview carefully, the three models aren't described by intelligence level anymore. Opus 4.7 is "for complex reasoning and agentic coding." Sonnet 4.6 is "the best combination of speed and intelligence." Haiku 4.5 is "the fastest with near-frontier intelligence." Those are workload-shape descriptions, not quality grades.

Three axes matter more than the benchmark delta when you're making the call:

| Axis | When Opus 4.7 earns its premium | When Sonnet 4.6 is still the right call |
| --- | --- | --- |
| Trace length | 10+ turns, tool-heavy, hours of runtime | Single-turn, short-edit, one-shot |
| Tool-call density | Dense, with high cost per failed call | Sparse, with cheap retries |
| Supervision | Unattended or lightly supervised | Developer in the loop |

Cognition reports Opus 4.7 "works coherently for hours and pushes through hard problems rather than giving up." Notion measured one-third the tool errors of Opus 4.6 on complex multi-step workflows, at fewer tokens. Neither of those gains shows up on a single-turn benchmark, and neither is available to you if your production traces are five turns long.

There's also a sleeper cost story nobody in the release coverage has drawn out. Opus 4.7 ships with a new tokenizer that uses roughly 1x to 1.35x as many tokens on the same input as prior models, and at higher effort levels the model "thinks more" before committing to an action. Those two effects compound inside a multi-turn agent. The per-token sticker says Opus is 5x Sonnet. Inside a 20-turn trace, the effective ratio can stretch to 8-10x. The flip side is that the same compounding is what buys you reliability where Sonnet's quality degrades non-linearly.

Tip

The routing question isn't "which model is smarter?" It's "what's the failure cost per turn, and how many turns am I running?" Measure the trace, not the task.

I ran into this directly building the AI Persona Profiler. The pipeline spawns more than ten Opus instances per run, with adversarial dual-analysis, challenger reconciliation, and scored voice fidelity across a twelve-point rubric. That pattern hit 59 out of 60 on voice simulation across five tests. It would not have been possible on Sonnet, not because Sonnet can't reason, but because the loop is exactly where the "spawn fewer subagents, reason more" defaults pay off. 4.7 makes that kind of pipeline more reliable, not more novel. For shorter, tool-heavy workflows, the same pipeline on Sonnet would be faster and cheaper and fine.

Practitioner routing guides have converged on an 80/20 split: roughly 80% of work on Sonnet, 20% on Opus, for 60 to 80% cost savings versus default-Opus routing. That heuristic is still right on volume. What changed is that the 20% is now where Opus 4.7 pulls further away.
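The three axes above can be collapsed into a toy routing rule. The thresholds here are illustrative defaults, not values tuned on real traces; the whole point of measuring your own workloads is to replace them:

```python
def route(turns: int, tool_calls_per_turn: float, unattended: bool) -> str:
    """Toy routing rule over the three axes: trace length,
    tool-call density, and supervision. Thresholds are
    illustrative, not tuned on production data."""
    long_trace = turns >= 10
    dense_tools = tool_calls_per_turn >= 2
    if long_trace and (dense_tools or unattended):
        return "claude-opus-4-7"
    return "claude-sonnet-4-6"

# A long unattended agent run routes to Opus:
route(turns=22, tool_calls_per_turn=3.5, unattended=True)
# A short supervised edit loop stays on Sonnet:
route(turns=2, tool_calls_per_turn=0.5, unattended=False)
```

Note that the rule never asks what the task is called, only what shape its trace takes.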

The convergence argument, and where it breaks

The strongest counter to everything above is the convergence view. Sonnet 4.6 is roughly 1.2 points behind Opus on single-turn coding, statistically indistinguishable on computer-use benchmarks, and one-fifth the price. Buy Opus by default and you're paying 5x for a delta that might be inside benchmark noise. Why treat 4.7 as a bifurcation signal when Sonnet has been quietly eating the same territory?

For the median API call, that argument is right. If you're classifying tickets, summarizing documents, writing single-file changes, or shipping one-shot content, Sonnet wins and the premium on Opus buys you nothing you can measure. I'd route that work to Sonnet today, same as I would have yesterday.

What the convergence argument misses is that single-turn benchmarks are not where 4.7 was tuned. SWE-bench Pro is a different kind of test from SWE-bench Verified, with longer traces, more repository context, and more tool interaction. Opus 4.7 picked up eleven points there while moving barely at all on the simpler evals. A model that got generally smarter would move everywhere. A model that got specifically better at the work Anthropic is telling you to run on it would look exactly like this.

The product strategy is the tell. Anthropic didn't ship 4.7 with a new tokenizer, task_budget, and xhigh effort because they were neutral tuning decisions. They shipped them because those are the features you need when a model is running unsupervised for hours at a time. If Opus were on track to converge with Sonnet, those primitives wouldn't be Opus-only. The fact that they are is a commitment, not a coincidence.

What this means for your Monday

For engineering leaders, the audit is straightforward. Pull the last thirty days of your production Claude usage and sort by trace length, not by task label. Long agent runs on Sonnet pay a reliability tax you don't see on any single invoice, because the cost is split across a thousand retries. Short chat calls on Opus pay a premium for work the smaller model would have handled without a measurable quality difference. Most teams are making both mistakes simultaneously and calling the sum "our Claude bill."
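The audit itself is a few lines once you can export traces. This sketch assumes your logging can emit one record per trace with a model name and a turn count; the field names and sample data are placeholders for whatever your observability stack actually produces:

```python
from collections import Counter

# Sketch of the thirty-day audit: bucket traces by length, not task label.
# `traces` stands in for your exported usage logs; field names are assumptions.
traces = [
    {"model": "claude-sonnet-4-6", "turns": 18},  # long run on the cheap model
    {"model": "claude-opus-4-7", "turns": 2},     # short call on the premium model
    {"model": "claude-sonnet-4-6", "turns": 3},
]

def bucket(trace: dict) -> tuple[str, str]:
    length = "long" if trace["turns"] >= 10 else "short"
    return (length, trace["model"])

mix = Counter(bucket(t) for t in traces)
# ("long", sonnet) entries are the reliability-tax traces;
# ("short", opus) entries are the premium-for-nothing traces.
```

Both mistakes show up as nonzero counts in the same counter, which is the point: they hide inside one aggregate bill until you split them apart.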

For practitioners, three concrete moves this week:

  1. Instrument one production workload end to end. Tokens, turns, tool calls, failure rate. That's your routing input. Guessing is how you end up in the wrong column of the table above.
  2. Try task_budget on your longest agent run, on Opus 4.7. Anthropic just published a cost-governance primitive they didn't have to ship. Use it before your finance team asks questions.
  3. Run the same workload on Sonnet 4.6 with a retry wrapper. If total cost lands within 30% of Opus at equivalent quality, keep Sonnet. If Opus completes where Sonnet stalls, 4.7 earned the move.
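Move 3 reduces to a one-function decision once moves 1 and 2 have given you numbers. This is one reading of the 30% rule, with cost inputs coming from whatever your instrumentation reports:

```python
def pick_model(opus_total_cost: float,
               sonnet_total_cost_with_retries: float,
               sonnet_completed: bool) -> str:
    """One reading of move 3: keep Sonnet if it completes the workload
    and its retry-inclusive total lands within 30% of Opus's total;
    otherwise Opus 4.7 earned the move. Costs come from your own
    instrumentation, not from list prices."""
    within_30_pct = sonnet_total_cost_with_retries <= 1.3 * opus_total_cost
    if sonnet_completed and within_30_pct:
        return "claude-sonnet-4-6"
    return "claude-opus-4-7"
```

The `sonnet_completed` flag matters as much as the cost comparison: a cheap trace that stalls before finishing is not a saving.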

The meta-move is the harder one. Stop thinking about Claude as one model with price tiers. Think about it as a small fleet with role-shaped jobs. For a concrete mapping of models to orchestration roles, see the four sub-agent orchestration patterns: planners and synthesizers on Opus, executors and fan-out workers on Sonnet. That's the strategy Anthropic just announced in product form, and most routing architectures were designed under the old rule. The teams that rebuild their routing around task shape rather than task label are going to spend less and ship more reliable agents. The teams that don't will spend more training budget on tools that don't stick and wonder why their AI bill keeps drifting up.

The next six months

Opus 4.7 isn't the end of this split. Anthropic's Mythos preview is still behind glass, with its own role carved out for cybersecurity research workloads. The architecture looks stable: one operator model, one workhorse model, one fast-turn model, each tuned for a different shape of work. The next Sonnet release will probably widen the gap in the opposite direction, taking on more of the fast-turn territory that used to justify Haiku.

The routing decisions that matter here aren't about which model is smartest. They're about which workloads you've measured. If you haven't traced a production run in the last month, that's where to start. When you have the trace, the Claude API work I take on almost always begins by reading one: token counts, turn-by-turn tool calls, retry patterns, and the shape of where the model was asked to do something it wasn't built for. Then the routing decisions write themselves.

If you want a second pair of eyes on a specific production trace, that's the kind of conversation I'm set up to have. Book thirty minutes and bring the numbers.


Want to talk about how this applies to your team?

Book a Free Intro Call

Not ready for a call? Take the free AI Readiness Assessment instead.
