claude-api · claude-code · ai-strategy

Claude Opus 4.7 Is a Split, Not an Upgrade

April 16, 2026 · 6 min read · Mitchel Lairscey

For the last year, I've been telling engineering leaders to pick a Claude model by benchmark delta. Run the eval, pick the top number, move on. Claude Opus 4.7 is making me rewrite that advice.

The release that shipped this morning isn't a smarter Opus. It's a narrower one, tuned for a role the rest of the lineup doesn't fill. And if you're routing Opus and Sonnet as if they're interchangeable tiers of the same product, one of them is about to start costing you more than it should.

The line in the changelog that matters isn't the SWE-bench Pro score. Eight paragraphs deep, the notes describe a behavior change: Opus 4.7 now makes fewer tool calls, relies more on reasoning, and spawns fewer subagents by default. Read that as product strategy, not as tuning.

What shipped, and what the recaps missed

The headline facts are the ones every outlet ran by lunchtime. claude-opus-4-7 is now generally available at the same $5 / $25 per million tokens as Opus 4.6, with a 1M-token context window, 128k max output, and adaptive thinking. GitHub shipped it in Copilot same-day. AWS shipped it in Bedrock same-day. Fastest ecosystem rollout for an Opus release yet.

The benchmark numbers are concrete but concentrated. SWE-bench Verified at 87.6%. Terminal-Bench 2.0 at 69.4%. SWE-bench Pro jumped to 64.3%, up eleven points from Opus 4.6. On Anthropic's internal 93-task coding benchmark, 4.7 resolved 13% more tasks than 4.6, including four that neither 4.6 nor Sonnet 4.6 could solve at all. Rakuten-SWE-Bench: three times more production tasks resolved. The gap is widest where the workload runs long and tool-heavy, and narrowest where it doesn't.

Where it gets interesting is the feature set you can't buy on Sonnet at any price:

| Capability | Opus 4.7 | Sonnet 4.6 |
| --- | --- | --- |
| `task_budget` (advisory token cap across loop) | Yes | No |
| `xhigh` reasoning effort level | Yes | No |
| 2576px 1:1 vision (computer use) | Yes | No |
| File-system memory across turns | Yes | No |
| Max output tokens | 128k | 64k |
| Latency profile | Moderate | Fast |

None of these are incremental. task_budget is a new primitive: the model sees a running token countdown across the entire agentic loop, including its own thinking and tool calls, and paces itself. xhigh is an effort level above anything Sonnet can invoke. The 1:1-pixel vision mode is tuned specifically for computer-use workflows where coordinate math matters. File-system memory persists scratch state across turns. Add the behavioral defaults, fewer tool calls and fewer subagents, and the picture is consistent: Anthropic isn't iterating on a generalist. They're specializing an operator.
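To make the new primitives concrete, here is a sketch of what a request using them might look like. The field names `task_budget` and `effort`, the budget value, and the prompt are all assumptions read off the changelog description, not a verified API schema; check Anthropic's API reference before relying on any of them.

```python
# Hypothetical request payload exercising the Opus-only primitives.
# "task_budget" and "effort" are assumed field names, inferred from the
# release notes -- treat this as a sketch, not a documented schema.
request = {
    "model": "claude-opus-4-7",
    "max_tokens": 8192,
    # Advisory token cap across the whole agentic loop, including
    # thinking and tool calls; the model paces itself against it.
    "task_budget": 200_000,
    # Effort level above anything Sonnet 4.6 can invoke.
    "effort": "xhigh",
    "messages": [
        {"role": "user", "content": "Refactor the billing module."}
    ],
}
```

The interesting design point is that `task_budget` is advisory rather than a hard cutoff: the model sees the countdown and plans around it, instead of being truncated mid-action.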

The routing decision got harder, not easier

The old routing rule was a cost ladder. Opus for hard things, Sonnet for everything else, Haiku if latency matters. Pick the tier, pay the tier, done.

4.7 makes that rule wrong in both directions. If you read Anthropic's models overview carefully, the three models aren't described by intelligence level anymore. Opus 4.7 is "for complex reasoning and agentic coding." Sonnet 4.6 is "the best combination of speed and intelligence." Haiku 4.5 is "the fastest with near-frontier intelligence." Those are workload-shape descriptions, not quality grades.

Three axes matter more than the benchmark delta when you're making the call:

| Axis | When Opus 4.7 earns its premium | When Sonnet 4.6 is still the right call |
| --- | --- | --- |
| Trace length | 10+ turns, tool-heavy, hours of runtime | Single-turn, short-edit, one-shot |
| Tool-call density | Dense, with high cost per failed call | Sparse, with cheap retries |
| Supervision | Unattended or lightly supervised | Developer in the loop |

Cognition reports Opus 4.7 "works coherently for hours and pushes through hard problems rather than giving up." Notion measured one-third the tool errors of Opus 4.6 on complex multi-step workflows, at fewer tokens. Neither of those gains shows up on a single-turn benchmark, and neither is available to you if your production traces are five turns long.

There's also a sleeper cost story nobody in the release coverage has drawn out. Opus 4.7 ships with a new tokenizer that uses roughly 1x to 1.35x as many tokens on the same input as prior models, and at higher effort levels the model "thinks more" before committing to an action. Those two effects compound inside a multi-turn agent. The per-token sticker says Opus is 5x Sonnet. Inside a 20-turn trace, the effective ratio can stretch to 8-10x. The flip side is that the same compounding is what buys you reliability where Sonnet's quality degrades non-linearly.

Tip

The routing question isn't "which model is smarter?" It's "what's the failure cost per turn, and how many turns am I running?" Measure the trace, not the task.

I ran into this directly building the AI Persona Profiler. The pipeline spawns more than ten Opus instances per run, with adversarial dual-analysis, challenger reconciliation, and scored voice fidelity across a twelve-point rubric. That pattern hit 59 out of 60 on voice simulation across five tests. It would not have been possible on Sonnet, not because Sonnet can't reason, but because the loop is exactly where the "spawn fewer subagents, reason more" defaults pay off. 4.7 makes that kind of pipeline more reliable, not more novel. For shorter, tool-heavy workflows, the same pipeline on Sonnet would be faster and cheaper and fine.

Practitioner routing guides have converged on an 80/20 split: roughly 80% of work on Sonnet, 20% on Opus, for 60 to 80% cost savings versus default-Opus routing. That heuristic is still right on volume. What changed is that the 20% is now where Opus 4.7 pulls further away.
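The three axes above can be collapsed into a toy routing rule. The thresholds here are illustrative defaults, not values tuned on real traces; the whole point of measuring your own workloads is to replace them:

```python
def route(turns: int, tool_calls_per_turn: float, unattended: bool) -> str:
    """Toy routing rule over the three axes: trace length,
    tool-call density, and supervision. Thresholds are
    illustrative, not tuned on production data."""
    long_trace = turns >= 10
    dense_tools = tool_calls_per_turn >= 2
    if long_trace and (dense_tools or unattended):
        return "claude-opus-4-7"
    return "claude-sonnet-4-6"

# A long unattended agent run routes to Opus:
route(turns=22, tool_calls_per_turn=3.5, unattended=True)
# A short supervised edit loop stays on Sonnet:
route(turns=2, tool_calls_per_turn=0.5, unattended=False)
```

Note that the rule never asks what the task is called, only what shape its trace takes.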

The convergence argument, and where it breaks

The strongest counter to everything above is the convergence view. Sonnet 4.6 is roughly 1.2 points behind Opus on single-turn coding, statistically indistinguishable on computer-use benchmarks, and one-fifth the price. Buy Opus by default and you're paying 5x for a delta that might be inside benchmark noise. Why treat 4.7 as a bifurcation signal when Sonnet has been quietly eating the same territory?

For the median API call, that argument is right. If you're classifying tickets, summarizing documents, writing single-file changes, or shipping one-shot content, Sonnet wins and the premium on Opus buys you nothing you can measure. I'd route that work to Sonnet today, same as I would have yesterday.

What the convergence argument misses is that single-turn benchmarks are not where 4.7 was tuned. SWE-bench Pro is a different kind of test from SWE-bench Verified, with longer traces, more repository context, and more tool interaction. Opus 4.7 picked up eleven points there while moving barely at all on the simpler evals. A model that got generally smarter would move everywhere. A model that got specifically better at the work Anthropic is telling you to run on it would look exactly like this.

The product strategy is the tell. Anthropic didn't ship 4.7 with a new tokenizer, task_budget, and xhigh effort because they were neutral tuning decisions. They shipped them because those are the features you need when a model is running unsupervised for hours at a time. If Opus were on track to converge with Sonnet, those primitives wouldn't be Opus-only. The fact that they are is a commitment, not a coincidence.

What this means for your Monday

For engineering leaders, the audit is straightforward. Pull the last thirty days of your production Claude usage and sort by trace length, not by task label. Long agent runs on Sonnet pay a reliability tax you don't see on any single invoice, because the cost is split across a thousand retries. Short chat calls on Opus pay a premium for work the smaller model would have handled without a measurable quality difference. Most teams are making both mistakes simultaneously and calling the sum "our Claude bill."
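The audit itself is a few lines once you can export traces. This sketch assumes your logging can emit one record per trace with a model name and a turn count; the field names and sample data are placeholders for whatever your observability stack actually produces:

```python
from collections import Counter

# Sketch of the thirty-day audit: bucket traces by length, not task label.
# `traces` stands in for your exported usage logs; field names are assumptions.
traces = [
    {"model": "claude-sonnet-4-6", "turns": 18},  # long run on the cheap model
    {"model": "claude-opus-4-7", "turns": 2},     # short call on the premium model
    {"model": "claude-sonnet-4-6", "turns": 3},
]

def bucket(trace: dict) -> tuple[str, str]:
    length = "long" if trace["turns"] >= 10 else "short"
    return (length, trace["model"])

mix = Counter(bucket(t) for t in traces)
# ("long", sonnet) entries are the reliability-tax traces;
# ("short", opus) entries are the premium-for-nothing traces.
```

Both mistakes show up as nonzero counts in the same counter, which is the point: they hide inside one aggregate bill until you split them apart.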

For practitioners, three concrete moves this week:

  1. Instrument one production workload end to end. Tokens, turns, tool calls, failure rate. That's your routing input. Guessing is how you end up in the wrong column of the table above.
  2. Try task_budget on your longest agent run, on Opus 4.7. Anthropic just published a cost-governance primitive they didn't have to ship. Use it before your finance team asks questions.
  3. Run the same workload on Sonnet 4.6 with a retry wrapper. If total cost lands within 30% of Opus at equivalent quality, keep Sonnet. If Opus completes where Sonnet stalls, 4.7 earned the move.
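Move 3 reduces to a one-function decision once moves 1 and 2 have given you numbers. This is one reading of the 30% rule, with cost inputs coming from whatever your instrumentation reports:

```python
def pick_model(opus_total_cost: float,
               sonnet_total_cost_with_retries: float,
               sonnet_completed: bool) -> str:
    """One reading of move 3: keep Sonnet if it completes the workload
    and its retry-inclusive total lands within 30% of Opus's total;
    otherwise Opus 4.7 earned the move. Costs come from your own
    instrumentation, not from list prices."""
    within_30_pct = sonnet_total_cost_with_retries <= 1.3 * opus_total_cost
    if sonnet_completed and within_30_pct:
        return "claude-sonnet-4-6"
    return "claude-opus-4-7"
```

The `sonnet_completed` flag matters as much as the cost comparison: a cheap trace that stalls before finishing is not a saving.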

The meta-move is the harder one. Stop thinking about Claude as one model with price tiers. Think about it as a small fleet with role-shaped jobs. For a concrete mapping of models to orchestration roles, see the four sub-agent orchestration patterns: planners and synthesizers on Opus, executors and fan-out workers on Sonnet. That's the strategy Anthropic just announced in product form, and most routing architectures were designed under the old rule. The teams that rebuild their routing around task shape rather than task label are going to spend less and ship more reliable agents. The teams that don't will spend more training budget on tools that don't stick and wonder why their AI bill keeps drifting up.

The next six months

Opus 4.7 isn't the end of this split. Anthropic's Mythos preview is still behind glass, with its own role carved out for cybersecurity research workloads. The architecture looks stable: one operator model, one workhorse model, one fast-turn model, each tuned for a different shape of work. The next Sonnet release will probably widen the gap in the opposite direction, taking on more of the fast-turn territory that used to justify Haiku.

The routing decisions that matter here aren't about which model is smartest. They're about which workloads you've measured. If you haven't traced a production run in the last month, that's where to start. When you have the trace, the Claude API work I take on almost always begins by reading one: token counts, turn-by-turn tool calls, retry patterns, and the shape of where the model was asked to do something it wasn't built for. Then the routing decisions write themselves.

If you want a second pair of eyes on a specific production trace, that's the kind of conversation I'm set up to have. Book thirty minutes and bring the numbers.


Want to talk about how this applies to your team?

Book a Free Intro Call

Not ready for a call? Take the free AI Readiness Assessment instead.
