claude-api · claude-code · ai-strategy · leadership

Claude Opus 4.7: Your Eval Harness Can't See What Just Changed

April 17, 2026 · 22 min read · Mitchel Lairscey

Claude Opus 4.7 shipped on April 16, 2026. It jumped 10.9 points on SWE-bench Pro, moved 6.8 points on SWE-bench Verified, and quietly regressed 4.4 points on BrowseComp. Three benchmarks, one release, three different verdicts about whether the model got better.

Your internal eval almost certainly cannot tell you which of the three numbers applies to your workload. That is the problem.

The release notes lead with task budgets, xhigh reasoning effort, high-resolution vision, and an adaptive-thinking-only API. Those features ship as advertised. But the feature drop is not where the upgrade lives for most teams. The upgrade lives in a set of operator-model behaviors that the evals most teams run cannot see: longer traces that stay coherent, fewer tool calls with fewer errors, more literal instruction-following that breaks prompts tuned on Opus 4.6, and better file-system memory that rewards good scratchpad design and punishes lazy checkpoint architecture. None of those show up on a single-turn coding benchmark.

Anthropic said the quiet part out loud themselves. Their official best-practices post for Opus 4.7 contains this sentence: "If your code-review harness was tuned for an earlier model, you may initially see lower recall. This is likely a harness effect, not a capability regression." Their Head of Claude Code, Boris Cherny, posted that it "took a few days for me to learn how to work with it effectively." When the team that built the model is telling you the old harness lies, your static eval is telling you even less.

The thesis of this post is narrow and defensible. Opus 4.7 is a genuine improvement over 4.6 on the workloads it was specialized for. That improvement is largely invisible to the evals most teams are running. Teams that rebuild their harnesses around the four dimensions where 4.7 pulls away will find out whether the premium earns its way on their specific production mix. Teams that do not will either walk away from a model that would have paid for itself or route to it blind and watch the bill climb.

This post is the operator's guide to Opus 4.7. What shipped. What the benchmarks really say. Why your eval cannot see the upgrade. How to rebuild it. The routing and migration consequences. And the silent regressions most migration checklists are missing.

What shipped on April 16, 2026: Claude Opus 4.7 in one page

Claude Opus 4.7 is Anthropic's generally available flagship coding and agentic-reasoning model, released April 16, 2026 at the same per-token pricing as Opus 4.6. It ships with a 1M token context window, 128k max output on the Messages API (300k via the Batches beta), and adaptive thinking as the only supported thinking mode. Same-day availability across Amazon Bedrock, Google Vertex AI, Microsoft Foundry, Snowflake Cortex, GitHub Copilot, Cursor, Harvey, and Cognition's Devin. Model ID: claude-opus-4-7.

That is the part you can quote. The rest matters more if you are running 4.7 in production.

The features most teams will touch first

task_budget is a new beta primitive on Opus 4.7. It lives at output_config.task_budget and takes an integer number of tokens (minimum 20,000). The budget covers the whole agentic loop: thinking, tool calls, tool results, and final output. Anthropic's docs describe it as "not a hard cap" but "a suggestion that the model is aware of." The model sees a running countdown server-side and paces itself toward a graceful finish. Your max_tokens is still the hard per-request ceiling. Requires the task-budgets-2026-03-13 beta header. Opus-only at launch, not supported on Claude Code or Cowork surfaces.
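Put together, a task-budget request might look like the following sketch. The field layout is a best-effort reading of the parameter shapes described above (output_config.task_budget, the beta header, the 20,000-token floor); treat anything beyond those quoted details as an assumption, not a verified schema.

```python
# Sketch of a Messages API request body using the task_budget beta,
# per the parameter shapes described above. Not a verified schema.
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 64000,  # still the hard per-request ceiling
    "output_config": {
        "effort": "xhigh",        # Anthropic's recommendation for agentic work
        "task_budget": 150_000,   # soft target for the whole agentic loop
    },
    "messages": [
        {"role": "user", "content": "Migrate the billing module to the new schema."}
    ],
}

headers = {"anthropic-beta": "task-budgets-2026-03-13"}  # required beta header

# The documented floor: budgets under 20,000 tokens return a 400.
assert payload["output_config"]["task_budget"] >= 20_000
```

The budget is a pacing hint the model sees server-side; max_tokens remains the only hard cap, so size it independently of the budget.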

xhigh effort is a new level on the effort ladder, which now runs low / medium / high / xhigh / max on 4.7. Five levels, not four. It lives at output_config.effort, defaults to high, and Anthropic recommends "xhigh for coding and agentic use cases, and high as the minimum for most intelligence-sensitive workloads." xhigh costs meaningfully more than high. max exists on both 4.7 and 4.6; xhigh is Opus 4.7 only.

Adaptive thinking replaces extended thinking. On 4.7, the only thinking-on mode is thinking: {type: "adaptive"}. Setting thinking: {type: "enabled", budget_tokens: N} returns a 400. Sonnet 4.6 and Opus 4.6 still accept budget_tokens but deprecate it. Two silent regressions worth knowing now: adaptive thinking is OFF by default on 4.7 (omit the field, get zero thinking), and thinking.display defaults to "omitted" (apps rendering reasoning show nothing unless you set display: "summarized" explicitly). Interleaved thinking auto-enables under adaptive; remove the legacy beta header if you had it.

High-resolution vision maxes at 2576 pixels / 3.75 megapixels, up from 1568 / 1.15. Model-returned coordinates are 1:1 with pixels, so no scale-factor math for pointing or bounding boxes. Automatic on 4.7, no beta header required. Image tokens run roughly 3x higher at full resolution than 4.6, so image-heavy workloads need more max_tokens headroom.

File-system memory improvements are a capability claim, not a new API surface. The pre-existing memory_20250818 tool is the mechanism; 4.7 is better at using it. Anthropic's docs: "Claude Opus 4.7 is better at writing and using file-system-based memory." Scratchpads, notes files, structured memory stores that carry across turns all benefit.

New tokenizer produces 1.0 to 1.35x as many tokens for the same text as prior models, with the upper end most common on code, structured data, and non-English content. Per-token pricing is unchanged. Effective cost per equivalent prompt rises. Anthropic's own advice: audit max_tokens headroom and re-run /v1/messages/count_tokens before committing to cost projections.
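The audit is simple arithmetic once you have counts from /v1/messages/count_tokens under both model IDs for the same representative prompt. A minimal sketch; the function names are illustrative, not part of any SDK:

```python
def tokenizer_inflation(old_tokens: int, new_tokens: int) -> float:
    """Ratio of 4.7-tokenizer tokens to 4.6-tokenizer tokens for one prompt."""
    return new_tokens / old_tokens

def effective_cost(old_cost_usd: float, inflation: float) -> float:
    """Per-token price is unchanged, so effective cost scales with token count."""
    return old_cost_usd * inflation

# A prompt that counted 10,000 tokens on 4.6 and 13,500 on 4.7:
ratio = tokenizer_inflation(10_000, 13_500)  # 1.35, the documented upper end
cost = effective_cost(0.10, ratio)           # a $0.10 prompt becomes ~$0.135
```

Run this over a sample of real production prompts, not one synthetic prompt: the 1.0-to-1.35x range means your actual inflation depends on your mix of code, structured data, and prose.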

Behavioral defaults ship differently than 4.6. The what's-new doc: "Fewer tool calls by default, using reasoning more. Raising effort increases tool usage." And: "Fewer subagents spawned by default. Steerable through prompting." Response length calibrates to task complexity. Instruction-following is more literal. Tone is more direct with fewer emoji. These are the defaults that feed the eval problem this post is about, because they are invisible to single-turn harnesses.

Most coverage stops at the feature tour. The more interesting question is what this configuration tells you about where the model was specialized, and whether your production workload lives inside that specialization or next to it.

The Claude Opus 4.7 benchmark divergence: headline wins, quiet regressions

Three clusters of benchmark results tell you most of what you need to know about how Anthropic tuned 4.7. They also tell you why any single "did the model get better" answer is misleading.

Where 4.7 pulled away

The big gains cluster on long-trace, tool-heavy, multi-step work.

  • SWE-bench Pro: 53.4% to 64.3% (+10.9 points), ahead of GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. SWE-bench Pro is 1,865 tasks across 41 production-grade repositories; each task averages roughly 107 lines of code across about four files.
  • Rakuten-SWE-Bench: 3x more production tasks resolved than Opus 4.6, with double-digit gains in Code Quality and Test Quality. Anthropic positions this as a production-grade counterpart to SWE-bench Verified.
  • CursorBench: 58% to 70%+. Cursor CEO Michael Truell: "Claude Opus 4.7 is a very impressive coding model, particularly for its autonomy and more creative reasoning. On CursorBench, Opus 4.7 is a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%."
  • Anthropic's internal 93-task coding benchmark: +13% resolution over 4.6, including four tasks that neither 4.6 nor Sonnet 4.6 could solve at all.
  • Terminal-Bench 2.0: 65.4% to 69.4%. A four-point gain. GPT-5.4 leads this benchmark at 75.1%, so Opus is not the frontier here.
  • MCP-Atlas: 77.3%, the top score of any frontier model on this benchmark. Vellum reads it correctly: "MCP-Atlas measures performance across complex, multi-turn tool-calling scenarios. It's the closest thing to a real production agent benchmark."

Cognition's Scott Wu summarized what this looks like at scale: "Claude Opus 4.7 takes long-horizon autonomy to a new level in Devin. It works coherently for hours, pushes through hard problems rather than giving up, and unlocks a class of deep investigation work we couldn't reliably run before." Notion's Sarah Sachs reported a third the tool errors of 4.6 on complex multi-step workflows at fewer tokens, alongside a 14% lift in resolution over 4.6.

Every data point in this cluster describes the same workload shape: long traces, dense tool use, running mostly unsupervised.

Where it is inside the noise

On the shorter, simpler, frontier-capability benchmarks, the gap narrows to the point of irrelevance.

  • SWE-bench Verified: 80.8% to 87.6%. A visible jump on paper, but the benchmark was saturated before 4.7 shipped. Six pre-4.7 frontier models sit within 1.3 points of each other on the March 2026 leaderboard (Opus 4.5 at 80.9%, Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, GPT-5.2 at 80.0%, Sonnet 4.6 at 79.6%). Opus 4.7 at 87.6% breaks out of that cluster, but the benchmark is approaching a ceiling that narrows interpretive signal.
  • GPQA Diamond: Opus 4.7 (94.2%), GPT-5.4 Pro (94.4%), Gemini 3.1 Pro (94.3%). Within 0.2 points. You cannot distinguish the models here.

Worse, METR has independently shown that "roughly half of test-passing SWE-bench Verified PRs" written by mid-2024 through late-2025 agents would not be merged into main by repo maintainers. Their analysis of 296 AI-generated PRs from scikit-learn, Sphinx, and pytest found an average 24-percentage-point gap between SWE-bench scores and actual maintainer merge decisions. Whatever Opus 4.7's 87.6% means, it does not mean you have an AI engineer whose PRs your team should merge at an 87.6% rate.

Where it regressed

One number matters here, and it is pointedly not in Anthropic's headline chart.

  • BrowseComp: 83.7% to 79.3% (−4.4 points) at a 10M-token test-time-compute scaling limit. Anthropic acknowledges 4.6 has the better deep-research scaling curve at that budget. GPT-5.4 Pro scores 89.3% on the same benchmark.

BrowseComp measures multi-step web research: browse, synthesize, reason across pages. If your agent workload lives there, Opus 4.7 is slightly worse than 4.6 and meaningfully worse than GPT-5.4. Anthropic is not hiding this number. They are not leading with it either.

Anthropic also deliberately scaled back 4.7's cybersecurity vulnerability-reproduction capabilities as a policy decision. A rare public case of regressing a dimension on purpose. Named in the system card (CyberGym scores dropped from 73.8% on 4.6 to 73.1% on 4.7; Anthropic describes "experiments with efforts to differentially reduce these capabilities" during training), picked up by The Decoder and LessWrong. Worth knowing if you ran red-team or vuln-research workflows on 4.6.

The three clusters at a glance

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-bench Pro (long-trace coding) | 53.4% | 64.3% | 57.7% | 54.2% |
| Rakuten-SWE-Bench (production) | baseline | 3x resolved | n/a | n/a |
| CursorBench (autonomous coding) | 58% | 70%+ | n/a | n/a |
| Terminal-Bench 2.0 (CLI agent) | 65.4% | 69.4% | 75.1% | n/a |
| SWE-bench Verified (single-turn) | 80.8% | 87.6% | n/a | 80.6% |
| GPQA Diamond (reasoning, saturated) | n/a | 94.2% | 94.4% | 94.3% |
| BrowseComp (deep research, 10M) | 83.7% | 79.3% | 89.3% | n/a |

The pattern is consistent. 4.7 is a specialization upgrade, not a generalist one. Measurably better where the workload runs long and tool-heavy. Inside noise on short single-turn frontier evals. Meaningfully worse on one specific research-scaling dimension. That is a product decision, not a tuning accident.

Which brings us to the question this post exists to answer: does your eval suite know which of those three buckets your actual workload lives in?

Why your single-turn eval harness can't see the Claude Opus 4.7 upgrade

Anthropic's own post on Opus 4.7 best practices contains the thesis of this piece. "If your code-review harness was tuned for an earlier model, you may initially see lower recall. This is likely a harness effect, not a capability regression." That sentence should be the most-quoted line of the release. It is also the one teams running point-in-time benchmarks against their vendor-evaluation checklist are most likely to miss.

Here is what that "harness effect" covers, broken into four dimensions your current eval suite is probably not measuring.

1. Trace length

Single-turn evals score a one-shot answer to a one-shot prompt. Opus 4.7's biggest gain is coherence over long traces. Cognition's Scott Wu described the pipeline class they unlocked as "deep investigation work we couldn't reliably run before." Anthropic's own engineering blog defines the unit: "a transcript, also called a trace or trajectory, is the complete record of a trial, including outputs, tool calls, reasoning, intermediate results, and any other interactions." If your grader only sees the final response, you are grading the last mile. The model's gains are in the nineteen miles before it.

Academic research backs this up. The OpenReview paper on multi-turn code generation benchmarked 32 models on a multi-turn security corpus called MT-Sec. The headline finding: a consistent 20 to 27 percent drop in "correct and secure" outputs from single-turn to multi-turn settings, even among the frontier closed-source models. Better single-turn performance does not entail better multi-turn performance. A single-turn harness is a lower bound on what your model does in production, not a predictor.

2. Tool-call density and recovery

Notion reports a third the tool errors of Opus 4.6 on complex multi-step workflows, at fewer tokens. "Fewer" is a word that does not appear on a single-turn benchmark. The benchmark assumes one tool call per turn, or no tool calls at all. Production agents that call six tools per turn and chain fifteen turns are running inside an error-compounding system where Opus 4.7's tool-call discipline is the entire game. Pick the right tool first, save a turn. Recover gracefully when a tool fails, save the whole run. Your eval either measures this or it does not.

3. Instruction literalism

This is the one your static eval cannot surface at all, because it is a property of your production prompts, not of the model.

Gabriel Anhaia, a practitioner who spent six hours with 4.7 on release day, summarized a behavior change that matters: where a prompt written for Opus 4.6 has sloppy instructions, ambiguous scopes, or contradictory constraints, 4.7 follows the letter of those instructions, not the spirit. Anthropic's own guidance reinforces this. Opus 4.7 "interprets prompts more literally and explicitly than Claude Opus 4.6, particularly at lower effort levels." It will not silently generalize an instruction from one item to another. It will not infer requests you did not make.

Evals run on clean prompts. Production traffic runs on the prompt your busy product manager wrote eighteen months ago and nobody has rereviewed since. Your eval will pass. Your production recall will drop.

Important

The eval harness that proved Opus 4.6 was adequate on your workload is probably the same harness that will tell you 4.7 is worse. It is not. You have just switched to a model that takes your prompts at face value.

4. Checkpoint integrity

Opus 4.7 spawns fewer subagents by default and reasons more between tool calls. Combined with improved file-system memory, the model now rewards architectures that write good scratchpad notes and punishes architectures that depend on the model re-deriving context from scratch every turn. This is a structural property, not a per-trial measurement. You surface it only by running agents that persist state across turns and evaluating the quality of handoffs, checkpoints, and recovery-from-disk. It does not appear on any public benchmark.
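One way to make that structural property measurable is a redo-rate score: kill a run, resume a fresh agent from the scratchpad alone, and count how much of the resumed trace repeats work already done. A sketch, with the action-list representation as an assumption for illustration:

```python
def checkpoint_integrity_score(pre_kill_actions, post_resume_actions):
    """Fraction of post-resume actions that redo work done before the kill.

    0.0 means a clean pickup from the scratchpad; values near 1.0 mean the
    agent effectively started over. Comparable action strings are assumed.
    """
    done = set(pre_kill_actions)
    if not post_resume_actions:
        return None  # nothing to score
    redone = sum(1 for action in post_resume_actions if action in done)
    return redone / len(post_resume_actions)

# An agent that repeated two of its three post-resume actions:
score = checkpoint_integrity_score(
    ["read config", "edit parser", "run tests"],
    ["read config", "edit parser", "edit formatter"],
)
# score == 2/3: most of the resumed run was redundant
```

This is deliberately crude. The point is that it scores the handoff itself, which no final-answer grader can do.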

FOUR DIMENSIONS YOUR SINGLE-TURN EVAL CANNOT SEE

  1. Trace length. Single-turn grades the last mile. Opus 4.7's gains are in the nineteen miles before it.
  2. Tool-call density. Notion: one-third the tool errors of 4.6 on complex multi-step workflows, at fewer tokens.
  3. Instruction literalism. Evals run on clean prompts. Production runs on the prompt nobody has rereviewed since 2024.
  4. Checkpoint integrity. 4.7 rewards scratchpad design. Fewer subagents by default. State lives on disk, not in context.

The steel man

The sharpest counter to this thesis: rebuilding an eval harness is a three-to-six-week project, and for many production workloads the answer after rebuild is "Opus 4.6 was fine." That is correct. For RAG pipelines, classification, single-file changes, document summarization, and most of the work that lives inside a five-turn budget, the rebuild does not pay for itself because there is no delta to measure. Opus 4.7 is meaningfully better only where the workload is meaningfully agentic.

The steel man sharpens the thesis rather than refuting it. You cannot know which side of that line your workload falls on without the rebuild. The rebuild is not a commitment to upgrade. It is a commitment to stop guessing.

The experiential part

I have built and shipped eval harnesses for long-running agent loops, including the twenty-turn AI Readiness Assessment that runs on Sonnet 4.6 with adaptive thinking. When I benchmarked the Assessment against a single-turn rubric during development, the single-turn version missed the failure modes that drove the redesign in production. Nothing about that lesson was 4.7-specific. What 4.7 changed is that the penalty for continuing to use a single-turn rubric now bites harder, because the model on the other side of your harness is quietly much better at things the harness does not measure.

If the thesis is right, the practical response is a rebuild. The rest of this post is about what the rebuild looks like.

What to measure instead: rebuilding your eval for operator-model workloads

The good news is that Anthropic has published the canonical guide to this exact rebuild, and almost nobody is reading it as a guide to Opus 4.7.

Anthropic's engineering post on demystifying evals for AI agents predates Opus 4.7 by three months, but it defines every piece of the harness you now need. A trace is the complete record of a trial. A grader scores some aspect of performance. A task can have multiple graders, each with multiple assertions. Anthropic's own advice: start with 20 to 50 simple tasks drawn from production failures, then iterate.

If you apply that definition honestly, your existing single-turn benchmark is not an eval. It is a vibe check with a score attached.

Pull your failures, not your successes

The first practical step is to ignore every task where your current production agent works well. Pull the 20 to 50 cases where it visibly struggled over the past 30 days: loops that never finished, answers that broke under follow-up questions, tool chains that gave up halfway through, PRs that passed your CI and still got rejected in review. Those are your seed tasks. Your production failures, not synthetic ones.

The reason this step matters is that most publicly available evals were built by someone who did not run your agent. Their failures are not your failures. An eval tuned against your own production transcripts is an eval that will surface what Opus 4.7 changes for you, not what it changes for someone else's coverage metric.

Grade the trace, not the answer

For each seed task, capture the full trace: every tool call, every thinking segment, every retry, every intermediate result. Anthropic's own language is precise about this. "When a task fails, the transcript tells you whether the agent made a genuine mistake or whether your graders rejected a valid solution."

Then write graders that score the four dimensions from the previous section.

  • Trace-length grader. Does the agent stay coherent across 10+ turns? Does it remember what it did on turn 3 when asked about it on turn 17?
  • Tool-call-density grader. What fraction of tool calls succeeded on first try? When a tool failed, did the model recover or spiral?
  • Instruction-literalism grader. Feed two versions of the same prompt: one clean, one with the sloppy phrasing your production traffic contains. How far did the outputs diverge?
  • Checkpoint-integrity grader. Kill the trace mid-run. Rehydrate from scratchpad notes. Can the agent pick up where it left off, or does it start over?

None of those graders is a line of pass/fail test code. Each is a small harness of its own: structured scoring, potentially with a model-as-judge layer for graders that need qualitative assessment. Anthropic's post walks through how to build these. Most teams skip the work because it is hard, and then wonder why the benchmark numbers do not match the production bill.
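To make "a small harness of its own" concrete, here is a minimal tool-call-density grader over a trace. The event schema (type, status, recovered fields) is an assumption for illustration, not a real trace format; adapt it to however your harness records tool calls:

```python
def grade_tool_calls(trace: list) -> dict:
    """Score first-try success and failure recovery across a full trace."""
    calls = [e for e in trace if e.get("type") == "tool_call"]
    if not calls:
        return {"first_try_rate": None, "recovery_rate": None}
    ok = [c for c in calls if c["status"] == "ok"]
    failed = [c for c in calls if c["status"] == "error"]
    recovered = [c for c in failed if c.get("recovered")]
    return {
        "first_try_rate": len(ok) / len(calls),
        # Vacuously perfect recovery when nothing failed:
        "recovery_rate": len(recovered) / len(failed) if failed else 1.0,
    }

trace = [
    {"type": "tool_call", "status": "ok"},
    {"type": "tool_call", "status": "error", "recovered": True},
    {"type": "text", "content": "..."},  # non-tool events are ignored
]
scores = grade_tool_calls(trace)  # {"first_try_rate": 0.5, "recovery_rate": 1.0}
```

Notice it returns rubric scores, not pass/fail. You compare these numbers across model versions on the same seed corpus; a single run's absolute value means little.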

The proof-in-the-wild case

The clearest public example of what this kind of eval surfaces comes from CodeRabbit. They run 100 known issues across pull requests in major open-source projects. Each issue maps to a specific verified bug that a good code reviewer should catch. The harness is the same test set from run to run.

When they swapped Opus 4.6 for Opus 4.7 inside their existing code-review harness, the score went from 55 of 100 to 68 of 100. A 24% relative improvement on the same test set. Recall improved in double digits; precision did not regress. No retune required. Model swap only.

That number exists because CodeRabbit had already done the rebuild. Their harness scores whether the model catches the specific bug that matters, not whether the model produces a clean diff for a synthetic task. The work they did last quarter paid for itself the moment a new model shipped.

If you are building this for the first time

A workable minimum viable eval for an operator-model workload has four things:

  1. A corpus of 20 to 50 production traces, drawn from failures and edge cases, not happy paths.
  2. A trace replay mechanism, so you can run the same inputs through new model versions and compare turn-by-turn.
  3. Four graders covering trace length, tool-call recovery, instruction-literalism sensitivity, and checkpoint integrity. Rubric scored, not pass/fail.
  4. A shadow-traffic channel, so you can run 4.7 against a slice of live traffic in parallel with 4.6 and compare the full trace, not just the final answer.
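Wired together, those four pieces reduce to a small replay loop. Everything named here (run_model, the grader interface, the corpus shape) is an assumed interface for the sketch, not a real SDK:

```python
def replay(corpus, run_model, graders):
    """Replay each seed task through a candidate model, score the full trace."""
    results = []
    for task in corpus:
        trace = run_model(task["inputs"])  # must return the complete trace
        results.append({
            "task_id": task["id"],
            "scores": {name: grade(trace) for name, grade in graders.items()},
        })
    return results

# Same corpus, same graders, two models; the deltas are the eval:
# baseline  = replay(corpus, run_opus_4_6, graders)
# candidate = replay(corpus, run_opus_4_7, graders)
```

The discipline this enforces is the whole point: the corpus and graders are fixed, only the model varies, and every comparison is trace-level rather than answer-level.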

If you want a second pair of eyes on this rebuild for a specific production workload, that conversation is exactly what my advisory engagements cover. The rebuild is not where the expensive mistakes happen. The routing and migration decisions that get made against a thin eval are.

Routing Claude Opus 4.7 vs Sonnet 4.6 with eyes open

Once your eval can see the four dimensions that matter, the routing call writes itself. Not because the answer is obvious, but because the cost and the gain become visible for the same workload at the same time.

The shape of the decision looks like this.

| Dimension | Workload characteristic | Default routing |
| --- | --- | --- |
| Trace length | 10+ turns, tool-heavy, runs for hours unsupervised | Opus 4.7 |
| Trace length | 1-5 turns, developer in the loop | Sonnet 4.6 |
| Tool-call density | High, with compounding cost per failed call | Opus 4.7 |
| Tool-call density | Sparse, cheap retries | Sonnet 4.6 |
| Vision | Computer-use, high-res screenshots, document pointing | Opus 4.7 (1:1 pixel coordinates are here only) |
| Deep research | Heavy BrowseComp-style multi-page synthesis at 10M tokens | Opus 4.6 (the one benchmark 4.7 regressed on) |
| Single-turn coding | Classification, RAG responses, one-shot changes | Sonnet 4.6 (Anthropic recommends it; 40% cheaper per token) |

Finout's pricing analysis puts it bluntly: "Sonnet 4.6 is 40% cheaper per input token and 40% cheaper per output token than Opus, and for most production inference (classification, RAG responses, content generation, basic tool use) it remains the right default." The new tokenizer compounds that further. A prompt that cost $0.10 on Opus 4.6 can cost anywhere from $0.10 to $0.135 on 4.7, driven entirely by the 1.0 to 1.35x token inflation. On a long-running agent, that compound turns ugly quickly.

The GitHub Copilot signal

GitHub shipped Opus 4.7 at a 7.5x premium-request multiplier in Copilot Pro+, Business, and Enterprise tiers, up from 3x for Opus 4.6. Promotional through April 30, 2026; the post-promotional rate is unannounced. Even the platform that has the strongest incentive to make Claude routable for developers is signaling "use sparingly." Read that as a cost signal, not a capability signal. GitHub is not saying 4.7 is twice as smart. They are saying it is more than twice as expensive to run a request on, and they do not want every Copilot user defaulting to it.

The companion lens

For the deeper strategic take on why Anthropic is specializing Opus rather than making it a general upgrade, see the companion post Claude Opus 4.7 Is a Split, Not an Upgrade. This post covers the measurement problem; that one covers the product-strategy lens. Read them together and the routing picture is complete.

The meta-rule

Route by workload shape, not by task label. A "complex" ticket that resolves in three turns runs on Sonnet. A "simple" refactor that spawns eight subagents and browses your repo for twenty minutes runs on Opus 4.7. The benchmark number tells you almost nothing about which category your specific workload falls into. The rebuilt eval tells you everything.

Migrating from Opus 4.6 to Claude Opus 4.7 without breaking production

Opus 4.7 is better understood as an API redesign with a model upgrade attached. That framing captures the experience of running a migration more accurately than any "drop in the new model ID" guide.

There are ten breaking or silently-changed behaviors in the Opus 4.6 to Opus 4.7 migration path. Five of them throw a 400 error, which is easy to catch. Five of them change behavior silently, which is not.

The five that throw 400

  1. thinking: {type: "enabled", budget_tokens: N} returns 400. Opus 4.7 is adaptive-thinking-only. Replace with thinking: {type: "adaptive"} plus output_config: {effort: ...}.
  2. temperature, top_p, top_k at non-default values return 400. Remove them.
  3. Assistant-message prefills return 400. Use structured outputs or system prompts instead.
  4. task_budget calls without the task-budgets-2026-03-13 beta header return 400.
  5. task_budget.total values under 20,000 tokens return 400.

These are loud. Your CI will catch them if you have good error monitoring. They are not the ones that will quietly reduce quality in production.

The five silent changes

  1. Adaptive thinking is OFF by default on 4.7. If you omit the thinking field entirely, Opus 4.7 does zero thinking. Opus 4.6 defaulted to adaptive-on. Code that relied on that default gets no thinking on 4.7. No error, just dumber output.
  2. Thinking display defaults to "omitted" on 4.7, not "summarized". Applications that render reasoning for users will show empty output unless you explicitly set thinking.display: "summarized". No error.
  3. Tokenizer inflation up to 1.35x. Per-token price is unchanged; same prompts cost more. Anthropic's advice: audit max_tokens headroom, re-run /v1/messages/count_tokens, include compaction triggers in your workflow.
  4. Effort calibration is stricter at low and medium. Code that worked at effort: "low" on 4.6 may under-think on 4.7. The model now respects "low" as "do exactly what is asked, no more." Raise effort or add explicit multi-step guidance.
  5. Default behaviors shifted. Fewer tool calls, fewer subagents, more literal prompt following. Any scaffolding that nudged 4.6 ("after every 3 tool calls, summarize progress," "spawn subagents for X") should be reviewed. Anthropic's migration guide says it plainly: "A prompt and harness review may be especially useful." Read that sentence carefully.

The migration checklist, in order

  1. Update model ID from claude-opus-4-6 to claude-opus-4-7.
  2. Remove temperature, top_p, top_k from request payloads.
  3. Replace thinking: {type: "enabled", budget_tokens: N} with thinking: {type: "adaptive"} plus output_config: {effort: "high"} (or xhigh for agentic work).
  4. If your UI renders thinking, explicitly set thinking.display: "summarized".
  5. Remove assistant-message prefills. Use structured outputs or system prompts instead.
  6. Remove legacy beta headers: effort-2025-11-24, fine-grained-tool-streaming-2025-05-14, interleaved-thinking-2025-05-14.
  7. Raise max_tokens at xhigh and max effort to at least 64k. At full-resolution vision, budget roughly 3x more image tokens than on 4.6.
  8. Re-run /v1/messages/count_tokens against representative production prompts. Recalibrate cost projections.
  9. Audit all prompts for scaffolding tuned against 4.6 defaults. Remove guidance that presumes many tool calls, many subagents, or soft interpretation.
  10. Replay representative production traffic in a shadow channel. Compare full traces, not just final answers, for at least 48 hours before cutover.
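Steps 1 through 6 of the checklist are mechanical enough to express in code. A sketch against a raw request payload; the field shapes come from the migration notes above, and anything beyond them is an assumption:

```python
def migrate_payload(p: dict) -> dict:
    """Apply the mechanical checklist steps to an Opus 4.6 request payload."""
    # Step 2: non-default sampler params now return 400.
    q = {k: v for k, v in p.items() if k not in ("temperature", "top_p", "top_k")}
    q["model"] = "claude-opus-4-7"  # step 1
    # Step 3: adaptive is the only thinking-on mode, and it is OFF unless set.
    # Step 4: display defaults to "omitted"; opt in if your UI renders reasoning.
    q["thinking"] = {"type": "adaptive", "display": "summarized"}
    output_config = dict(q.get("output_config", {}))
    output_config.setdefault("effort", "high")  # raise to "xhigh" for agentic work
    q["output_config"] = output_config
    # Step 5: assistant-message prefills now return 400.
    messages = list(q.get("messages", []))
    if messages and messages[-1].get("role") == "assistant":
        messages.pop()
    q["messages"] = messages
    return q
```

Steps 7 through 10 are the ones code cannot do for you: headroom, cost recalibration, prompt audit, and the shadow replay. Budget the week for those.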

Decrypt's reviewer, running 4.7 on release day, reported a single session that depleted his entire token quota for the first time in testing. The root cause was the model redoing the entire output multiple times under variants of "rewrite with bug fixes and improvements." That kind of behavioral change is visible only in the trace, not in the final artifact. If your cost projections for 4.7 come from multiplying 4.6 volumes by 1.35x and calling it a day, you are under-budgeting.

What the first week of Claude Opus 4.7 looks like for your team

If you came here wanting a concrete plan for the next five working days, here it is.

Monday. Pull the last 30 days of production traces for your highest-value Claude workload. Sort by trace length. Anything over five turns is a candidate for the operator-model bucket. Anything under is probably not a candidate for Opus 4.7 at all.

Tuesday. Select 20 to 50 traces from your long-trace bucket, weighted toward failures. Tag each with the dominant failure mode: never finished, broke on follow-up, tool chain gave up, passed CI but got rejected. That tagged set is your seed eval.

Wednesday. Build the four graders described above, at whatever fidelity you can ship in a day. Even a rough grader is more signal than the single-turn benchmark you are running now. The point of this day is to stop grading the final artifact and start grading the trace.

Thursday. Run the seed eval against Opus 4.6 (baseline) and Opus 4.7 (candidate) on shadow traffic. Measure full traces, not just answers. Compare the four graders side by side.

Friday. Decide, in this order:

  1. Does Opus 4.7 pay for itself on the workloads where the long-trace gains apply? If yes, start the migration.
  2. Is Opus 4.6 still the right choice for your BrowseComp-shaped workloads? For deep-research agents at 10M-token budgets, probably yes. Keep 4.6 in your routing.
  3. What fraction of your traffic should default to Sonnet 4.6? For most teams, a larger fraction than they think. Opus premiums apply whether the trace is long or thin. Do not pay them on thin ones.

The rebuild takes longer than five days. The decisions that come out of it are worth the investment. If you do not have the five days to spare, the cheapest possible step is still the honest one: stop trusting your current single-turn eval to tell you whether Opus 4.7 earned its premium on your production traffic. It cannot.

Claude Opus 4.7 FAQ

What is Claude Opus 4.7?

Claude Opus 4.7 is Anthropic's generally available flagship model, released April 16, 2026. It is specialized for long-horizon agentic coding and tool-heavy workloads. Features include task budgets, xhigh reasoning effort, adaptive-only thinking, 2576-pixel high-resolution vision with 1:1 pixel coordinates, improved file-system memory, and a 1M-token context window. Available on the Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry, Snowflake Cortex, GitHub Copilot, Cursor, Cognition's Devin, and Harvey.

How is Claude Opus 4.7 different from Claude Opus 4.6?

The headline difference is specialization. Opus 4.7 scores +10.9 on SWE-bench Pro, resolves 3x more Rakuten-SWE-Bench tasks, and reports one-third the tool errors on Cursor's multi-step benchmark. It also regressed 4.4 points on BrowseComp at 10M-token scaling. A new tokenizer produces 1.0 to 1.35x the tokens for the same input. Adaptive thinking is now the only supported thinking mode. Full details in the Anthropic what's-new docs.

When should I use Claude Opus 4.7 vs Claude Sonnet 4.6?

Use Claude Opus 4.7 for long-trace, tool-heavy, unsupervised agent workloads: runs over ten turns, dense tool calls, high cost per failed call. Use Claude Sonnet 4.6 for everything else, which for most teams is 70 to 90 percent of API traffic. Sonnet is roughly 40 percent cheaper per token, remains near frontier on single-turn work, and is what Anthropic recommends as the default for most production inference.

How much does Claude Opus 4.7 cost?

Per-token pricing for Claude Opus 4.7 is the same as Opus 4.6. See the Anthropic pricing page for canonical rates. Effective cost per equivalent prompt is not the same, because the new tokenizer produces up to 35 percent more tokens for the same input. GitHub Copilot lists Opus 4.7 at a 7.5x premium-request multiplier, promotional through April 30, 2026. Before migrating a production workload, replay traffic side by side and measure the effective cost delta.
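The tokenizer effect is back-of-envelope arithmetic, worth writing down because it is easy to miss when the sticker price is unchanged. The figures below are illustrative, not real pricing; `inflation` is whatever ratio your own replay measures.

```python
# Effective cost when per-token price holds but token count inflates.
def effective_cost_delta(tokens_old, price_per_mtok, inflation):
    old_cost = tokens_old / 1e6 * price_per_mtok
    new_cost = tokens_old * inflation / 1e6 * price_per_mtok
    return new_cost - old_cost, (new_cost / old_cost - 1) * 100

# Example: a workload that was 2M tokens at a hypothetical $15/MTok,
# at the worst-case 1.35x inflation.
delta, pct = effective_cost_delta(
    tokens_old=2_000_000, price_per_mtok=15.0, inflation=1.35
)
# same sticker price, 35 percent higher effective spend on this workload
```

This is also why "replay traffic side by side" is the honest measurement: inflation varies by content, so your real multiplier is somewhere between 1.0x and 1.35x and only your own prompts can tell you where.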

What is xhigh effort in Claude Opus 4.7?

xhigh is a new level on Opus 4.7's effort ladder, which runs low / medium / high / xhigh / max. Anthropic recommends xhigh as the default for coding and agentic use cases. It produces meaningfully more thinking tokens than high, and therefore costs meaningfully more. Effort is set via output_config.effort on the Messages API. Documentation: Anthropic effort docs.
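A minimal sketch of a Messages API request body with effort set, following the `output_config.effort` field named above. The model ID string and the exact schema are assumptions; verify both against the Anthropic effort docs before shipping.

```python
# Hypothetical request body with effort pinned to xhigh.
request_body = {
    "model": "claude-opus-4-7",  # assumed model ID
    "max_tokens": 16000,
    "output_config": {"effort": "xhigh"},  # low|medium|high|xhigh|max
    "messages": [
        {"role": "user", "content": "Refactor the retry logic in client.py"}
    ],
}
```

Because xhigh spends more thinking tokens than high, the sane pattern is to set it per-request for the coding and agentic paths rather than as a global default.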

What are task budgets in Claude Opus 4.7?

task_budget is a beta primitive on Claude Opus 4.7 that advises the model of a total token target across the entire agentic loop: thinking plus tool calls plus tool results plus final output. The minimum is 20,000 tokens. The model sees a running countdown server-side and paces itself. It is a soft hint, not a hard cap; max_tokens remains the hard ceiling. Beta header task-budgets-2026-03-13 required. Opus 4.7 only. Not supported on Claude Code or Cowork surfaces at launch.
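A sketch of what a task-budget request looks like, assuming the field name and beta header exactly as described above; treat the shape as an assumption until verified against the what's-new docs.

```python
# Hypothetical task-budget request. task_budget is a soft pacing hint
# across the whole agentic loop; max_tokens stays the per-response
# hard ceiling.
headers = {"anthropic-beta": "task-budgets-2026-03-13"}

request_body = {
    "model": "claude-opus-4-7",   # assumed model ID
    "max_tokens": 32000,          # hard cap, still enforced per response
    "task_budget": 60000,         # soft hint for the full loop; minimum 20,000
    "messages": [
        {"role": "user", "content": "Fix the failing integration tests"}
    ],
}

assert request_body["task_budget"] >= 20000
```

The useful mental model is that task_budget shapes pacing, not truncation: the model sees the countdown and budgets its thinking and tool calls, while max_tokens is what actually stops a runaway response.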

Should I migrate to Claude Opus 4.7?

Not without a rebuilt eval harness. The migration introduces five breaking API changes and five silent behavior changes, including adaptive thinking being off by default and tokenizer inflation compounding cost. For long-trace, tool-heavy workloads, 4.7 is likely to pay for itself. For RAG, classification, single-turn coding, and deep-research agents at 10M-token scaling, it may not. The answer depends on your workload, which your current eval probably cannot measure. Rebuild the eval, then decide.

Your Claude Opus 4.7 reading path from here

This post is the measurement piece of the puzzle. If you want the full operator-model picture, these are the threads to pull next.

Going deeper on routing and model selection. The companion strategic take on Opus 4.7's positioning: Claude Opus 4.7 Is a Split, Not an Upgrade. A framework for deciding which problems deserve Opus-tier reasoning versus Sonnet-tier speed: The Three-Question Filter.

Going deeper on multi-turn agent architecture. What 20 turns of production traffic taught me about the Claude API: What 20-Turn Conversations Taught Me. The four sub-agent orchestration patterns that cover most production Claude workloads: Sub-Agent Orchestration Patterns. The plan-audit-implement-verify cycle that makes long-horizon work trustworthy: The Agentic Development Starter Guide.

Going deeper on evals and production patterns. The five architectural patterns that separate prototype agents from production: Beyond the Wrapper. The governance layer that makes those patterns enforceable across a team: The Engineering Manager's Guide to Agentic Development Governance.

Going deeper on adoption. Why teams buy AI licenses and still have stalled adoption three months later, and what breaks through: Your Team Bought AI Licenses Three Months Ago.

For a diagnostic baseline. If you want to benchmark your organization's AI readiness across five dimensions before deciding where to invest the rebuild energy, the AI Readiness Assessment is ten minutes and produces a personalized action plan. It runs on Sonnet 4.6 with adaptive thinking, not Opus 4.7, and was built as a production example of the kind of workload-shaped eval this post argues for.

If you want a second pair of eyes on a specific production workload, my advisory engagements are the fit. Bring the eval you have, and we will take it apart together.


Want to talk about how this applies to your team?

Book a Free Intro Call

Not ready for a call? Take the free AI Readiness Assessment instead.

Keep reading