On a per-token basis, an Opus 4.7 cache read costs $0.50 per million tokens, half the $1.00 an uncached Haiku 4.5 read costs, per Anthropic's pricing page on 2026-05-08. The "cheap model versus flagship" frame that drives most Claude routing decisions falls apart at the workload pattern that anchors most enterprise RAG and customer-support deployments: long stable system prompts hit thousands of times a day.
Routing is also not the one-axis decision the comparison guides describe. Picking a model is half the call. Choosing an effort tier is the other half, and the matrix has structural asymmetries that change the shape of the question. xhigh effort is Opus 4.7 only. Haiku 4.5 does not support the effort parameter at all.
This post is the reference. What the decision surface looks like, what each cell costs, where the routing logic breaks against prompt caching, and how I wire it for production.
Model selection is the wrong question. Choose (model x effort).
The conventional advice is a cost ladder. Haiku for cheap tasks. Sonnet for default work. Opus when something matters. Every comparison guide I read while researching this post (Anthropic's own choosing-a-model docs, CloudHesive's enterprise guide, three SitePoint and Medium roundups) frames it the same way: pick a model, pay the model's tier, ship. The seat tier has the same trap; the true cost of an AI coding tool is mostly the bill below the sticker.
Anthropic's effort parameter is a second routing axis with five tiers (max, xhigh, high, medium, low) and meaningful behavioral differences per tier. The model picks a capability ceiling. The effort tier picks how much of that ceiling Claude exercises on any given call. From the effort docs: "Effort is a behavioral signal, not a strict token budget." At max, Claude always thinks. At low, it minimizes thinking and skips it on simple queries.
The matrix is structurally asymmetric. Build any routing logic on top of it without learning the asymmetries and you will pick cells that do not exist.
| Model | max | xhigh | medium | low |
|---|---|---|---|---|
| Opus 4.7 | yes | yes (only model) | yes | yes |
| Opus 4.6 | yes | not supported | yes | yes |
| Sonnet 4.6 | yes | not supported | yes | yes |
| Haiku 4.5 | n/a | n/a | n/a | n/a |
At the API level, high is the default. Omit output_config.effort from the call and Opus 4.7, Opus 4.6, and Sonnet 4.6 all run at high. Claude Code overrides this default and ships xhigh for Opus 4.7 specifically (more on that below).
Haiku 4.5 row is structurally different: it does not accept output_config.effort. Thinking depth on Haiku is controlled the legacy way through budget_tokens. Treating Haiku as "a tier of the same routing system" is a category error.
Four pieces of context travel with each row.
| Model | Released | Context / max output | Thinking | Latency |
|---|---|---|---|---|
| Opus 4.7 | 2026-04-16 | 1M / 128k | adaptive only (off by default) | Moderate |
| Opus 4.6 | 2026-02-05 | 1M / 128k | adaptive recommended; manual deprecated | Moderate |
| Sonnet 4.6 | 2026-02-17 | 1M / 64k | adaptive recommended | Fast |
| Haiku 4.5 | 2025-10-15 | 200k / 64k | manual via budget_tokens only | Fastest |
Two implications worth naming. xhigh is Opus 4.7 only, so any routing strategy that relies on "cheap deep thinking" lives only on the new flagship. And there is a quieter trap I have watched cost a team a week of confusion: Claude Code defaults to different effort levels by model version (high on Opus 4.6, xhigh on Opus 4.7). Two engineers running "the same work" on the two models are not running the same work. More on that in Half Your Team Is on Opus 4.6, Half Is on 4.7.
What each cell of the matrix costs
The base output rate ratio between Opus 4.7 and Haiku 4.5 is 5x, not 100x. I want to be precise about this because the "one model is two orders of magnitude cheaper than another" framing is everywhere, and it does not match current Anthropic pricing.
That is the standard-rate picture. The full per-cell economics, including cache and batch tiers, are a five-column table.
| Model | Input | Cache write (5m / 1h) | Cache read | Output |
|---|---|---|---|---|
| Opus 4.7 | $5.00 | $6.25 / $10.00 | $0.50 | $25.00 |
| Opus 4.6 | $5.00 | $6.25 / $10.00 | $0.50 | $25.00 |
| Opus 4.6 Fast Mode | $30.00 | * | * | $150.00 |
| Sonnet 4.6 | $3.00 | $3.75 / $6.00 | $0.30 | $15.00 |
| Haiku 4.5 | $1.00 | $1.25 / $2.00 | $0.10 | $5.00 |
All values $/MTok. Batch API gives a flat 50% discount on input and output for every model except Opus 4.6 Fast Mode. The cache multipliers (1.25x for 5m writes, 2x for 1h writes, 0.1x for reads) apply on top of Fast Mode's 6x premium per Anthropic's pricing page; the asterisks above indicate "compute against the Fast Mode base." Source: Anthropic pricing page, 2026-05-08.
Two amplifiers compound on top of the base rates. First, effort tier changes how many tokens the model consumes on a call: xhigh and max generate more thinking tokens, more tool-call tokens, longer responses. Anthropic does not publish per-tier multipliers ("a behavioral signal, not a strict token budget"), so any "this tier costs Nx that tier" claim is workload-derived, not standard. Second, Opus 4.7 uses a new tokenizer that consumes up to 1.35x more tokens for the same input text vs. Opus 4.6 and earlier models. Cost goes up even at the same per-MTok price.
The cheapest cell on the matrix is Haiku 4.5 at batch rates ($0.50 input / $2.50 output per MTok). The most expensive standard cell is Opus 4.6 Fast Mode ($30 input / $150 output per MTok), a 6x premium for guaranteed throughput. That is a verifiable 60x output ratio between corners. Going further requires unpublished assumptions about token volume per call. The "100x cost spread" you see in routing folklore is not supportable from current Anthropic production pricing.
When the cost-spread argument breaks: prompt caching
Here is the inversion. Cached Opus 4.7 input costs $0.50 per MTok. Uncached Haiku 4.5 input costs $1.00 per MTok. For workloads with stable prefixes hit ten or more times within the cache window (5 minutes default, 1 hour with the longer cache write), reading from a primed Opus cache is cheaper per input token than calling Haiku from scratch.
Routing to Haiku 4.5 for cheap input
- $1.00 per MTok uncached input
- 200k context window
- Capability ceiling: 73.3% SWE-bench Verified at 128k thinking budget
- Re-pays the input cost on every call
Caching on Opus 4.7 with a stable prefix
- $0.50 per MTok cache read (half the price of Haiku uncached)
- 1M context window
- Capability ceiling: 87.6% SWE-bench Verified
- 5m default cache window, 1h with the longer cache write tier
This collapses one of the strongest arguments for routing: that you can reach for cheaper models to bring down per-call cost. For cache-heavy workloads (enterprise RAG over a stable corpus, customer support with a fixed system prompt, Claude Code with a stable repository context) the cost-tier intuition flips. The right move can be "cache aggressively on a single high-quality model," not "route every call to the cheapest tier that will tolerate the task."
Where does routing still win? Four conditions.
- Variable-prefix workloads. If every request has a different system prompt, caching doesn't amortize. Routing on model and effort is your only cost knob.
- Output-heavy generation. Cache pricing applies to input tokens only. A 4k-token response is 5x more expensive on Opus 4.7 ($25/MTok) than on Haiku 4.5 ($5/MTok); the output ratio isn't affected by caching.
- Latency-bound classification and triage. Haiku is the "Fastest" tier in Anthropic's overview; Opus is "Moderate." For a real-time intent classifier on a customer-support stream, you want Haiku regardless of what the cache math says about cost.
- Below ~1,000 requests/day. LogRocket's practitioner analysis is right: at roughly $300/month total LLM spend, the engineering time to build and maintain a routing layer exceeds the savings.
How to route in production: task-typing plus per-tier budgets
The architecture I run, and the one I recommend to teams I work with, has three layers.
Classify, don't guess. Every request enters the routing layer with a task type attached. Inside the Ready Solutions Assessment Worker, a Haiku 4.5 call classifies the incoming session into one of four audience archetypes (technical builder, business operator, creative operator, product strategist), and a separate routing rule tags each request with one of three turn types (continuation, normal, override) for effort tuning. Both classifications happen on Haiku because the work is latency-bound triage, and Haiku at the 200-token-prompt scale runs at fractions of a cent per call.
Pick the cell, pin both knobs. Once a task type is classified, the routing layer chooses a (model, effort) pair, never a model alone. The Assessment Worker's production routing.
Turn 0: Classify (Haiku 4.5)
Haiku call detects audience archetype (technical builder, business operator, creative operator, product strategist) and a turn-type tag (continuation, normal, override). Latency-bound triage at sub-cent cost per call.
Turns 1 to 19: Conversational (Sonnet 4.6 + adaptive thinking)
Mid-tier model with adaptive thinking handles dimension scoring, follow-ups, and steering. Sonnet because the conversation is reasoning-bound but volume is high.
Turn 20: Synthesis (Opus 4.7 + xhigh + summarized thinking)
Flagship at maximum coding-class effort builds the final assessment, max_tokens=96000. Opus + xhigh because the trace is long, the failure cost is high, and this is the only call where it lives.
Fallback path: any 5xx/429/400 (Sonnet 4.6 with scrubbed config)
Drops xhigh to high, drops summarized display, drops 96k to 32k. The synthesis still ships.
That same architecture pattern shows up in the AI Persona Profiler: Haiku 4.5 for fast classification calls, Opus 4.7 for synthesis where coherence matters across long traces. Ten or more coordinated Opus instances per pipeline run. The model and effort choices are explicit at every node, not inherited from a default.
Budget per tier. Each (model, effort) cell gets a monthly cost ceiling. If the Opus 4.7 + xhigh tier is meant to handle 5% of traffic, its budget reflects 5% of expected volume. When the threshold trips, the routing layer either falls back to the next-cheaper cell or surfaces an alert. This is the difference between "we have a routing strategy" and "we have a routing strategy that survives a 10x traffic spike."
Six routing failure modes I have watched cost teams measurable money.
- All-Opus-everything as default. Pays flagship rates for work Sonnet at medium handles cleanly.
- Opus + low effort. Paying for the highest capability ceiling, then telling Claude not to use it. Use Sonnet at high instead.
- Haiku for hard reasoning. Capability ceiling is the wrong shape. The work loops or fails silently.
- Sonnet without adaptive thinking on hard tasks. Cheap mode applied to expensive work, with no signal that thinking would have helped.
- Opus 4.6 Fast Mode by default. Six times the standard rate for guaranteed throughput, used in places where standard latency was already fine.
- Model set without effort pinned. The SDK or the IDE picks an effort tier you did not choose. Same model name, different bill.
The most expensive of those, in my experience, is the last one. Setting model without pinning effort. Claude Code's defaults differ across Opus versions. Every internal SDK wrapper I have audited (across consulting clients and my own day job) stores the model string in config but inherits the effort tier from whatever the SDK ships. A team running "Opus 4.7 for the hard stuff" is silently running it at xhigh on Claude Code (highest cost tier) and at high on direct API calls (one tier lower). Same model name. Different bill. Different latency. Different bench number.
Pick the cell, pin the cell
Routing is a two-knob decision. Set both knobs intentionally. That is the architecture.
If your team is wiring or rewiring its Claude routing layer, the work I do as part of Claude API Development and Support covers exactly this: auditing where each request lands on the matrix, designing the classifier, and pinning the per-tier budgets. Schedule a 15-minute call and we will walk through your current routing decisions together.
For deeper reading on the same axes: the Opus 4.7 vs 4.6 effort ladder walks through where xhigh is worth the tokens, Claude Opus 4.7 Is a Split, Not an Upgrade is the predecessor argument this post extends, and What 20-Turn Conversations Taught Me About the Claude API is where the turn-typed routing pattern came from.