Claude Routing Has Two Knobs, Not One: A Model x Effort Matrix for 2026

On a per-token basis, an Opus 4.7 cache read costs $0.50 per million tokens, half the $1.00 that uncached Haiku 4.5 input costs, per Anthropic's pricing page, rechecked 2026-06-10. The "cheap model versus flagship" frame that drives most Claude routing decisions falls apart on the workload pattern that anchors most enterprise RAG and customer-support deployments: long stable system prompts hit thousands of times a day.

Routing is also not the one-axis decision the comparison guides describe. Picking a model is half the call. Choosing an effort tier is the other half, and the matrix has structural asymmetries that change the shape of the question. xhigh effort is limited to the top of the lineup: Claude Fable 5 and the Opus flagships, 4.8 and 4.7. Haiku 4.5 does not support the effort parameter at all.

This post is the reference: what the decision surface looks like, what each cell costs, where the routing logic breaks against prompt caching, and how I wire it for production.

Model selection is the wrong question. Choose (model x effort).

The conventional advice is a cost ladder. Haiku for cheap tasks. Sonnet for default work. Opus when something matters. Every comparison guide I read when this post first went up in May (Anthropic's own choosing-a-model docs included, plus CloudHesive's enterprise guide and three SitePoint and Medium roundups) framed it the same way: pick a model, pay the model's tier, ship. Anthropic's docs have since added effort as a fourth selection criterion; the third-party guides mostly have not. The seat tier has the same trap; the true cost of an AI coding tool is mostly the bill below the sticker.

What the cost ladder misses is the second routing axis: Anthropic's effort parameter, five tiers (max, xhigh, high, medium, low) with meaningful behavioral differences per tier. The model picks a capability ceiling. The effort tier picks how much of that ceiling Claude exercises on any given call. From the effort docs: "Effort is a behavioral signal, not a strict token budget." At max, Claude always thinks. At low, it minimizes thinking and skips it on simple queries.

The matrix is structurally asymmetric. Build any routing logic on top of it without learning the asymmetries and you'll pick cells that don't exist.

Model	max	xhigh	high	medium	low
Fable 5	yes	yes	yes	yes	yes
Opus 4.8	yes	yes	yes	yes	yes
Opus 4.7	yes	yes	yes	yes	yes
Opus 4.6	yes	not supported	yes	yes	yes
Sonnet 4.6	yes	not supported	yes	yes	yes
Haiku 4.5	n/a	n/a	n/a	n/a	n/a

At the API level, high is the default. Omit output_config.effort from the call and Fable 5, Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 all run at high. Claude Code overrides this default and ships xhigh for Opus 4.7 specifically; on Fable 5 it stays at high (more on that below).

What about Haiku 4.5? Its row is structurally different: it doesn't accept output_config.effort. Thinking depth on Haiku is controlled the legacy way through budget_tokens. Treating Haiku as "a tier of the same routing system" is a category error.
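One way to keep routing code from picking cells that don't exist is to carry the matrix as data. The sketch below is illustrative, not an official client: the model labels are shorthand rather than real API model IDs, effort_params is a hypothetical helper, and the enabled-thinking shape used for Haiku is an assumption about the legacy request format.

```python
# Support matrix transcribed from the table above.
# Model labels are shorthand, not official API model IDs.
EFFORT_TIERS = ("max", "xhigh", "high", "medium", "low")

SUPPORTED = {
    "fable-5": set(EFFORT_TIERS),
    "opus-4.8": set(EFFORT_TIERS),
    "opus-4.7": set(EFFORT_TIERS),
    "opus-4.6": set(EFFORT_TIERS) - {"xhigh"},
    "sonnet-4.6": set(EFFORT_TIERS) - {"xhigh"},
    "haiku-4.5": set(),  # does not accept the effort parameter at all
}


def effort_params(model, effort=None, budget_tokens=None):
    """Return the request fragment that controls thinking depth for one cell."""
    if model not in SUPPORTED:
        raise ValueError(f"unknown model: {model}")
    tiers = SUPPORTED[model]
    if not tiers:
        # Haiku 4.5: legacy control through a thinking budget, no effort field.
        if effort is not None:
            raise ValueError(f"{model} does not accept output_config.effort")
        if budget_tokens is None:
            return {}
        # Assumed shape of the legacy thinking block.
        return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}
    if effort is None:
        raise ValueError(f"pin an effort tier for {model}; do not inherit one")
    if effort not in tiers:
        raise ValueError(f"{model} has no {effort} cell")
    return {"output_config": {"effort": effort}}
```

Raising on a missing effort tier is a deliberate choice here: it forces the caller to pin the second knob instead of inheriting whatever default the SDK or IDE ships.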

Because Anthropic flipped its own guidance with this release, the Fable 5 row earns a closer look. On Opus 4.8 the effort docs recommended starting at xhigh for coding and agentic work. For Fable 5 the guidance is the reverse: start at high, step up to xhigh only for the most capability-sensitive workloads, and step down for routine work, because lower effort settings on Fable 5 often exceed xhigh performance on prior models. Three more mechanics travel with the row, per the Fable 5 introduction docs. Adaptive thinking is the only thinking mode and it cannot be turned off; thinking: {type: "disabled"} is not supported. max_tokens is now a hard ceiling on thinking plus response text together, so size the budget for both. And per the models overview, Fable 5 uses the tokenizer introduced with Opus 4.7, so an Opus 4.8 to Fable 5 migration re-prices your tokens without re-inflating the count. (Mythos 5, the classifier-free sibling that shares the xhigh cell, is limited-release only and not on this matrix.)
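Because max_tokens on Fable 5 covers thinking and response text together, it helps to size the two parts explicitly. A minimal sketch, assuming a hypothetical helper name and a thinking allowance you estimate from your own traces (Anthropic publishes no per-tier figure); 128k is Fable 5's max output from the models table in this post.

```python
FABLE_5_MAX_OUTPUT = 128_000  # max output tokens, from the models table


def size_max_tokens(response_tokens, thinking_allowance,
                    model_max_output=FABLE_5_MAX_OUTPUT):
    """Return a max_tokens value that leaves room for thinking and response."""
    if response_tokens <= 0 or thinking_allowance < 0:
        raise ValueError("response_tokens must be > 0 and thinking_allowance >= 0")
    total = response_tokens + thinking_allowance
    if total > model_max_output:
        raise ValueError(
            f"{total} tokens exceeds the model's {model_max_output} output cap"
        )
    return total
```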

Four pieces of context travel with each row.

Model	Released	Context / max output	Thinking	Latency
Fable 5	2026-06-09	1M / 128k	adaptive only (always on)	not published
Opus 4.8	2026-05-28	1M / 128k	adaptive only (off by default)	Moderate
Opus 4.7	2026-04-16	1M / 128k	adaptive only (off by default)	Moderate
Opus 4.6	2026-02-05	1M / 128k	adaptive recommended; manual deprecated	Moderate
Sonnet 4.6	2026-02-17	1M / 64k	adaptive recommended	Fast
Haiku 4.5	2025-10-15	200k / 64k	manual via `budget_tokens` only	Fastest

Two implications are worth naming. First, xhigh is limited to Fable 5 and the Opus flagships (4.8 and 4.7), so a routing strategy that depends on xhigh specifically lives only on the top tier (the max tier reaches down to Sonnet 4.6, but xhigh does not). Second, there's a quieter trap I've watched cost a team a week of confusion: Claude Code defaults to different effort levels by model version (high on Opus 4.6, xhigh on Opus 4.7, back to high on Fable 5). Two engineers running "the same work" on two models are not running the same work. More on that in "Half Your Team Is on Opus 4.6, Half Is on 4.7." One last mechanic: at xhigh on Opus 4.7, adaptive thinking is the only thinking-on mode the API will accept. It is the successor to extended thinking, and Fable 5 and the Opus flagships are the options when reasoning depth is non-negotiable.

What each cell of the matrix costs

The base output rate ratio between Opus 4.7 and Haiku 4.5 is 5x, and even Fable 5 to Haiku 4.5 is only 10x, not 100x. I want to be precise about this because the "one model is two orders of magnitude cheaper than another" framing is everywhere, and it does not match current Anthropic pricing.

Chart: output token rates per million tokens for the models on this matrix. Opus 4.8 Fast Mode (not charted) shares Fable 5's $50 rate. Source: Anthropic API pricing page, fetched 2026-06-10.

That is the standard-rate picture. The full per-cell economics, including cache and batch tiers, are a five-column table.

Model	Input	Cache write (5m / 1h)	Cache read	Output
Fable 5	$10.00	$12.50 / $20.00	$1.00	$50.00
Opus 4.8	$5.00	$6.25 / $10.00	$0.50	$25.00
Opus 4.7	$5.00	$6.25 / $10.00	$0.50	$25.00
Opus 4.6	$5.00	$6.25 / $10.00	$0.50	$25.00
Opus 4.8 Fast Mode	$10.00	*	*	$50.00
Opus 4.6 / 4.7 Fast Mode	$30.00	*	*	$150.00
Sonnet 4.6	$3.00	$3.75 / $6.00	$0.30	$15.00
Haiku 4.5	$1.00	$1.25 / $2.00	$0.10	$5.00

All values $/MTok. Batch API gives a flat 50% discount on input and output for every standard cell, Fable 5 included; the Fast Mode tiers are excluded. The cache multipliers (1.25x for 5m writes, 2x for 1h writes, 0.1x for reads) apply on top of each Fast Mode base rate per Anthropic's pricing page; the asterisks above indicate "compute against the Fast Mode base." Source: Anthropic pricing page, 2026-06-10.
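The cache and batch columns follow mechanically from the base rates and the published multipliers, which makes the table easy to regenerate and check. A sketch under the assumption that the multipliers apply uniformly to every standard cell, as the note above states; the dictionary keys and the cell_rates name are illustrative.

```python
# Standard base rates in $/MTok as (input, output), copied from the table above.
BASE_RATES = {
    "fable-5": (10.00, 50.00),
    "opus-4.8": (5.00, 25.00),
    "opus-4.7": (5.00, 25.00),
    "opus-4.6": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}

# Multipliers quoted in the pricing note.
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.0
CACHE_READ = 0.1
BATCH_DISCOUNT = 0.5


def cell_rates(model):
    """Derive every priced cell for one model from its two base rates."""
    input_rate, output_rate = BASE_RATES[model]
    return {
        "input": input_rate,
        "output": output_rate,
        "cache_write_5m": round(input_rate * CACHE_WRITE_5M, 2),
        "cache_write_1h": round(input_rate * CACHE_WRITE_1H, 2),
        "cache_read": round(input_rate * CACHE_READ, 2),
        "batch_input": round(input_rate * BATCH_DISCOUNT, 2),
        "batch_output": round(output_rate * BATCH_DISCOUNT, 2),
    }
```

Calling cell_rates("haiku-4.5") yields batch rates of 0.5 and 2.5, and cell_rates("opus-4.7") reproduces the 6.25 / 10.0 / 0.5 cache cells in the table.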

Two amplifiers compound on top of the base rates. First, effort tier changes how many tokens the model consumes on a call: xhigh and max generate more thinking tokens, more tool-call tokens, longer responses. Anthropic does not publish per-tier multipliers ("a behavioral signal, not a strict token budget"), so any "this tier costs Nx that tier" claim is workload-derived, not standard. Second, Opus 4.7 introduced a new tokenizer that consumes up to 1.35x more tokens for the same input text vs. Opus 4.6 and earlier models. Cost goes up even at the same per-MTok price.
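The tokenizer amplifier is easy to underestimate because the rate card does not move. A worked example, assuming the worst case of the "up to 1.35x" figure and a hypothetical 100,000-token prompt; real inflation depends on the text.

```python
OPUS_INPUT_RATE = 5.00        # $/MTok, identical on Opus 4.6 and Opus 4.7
TOKENIZER_WORST_CASE = 1.35   # upper bound quoted above, not a typical value


def input_cost(tokens, rate_per_mtok):
    """Dollar cost of sending `tokens` input tokens at a $/MTok rate."""
    return tokens / 1_000_000 * rate_per_mtok


tokens_old = 100_000                               # hypothetical prompt size
tokens_new = tokens_old * TOKENIZER_WORST_CASE     # same text, newer tokenizer

cost_old = input_cost(tokens_old, OPUS_INPUT_RATE)
cost_new = input_cost(tokens_new, OPUS_INPUT_RATE)
increase = cost_new / cost_old - 1  # fractional increase at an unchanged rate
```

Under these assumptions the same text goes from about $0.50 to about $0.675 of input, a 35% increase with no change in the per-MTok price.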

Tip

The cheapest cell on the matrix is Haiku 4.5 at batch rates ($0.50 input / $2.50 output per MTok). The most expensive cell is Opus 4.6 / 4.7 Fast Mode ($30 input / $150 output per MTok), a 6x premium over standard Opus rates for higher output throughput. Corner to corner, $150 against $2.50 is a verifiable 60x output ratio. Going further requires unpublished assumptions about token volume per call, so the "100x cost spread" you see in routing folklore is not supportable from current Anthropic production pricing. Fast Mode is a research preview with dedicated rate limits rather than a throughput guarantee, and its pricing forked at Opus 4.8: $10 input / $50 output, 3x cheaper than on 4.6 and 4.7 and the same sticker as standard Fable 5. There is no Fast Mode for Fable 5 at all.

When the cost-spread argument breaks: prompt caching

Here is the inversion. Cached Opus 4.7 input costs $0.50 per MTok. Uncached Haiku 4.5 input costs $1.00 per MTok. Once the prefix is primed, every cache read is half Haiku's uncached rate; fold in the cache-write premium and the crossover lands at about 12 calls on a stable prefix inside the 5-minute window, or about 20 on the 1-hour tier.

Two Fable 5 footnotes on this math. Its cache read is $1.00 per MTok, exactly Haiku's uncached rate, so at the very top of the matrix caching reaches parity with Haiku rather than undercutting it; the inversion proper runs from the Opus tier down. And Fable 5 lowers the entry bar: its minimum cacheable prompt drops to 512 tokens on the Claude API (1,024 on Bedrock), against floors of 1,024 on Opus 4.8, 2,048 on Opus 4.7, and 4,096 on Opus 4.6 and Haiku 4.5. System prompts too short to cache on the older models qualify on Fable 5.
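The crossover figures can be reproduced from the rate table. A minimal sketch, assuming the first call pays the cache-write rate, every later call inside the window is a cache read, and the prefix tokenizes to the same count on both models (it ignores the Opus 4.7 tokenizer difference); breakeven_calls is an illustrative name.

```python
import math


def breakeven_calls(write_rate, read_rate, uncached_rate):
    """First call count at which the cached prefix is strictly cheaper.

    Cached cost over n calls:   write_rate + read_rate * (n - 1)
    Uncached cost over n calls: uncached_rate * n
    Rates are $/MTok for the same prefix. Returns None if caching never wins.
    """
    if read_rate >= uncached_rate:
        return None  # parity or worse: the cache-write premium is never repaid
    tie_point = (write_rate - read_rate) / (uncached_rate - read_rate)
    return max(1, math.floor(tie_point) + 1)


HAIKU_UNCACHED = 1.00  # $/MTok

opus_5m = breakeven_calls(6.25, 0.50, HAIKU_UNCACHED)    # Opus 4.7, 5-minute tier
opus_1h = breakeven_calls(10.00, 0.50, HAIKU_UNCACHED)   # Opus 4.7, 1-hour tier
fable_5m = breakeven_calls(12.50, 1.00, HAIKU_UNCACHED)  # Fable 5, 5-minute tier
```

With the Opus 4.7 rates this gives 12 calls on the 5-minute tier and 20 on the 1-hour tier (the two cost curves tie exactly at 19), and None for Fable 5, whose cache read only matches Haiku's uncached rate.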

Before

Routing to Haiku 4.5 for cheap input

$1.00 per MTok uncached input
200k context window
Capability ceiling: 73.3% SWE-bench Verified (Anthropic's eval config: 128k thinking budget)
Re-pays the input cost on every call

After

Caching on Opus 4.7 with a stable prefix

$0.50 per MTok cache read (half the price of Haiku uncached)
1M context window
Capability ceiling: 87.6% SWE-bench Verified
5m default cache window, 1h with the longer cache write tier

Capability ceilings in the card above are per Vellum's Opus 4.7 benchmark extraction and Anthropic's Haiku 4.5 announcement.

This collapses one of the strongest arguments for routing: that you can reach for cheaper models to bring down per-call cost. For cache-heavy workloads (enterprise RAG over a stable corpus, customer support with a fixed system prompt, Claude Code with a stable repository context) the cost-tier intuition flips. The right move can be "cache aggressively on a single high-quality model," not "route every call to the cheapest tier that will tolerate the task."

Where does routing still win? Four conditions.

Variable-prefix workloads. If every request has a different system prompt, caching doesn't amortize. Routing on model and effort is your only cost knob.
Output-heavy generation. Cache pricing applies to input tokens only. A 4k-token response is 5x more expensive on Opus 4.7 ($25/MTok) and 10x more expensive on Fable 5 ($50/MTok) than on Haiku 4.5 ($5/MTok); the output ratio isn't affected by caching.
Latency-bound classification and triage. Haiku is the "Fastest" tier in Anthropic's overview; Opus is "Moderate." For a real-time intent classifier on a customer-support stream, you want Haiku regardless of what the cache math says about cost.
Below ~1,000 requests/day. LogRocket's practitioner analysis is right: at roughly $300/month total LLM spend, the engineering time to build and maintain a routing layer exceeds the savings.
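The output-heavy condition comes down to where the tokens sit in the call. A rough per-call comparison, assuming a fully primed cache on Opus 4.7 and hypothetical token counts chosen only to show both regimes:

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one call; rates are $/MTok."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


OPUS_47_CACHED = {"input_rate": 0.50, "output_rate": 25.00}    # cache read, output
HAIKU_45_UNCACHED = {"input_rate": 1.00, "output_rate": 5.00}  # plain input, output

# Input-heavy: 50k-token stable prefix, 200-token answer.
opus_short = call_cost(50_000, 200, **OPUS_47_CACHED)
haiku_short = call_cost(50_000, 200, **HAIKU_45_UNCACHED)

# Output-heavy: 10k-token prefix, 4k-token response.
opus_long = call_cost(10_000, 4_000, **OPUS_47_CACHED)
haiku_long = call_cost(10_000, 4_000, **HAIKU_45_UNCACHED)
```

Under these assumed sizes, cached Opus wins the input-heavy call (about $0.030 against $0.051) and loses the output-heavy one (about $0.105 against $0.030), because the cache discount never touches the output side.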

Does Fable 5 support zero data retention? No. Route around it.

No. Claude Fable 5 and its limited-release sibling Mythos 5 are designated Covered Models, Anthropic's label for models whose capability jump triggers stricter data handling. On the Claude API that means a mandatory 30-day minimum retention; zero data retention is not available for either model. Per Anthropic's API and data retention docs, a request to either model from an organization whose retention configuration does not meet that requirement fails with a 400 invalid_request_error. Every other current model, Opus 4.8 on down, stays ZDR-eligible under existing agreements. The constraint is also Claude API-specific: on Bedrock, Vertex AI, and Microsoft Foundry, retention terms are set by the platform, not by the Covered Models policy.

For a routing layer this is a third input, and it behaves differently from the two knobs. Cost and effort trade off; retention gates. A workload bound to a ZDR agreement cannot land on Fable 5 at any price or any effort tier, so its capability ceiling is Opus 4.8 no matter what the rest of the matrix says. Encode the gate before the knobs: tag each task type with its retention class in the same place you tag its turn type, and the 400 never reaches production traffic.

Two distinct failure paths need wiring here, and they take different remedies. The retention 400 is preventable and never retryable: pre-route it, because no fallback parameter turns a policy violation into a success. Fable 5's safety-classifier refusals are the retryable case. For those, Anthropic now ships a native fallbacks parameter in beta plus SDK middleware. That fallback path deserves the same care as the capability fallback in the production architecture below; I walk through the full pattern in refusal handling and model fallback in production. Retention is not the only new routing-relevant behavior in this release either: Fable 5's safety classifiers produce loud refusals in some cases and quietly degraded output in others, which I unpack in the silent degradation post.
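Encoding the gate before the knobs can be as small as a filter that runs ahead of cell selection. A sketch, assuming hypothetical retention-class tags ("zdr" and "standard"), shorthand model labels rather than real API model IDs, and a best-first preference list per task type:

```python
# Covered Models per the text: no zero data retention on the Claude API.
COVERED_MODELS = frozenset({"fable-5", "mythos-5"})


class NoEligibleModel(Exception):
    """Raised when the retention gate leaves nothing to route to."""


def eligible_models(preference, retention_class):
    """Drop Covered Models for ZDR-bound workloads, keeping best-first order."""
    if retention_class == "zdr":
        return [m for m in preference if m not in COVERED_MODELS]
    return list(preference)


def pick_model(preference, retention_class):
    """Apply the retention gate first, then take the best remaining model."""
    allowed = eligible_models(preference, retention_class)
    if not allowed:
        raise NoEligibleModel(f"no model satisfies retention={retention_class!r}")
    return allowed[0]


PREFERENCE = ["fable-5", "opus-4.8", "sonnet-4.6"]  # hypothetical ordering
```

Here pick_model(PREFERENCE, "zdr") lands on "opus-4.8" without ever issuing the request that would have returned the 400. The filter models the Claude API constraint only; on Bedrock, Vertex AI, or Microsoft Foundry the retention class would come from the platform's terms.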

How to route in production: task-typing plus per-tier budgets

The architecture I run, and the one I recommend to teams I work with, has three layers.

Classify, don't guess. Every request enters the routing layer with a task type attached. Inside the Ready Solutions Assessment Worker, a Haiku 4.5 call classifies the incoming session into one of four audience archetypes (technical builder, business operator, creative operator, product strategist), and a deterministic routing rule in the Worker tags each request with one of three turn types (continuation, normal, override) straight from the request shape, no model call needed. The archetype classification runs on Haiku because the work is latency-bound triage, and Haiku at the 200-token-prompt scale runs at fractions of a cent per call.

Pick the cell, pin both knobs. Once a task type is classified, the routing layer chooses a (model, effort) pair, never a model alone. Here is the Assessment Worker's production routing.

Turn 0: Classify (Haiku 4.5)

Haiku call detects the audience archetype (technical builder, business operator, creative operator, product strategist); the Worker tags the turn type (continuation, normal, override) deterministically from the request shape. Latency-bound triage at sub-cent cost per call.

Turns 1 to 19: Conversational (Sonnet 4.6 + adaptive thinking)

Mid-tier model with adaptive thinking handles dimension scoring, follow-ups, and steering. Sonnet because the conversation is reasoning-bound but volume is high.

Final turn: Synthesis (Opus 4.7 + xhigh + summarized thinking)

Fires when the assessment is ready, or at the turn-20 hard stop. Flagship at maximum coding-class effort builds the final assessment, max_tokens=96000. Opus + xhigh because the trace is long, the failure cost is high, and this is the only call where it lives.

Fallback path: any 5xx/429/400 (Sonnet 4.6 with scrubbed config)

Drops xhigh to high, drops summarized display, drops 96k to 32k. The synthesis still ships.
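The four cards above reduce to a small lookup plus a scrubbed fallback. This is a sketch of that shape, not the Assessment Worker's code: the Cell type, the function names, the shorthand model labels, the max_tokens values for the classify and conversational turns, and the effort tier on the conversational turns are all assumptions; only the synthesis and fallback settings come from the cards.

```python
from dataclasses import dataclass
from typing import Optional

HARD_STOP_TURN = 20


@dataclass(frozen=True)
class Cell:
    model: str
    effort: Optional[str]            # None where the model takes no effort tier
    max_tokens: int
    thinking_display: Optional[str] = None


CLASSIFY = Cell("haiku-4.5", None, 256)        # max_tokens is a placeholder
CONVERSE = Cell("sonnet-4.6", "high", 4_096)   # effort and max_tokens assumed
SYNTHESIS = Cell("opus-4.7", "xhigh", 96_000, "summarized")
FALLBACK = Cell("sonnet-4.6", "high", 32_000)  # scrubbed config from the card


def choose_cell(turn, assessment_ready):
    """Map a turn to its (model, effort) cell."""
    if turn == 0:
        return CLASSIFY
    if assessment_ready or turn >= HARD_STOP_TURN:
        return SYNTHESIS
    return CONVERSE


def should_fall_back(status_code):
    """True for the statuses the fallback card lists: any 5xx, 429, or 400."""
    return status_code in (400, 429) or 500 <= status_code <= 599
```

Every cell carries its effort tier alongside the model string, so nothing downstream can inherit a default by accident.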

That same split shows up at the orchestration level in the AI Persona Profiler: ten or more coordinated Claude Opus instances per pipeline run, with the high-volume stages kept separate from the synthesis stages where coherence matters across long traces. The model and effort choices are explicit at every node, not inherited from a default.

Budget per tier. Each (model, effort) cell gets a monthly cost ceiling. If the Opus 4.7 + xhigh tier is meant to handle 5% of traffic, its budget reflects 5% of expected volume. When the threshold trips, the routing layer either falls back to the next-cheaper cell or surfaces an alert. This is the difference between "we have a routing strategy" and "we have a routing strategy that survives a 10x traffic spike."
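A per-cell ceiling with a fallback chain fits in a few dozen lines. A minimal in-memory sketch, assuming hypothetical dollar ceilings and shorthand cell labels; a production version would persist spend, reset it monthly, and emit the alert instead of only raising.

```python
class BudgetExhausted(Exception):
    """Raised when every cell in the fallback chain is at or over its ceiling."""


class TierBudgets:
    def __init__(self, ceilings, fallbacks):
        self.ceilings = dict(ceilings)     # cell -> monthly ceiling in dollars
        self.fallbacks = dict(fallbacks)   # cell -> next-cheaper cell
        self.spent = {cell: 0.0 for cell in self.ceilings}

    def record(self, cell, dollars):
        """Add the cost of a completed call to the cell's running total."""
        self.spent[cell] = self.spent.get(cell, 0.0) + dollars

    def resolve(self, cell):
        """Return the requested cell, or the first fallback with headroom."""
        seen = set()
        while cell is not None and cell not in seen:
            seen.add(cell)
            ceiling = self.ceilings.get(cell, float("inf"))
            if self.spent.get(cell, 0.0) < ceiling:
                return cell
            cell = self.fallbacks.get(cell)
        raise BudgetExhausted("no cell in the chain has budget left")


OPUS_XHIGH = ("opus-4.7", "xhigh")
SONNET_HIGH = ("sonnet-4.6", "high")

budgets = TierBudgets(
    ceilings={OPUS_XHIGH: 500.0, SONNET_HIGH: 2_000.0},  # hypothetical figures
    fallbacks={OPUS_XHIGH: SONNET_HIGH},
)
```

The routing layer would call budgets.resolve(cell) before each request and budgets.record(cell, cost) after it; once the Opus + xhigh ceiling trips, resolve starts returning the Sonnet cell instead.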

Caution

Six routing failure modes I have watched cost teams measurable money, plus a seventh that is new with Fable 5.

All-Opus-everything as default. Pays flagship rates for work Sonnet at medium handles cleanly.
Opus + low effort. Paying for the highest capability ceiling, then telling Claude not to use it. Use Sonnet at high instead.
Haiku for hard reasoning. Capability ceiling is the wrong shape. The work loops or fails silently.
Sonnet without adaptive thinking on hard tasks. Cheap mode applied to expensive work, with no signal that thinking would have helped.
Opus 4.6 Fast Mode by default. Six times the standard rate for higher output throughput, used in places where standard latency was already fine.
Model set without effort pinned. The SDK or the IDE picks an effort tier you did not choose. Same model name, different bill.
A ZDR-bound workload routed to Fable 5. The call fails with a 400 before any tokens flow. Retention eligibility belongs in the routing layer, not in the incident review.

The most expensive of those, in my experience, is number six: setting the model without pinning effort. Claude Code's defaults differ across Opus versions. Every internal SDK wrapper I've audited (across consulting clients and my own day job) stores the model string in config but inherits the effort tier from whatever the SDK ships. A team running "Opus 4.7 for the hard stuff" is silently running it at xhigh on Claude Code (the costliest tier short of max) and at high on direct API calls (one tier lower). Same model name. Different bill. Different latency. Different bench number. Claude Fable 5 made that trap dearer the day it launched: the vendor's own effort guidance flipped from xhigh to high on the new model, so an inherited xhigh default keeps burning xhigh-scale token volume at Fable's doubled output rate ($50 versus $25 per MTok). That is spend the docs now tell you to opt into deliberately rather than inherit. Fable 5 is also the first cell on the matrix with a compliance gate in front of it. If the workload is ZDR-bound, that cell doesn't exist for you.

Pick the cell, pin the cell

Routing is a two-knob decision with a retention gate in front of it. Check the gate, then set both knobs intentionally. That is the architecture.

If your team is wiring or rewiring its Claude routing layer, the work I do as part of Claude API Development and Support covers exactly this: auditing where each request lands on the matrix, designing the classifier, and pinning the per-tier budgets. Schedule a 15-minute call and we will walk through your current routing decisions together.

For deeper reading on the same axes: the Opus 4.7 vs 4.6 effort ladder walks through where xhigh is worth the tokens, Claude Opus 4.7 Is a Split, Not an Upgrade is the predecessor argument this post extends, and What 20-Turn Conversations Taught Me About the Claude API is where the turn-typed routing pattern came from.

