Adaptive thinking

Adaptive thinking is a Claude thinking mode in which the model decides for itself whether and how deeply to reason before answering each request, steered by a coarse effort setting rather than a caller-set thinking-token budget, with the thinking it chooses to do billed whether or not it is displayed.

How it works

Adaptive thinking is a request mode in which the model evaluates each request and decides whether to think at all and how deeply, instead of working against the older control, a fixed reasoning-token budget the caller reserved per request in advance. The caller steers that decision with a coarse effort setting, a small ladder of named levels on the request: at higher effort the model reasons on almost every request, at lower effort it skips thinking on work it judges simple. A single max_tokens ceiling covers thinking plus response text together, so the deliberation and the answer share one budget. With adaptive thinking the model can also reason between tool calls and after tool results, not only in an up-front block, which suits agentic work where the hard decision arrives mid-task. The raw chain of thought is not returned on the newest models; display defaults to omitting it, a summary can be requested, and billing covers the full thinking trace either way. On the Fable-class model generation the mode is always on, and a request that tries to disable thinking is rejected.

Why it matters

Handing the depth decision to the model changes what a request costs from a deterministic function of the payload into a behavior: the same call can bill differently depending on what the model decides it needs, which is why vendor migration guidance says to re-baseline cost on real traffic rather than project from rate cards. The case for the trade is that on mixed workloads where difficulty is hard to predict, the model's per-request judgment is better calibrated than a static budget a caller picks once, so simple work stops overpaying for reflection it never needed and hard work stops being starved. What the caller gives up is per-request determinism, and the practical casualties are tight ceilings: a max_tokens that comfortably fit response text alone can now end a call mid-reasoning. Which requests trigger thinking is not part of the contract, and it can shift between model versions, so an assumption baked in during one generation deserves re-validation on the next. The deeper discipline, deciding when reasoning is worth spending for at all, stays with the caller through the effort lever; adaptive thinking moves the fine-grained decision, not the accountability for spend.

In practice

A workload migrates to a model generation where adaptive thinking is always on. The requests are unchanged and the outputs look the same, but spend rises, because calls that previously ran thought-free now reason whenever the model judges it worthwhile. The team finds it not in error logs but on the usage report's thinking-token line, then steps effort down on the routes where quality holds. The fix was a routing decision, not a prompt edit, which is the shape of cost work under adaptive thinking.

Practical considerations

Treat effort as the primary tuning lever: start at the vendor's recommended default for the model generation and step down while quality holds, since the recommended starting point itself differs by generation. Size max_tokens for thinking plus response on adaptive models; a ceiling carried over from a response-only era surfaces as calls ending early with a max-tokens stop. The older budget-token request shape returns an error on Claude Opus 4.7 and every generation since, a contract break worth scanning for during migration rather than discovering in production. Display and billing decouple by default: the response can omit thinking content while the bill covers the full trace, so usage accounting, not the visible output, is the ground truth for spend. On models where the mode cannot be disabled, a workload that depended on thinking-free calls re-baselines instead of opting out. The response's usage accounting reports thinking tokens as a separately itemized share of the billed output total, which is what makes that re-baseline measurable.

Related standards and prior art

Anthropic: adaptive thinking · continuously updated defines the adaptive thinking mode, the effort setting that steers it, and max_tokens as the single ceiling over thinking plus response
Anthropic: introducing Claude Fable 5 and Claude Mythos 5 · 2026-06-09 documents the generation on which adaptive thinking is always on and a request disabling thinking is rejected
OpenAI: reasoning models · continuously updated cross-vendor parallel: a developer-set reasoning effort control and billed-but-not-returned reasoning tokens under different vocabulary
Google: Gemini thinking · continuously updated cross-vendor parallel: thinking levels and a dynamic budget setting under which the model adjusts reasoning depth itself

Defined by Ready Solutions AI

How it works

Why it matters

In practice

Practical considerations

Related standards and prior art

Related terms

Appears in