Extended thinking

A Claude capability that allocates the model a token budget for an internal reasoning phase before its final answer, trading speed and cost for deeper decomposition of complex tasks.

How it works

The capability gives the model a separate token budget to reason internally before it commits to a final answer; those thinking tokens count against billing but are not returned to the caller. A larger budget buys deeper decomposition of a hard problem and a smaller one returns faster, so the knob trades latency and cost against reasoning depth.

Why it matters

The budget is a soft cap, not a quality floor: spending more does not guarantee a proportionally better answer, and for tasks the model handles correctly on reflex, the extra latency and cost buy nothing measurable. Tuning is empirical with no model-card formula, so knowing when to spend the thinking budget is its own skill that competes with prompt engineering for the workflow owner's attention.

In practice

A routine edit runs with a small thinking budget for a fast answer; a multi-file refactor is given a larger budget so the model can work through the dependencies before responding. The caller never sees the deliberation tokens but pays for them on both calls.

Practical considerations

On Claude Opus 4.7 and later the thinking request shape changed: the older budget_tokens parameter returns HTTP 400, adaptive thinking is the supported thinking mode, and depth is steered by an effort parameter rather than a token budget. On Opus 4.7 and 4.8 adaptive thinking is opt-in, so a request that previously ran with extended thinking under the older shape runs with no thinking at all unless adaptive thinking is set explicitly; on the Fable-class generation that followed, adaptive thinking is always on and cannot be disabled, so the migration question inverts from remembering to enable thinking to budgeting for thinking the model may now choose to do on any request. With adaptive thinking, max_tokens covers thinking plus response text together, so a ceiling sized for response text alone can end a call mid-reasoning. Interleaved thinking is automatic with adaptive thinking and lets the model reason between tool calls and after tool results rather than during a thinking block itself, which changes how tool-heavy workflows are structured. The response display defaults to omitting the thinking content from the response body while still charging for the full thinking tokens generated, so the billing line and the visible-output line decouple by default. The current-generation control surface, model-decided depth steered by effort, is treated in depth by the adaptive thinking term.

Related standards and prior art

Anthropic: extended thinking · continuously updated allocates a thinking token budget before the final response

Defined by Ready Solutions AI

How it works

Why it matters

In practice

Practical considerations

Related standards and prior art

Related terms

Appears in