Prompt caching

A Claude API feature that stores a repeated context prefix server-side so later requests reuse it at reduced cost and latency, lowering the price of repetitive agentic workflows.

How it works

When a request marks a prefix as cacheable, the API stores it server-side and reuses it on later requests that send the same prefix byte for byte. The cache key is the prefix shape: a single whitespace change, a reordered instruction, or an added comment invalidates the prefix entirely rather than partially, and the next call pays full freight on the whole head again.

Why it matters

Agentic workflows resend a large, mostly fixed context on every turn, so caching the repeated prefix removes most of the per-call cost. The trade-offs are immutability-shaped: the default TTL expires fast enough that a paused session pays full freight on resume, near-identical prompt variants each get their own cache key so each shape pays its own cache-miss cost, and refactoring instructions is expensive because prompt stability and prompt iteration now compete.

In practice

A workflow that prepends the same instructions and reference material to every call benefits from caching exactly as long as the prefix stays byte-identical; edit one instruction mid-session and the next turn pays full freight on the entire reference block again.

Practical considerations

TTL has two documented variants: the 5-minute default refreshes for free on each cache read, and a 1-hour option carries a higher write-time price; long-running agentic loops resumed past the short window pay full freight on the resume, so the choice between them is a workflow-shape decision. A request can mark up to four explicit cache breakpoints, used to cache sections that change at different frequencies and to keep cached prefixes recoverable when a growing conversation pushes the last breakpoint past the cache-hit lookback window. Cache-key sensitivity is byte-level: whitespace, instruction ordering, system-message construction, and tool definitions all participate in the hash, and any change at a layer cascades through the documented hierarchy (tools, then system, then messages). Billing observability lives in two response fields, cache_creation_input_tokens for tokens written and cache_read_input_tokens for tokens reused; without watching both, an author can believe caching is hitting when it is not. A minimum cacheable prefix size threshold varies by model and silently skips the cache below it, no error returned, so a prompt under the threshold simply runs uncached.

Related standards and prior art

Anthropic: prompt caching · continuously updated caches a prompt prefix for reuse on later requests

Defined by Ready Solutions AI

How it works

Why it matters

In practice

Practical considerations

Related standards and prior art

Related terms

Appears in