What is the cheapest way to cut agent token costs without losing quality?

Work the levers in order of risk, not size. Effort tuning and cache discipline change spend without changing which model reasons about your task; context pruning removes tool-result bytes that measured studies show are largely removable with no performance loss; model downgrades and fan-out reductions change capability and should come last, validated against your own task suite rather than a benchmark average.

Is prompt caching worth it for agents?

When a large, stable prefix is reused before the cache expires, yes, and the arithmetic is short. A five-minute cache write costs a 25 percent premium over base input and pays for itself on the first read, since reads bill at 10 percent of the base input rate. A one-hour write costs double base input and pays off on the second read. The catch is immutability: a cache hit requires a byte-identical prefix, so the real engineering work is keeping tools, system prompt, and shared context stable rather than sprinkling cache markers on a prompt that churns.

How do I measure AI ROI for coding agents?

Attribute spend first, then attach it to outcomes. The telemetry your tools emit is explicitly an estimate; the metered record is the vendor's usage and cost reporting, reconciled per workspace, model, and API key. Once spend is attributed, measure cost per resolved task or per shipped change rather than cost per token, and treat perception as untrustworthy: the best-known randomized trial found experienced developers were 19 percent slower with AI assistance while believing they were about 20 percent faster, a 2025 result METR has since labeled outdated as it redesigns the study, though the perception-gap lesson stands.

Cornerstone Guide

Cost-Aware Agentic Design in Production: The Bill Is a Design Readout, Not a Meter

Q: How much more does an agentic workload cost than a chat workload?

Anthropic's own engineering data puts a single agent at about four times the tokens of a chat interaction and a multi-agent system at about 15 times, and its Claude Code documentation adds that agent teams run about seven times the tokens of a standard session. The driver is structural: every tool result is re-sent with every subsequent request, so a long loop accumulates context quadratically rather than linearly, and each subagent runs its own full context window.

Token spend in an agentic system is set by five design choices: model tier, effort, cache discipline, context shape, and fan-out width. This guide treats cost as an architecture property, covering the June 2026 pricing mechanics, the lever order that preserves quality, the attribution stack that separates telemetry estimates from the metered invoice, and the ROI arithmetic that only works once spend is attributed to outcomes.

Last reviewed June 11, 2026

Agentic cost control Model effort matrix Prompt caching Adaptive thinking Extended thinking Subagent orchestration Claude Code subagent Agent observability AI ROI measurement Claude Code Context window Context engineering Zero data retention

This spring Anthropic roughly doubled its published estimate of what a Claude Code developer costs on an active day, from about $6 to about $13, and said there was no pricing or product change behind the move. The per-token rates were identical before and after. What changed was the usage underneath: the live cost page now reports an average of about $13 per developer per active day and $150 to $250 per developer per month, because real sessions had grown longer, deeper in tool calls, and wider in agent fan-out than the old estimate assumed. That is this guide's whole argument compressed into one billing footnote. In agentic systems, cost is a design property. The meter only reads it back.

I run a small production demonstration of this every day. The AI readiness assessment on this site is a roughly 20-turn agent behind a Cloudflare Worker, and its cost shape was fixed before the first visitor arrived: a Haiku-tier model classifies the opening turn, a Sonnet-tier model carries the conversation, and an Opus-tier model at xhigh effort runs exactly one call, the final synthesis, against a system prefix that sits on a one-hour prompt cache. Nothing about that design is exotic. It's the same five levers every agentic system has, whether anyone is holding them or not: model tier, effort, cache discipline, context shape, and fan-out width. Set them deliberately and cost-per-task is a number you chose. Leave them at defaults and it's a number that happens to you.

One definition before the anatomy, because the phrase invites a misreading. Cost-aware design is not token frugality. It is two disciplines: attribution, knowing what each task costs and which design choice made it cost that, and deliberate lever-setting, choosing model, effort, cache, context, and fan-out on purpose. A team that has both can rationally decide to spend more, and often should. What no team can do without them is claim ROI, because ROI is a ratio and unattributed spend has no denominator. I covered the session-level version of this argument in the token-spend post and the procurement-level version in the true-cost TCO piece; this guide is the system-design layer underneath both.

Why does an agent cost so much more than a chat?

Because an agent re-reads its own history on every step, and billing follows the re-reading. Each turn of an agent loop sends the full accumulated context, system prompt, tool definitions, every prior tool result, back through the API as input tokens. Anthropic's multi-agent engineering write-up measured the consequence: agents typically use about four times the tokens of a chat interaction, and multi-agent systems about 15 times, with token usage alone explaining 80 percent of the performance variance on their research tasks. The numbers date from June 2025 and remain the canonical vendor measurement; nothing fresher has replaced them, which is itself worth knowing when you see them quoted as current.

The accumulation is not linear, and this is the part per-step estimates miss. Because step N re-sends the output of every step before it, a loop's cumulative input grows roughly with the square of its length. Augment Code's loop-cost analysis works the arithmetic: a 20-step loop generating about a thousand tokens per step produces around 210,000 cumulative input tokens, an order of magnitude more than the 20,000 a naive estimate suggests. The same analysis measured where the bytes live: in one traced run, roughly 30,000 of 48,000 total tokens were tool results, and about 40 to 60 percent of those were removable with no performance loss. Most of an agent's bill is the agent re-reading things it no longer needs. The profile does vary by workload class: a tool-heavy loop re-feeding large results accumulates far faster than a short reasoning-only chain, which is why your own traces, not anyone's published multiplier, should size your budget.

Fan-out multiplies all of it, because each subagent carries its own full context window. Anthropic's Claude Code cost documentation puts agent teams at about seven times the tokens of a standard session, with teammates running in plan mode, for exactly this reason. On top of the loop mechanics, adaptive thinking bills its reasoning between tool calls as output tokens whether or not you ever see the thinking text, and every tool definition rides along as input on every request. None of these mechanics is a defect. Re-reading context is what makes an agent coherent across steps; isolation per worker is what makes orchestration reliable, a trade-off I map in the subagent orchestration guide. The point is narrower: some of these multipliers are provider mechanics you steer in degree rather than kind, but how often each one fires, how much context it carries, and whether the reliability it buys is worth the spend are design choices, and that's leverage enough.

What does the pricing surface charge for in June 2026?

The rates themselves are the easy part. Anthropic's pricing page currently lists, per million tokens: Haiku 4.5 at $1 in and $5 out, Sonnet 4.6 at $3 in and $15 out, Opus 4.8 at $5 in and $25 out, and Claude Fable 5 at $10 in and $50 out. Two structural facts matter more than any single number. Output costs five times input on every current tier, which is why thinking-heavy and verbose-output workloads dominate bills that input math says should be small. And the spread from the cheapest current model to the most capable one is ten to one on both input and output, which is the entire economic case for routing different work to different tiers.

The mechanics around those rates are where design leverage lives. Cache writes carry a 25 percent premium over base input on the five-minute tier and a 100 percent premium on the one-hour tier, while cache reads bill at 10 percent of base input. That makes the five-minute cache profitable after a single read and the one-hour cache after two, and it makes cache reads effectively invisible to rate limits, since the rate-limit documentation excludes cache reads from input-token-per-minute accounting on current models. The Message Batches API halves both input and output for work that can tolerate asynchronous completion, and the discounts stack: a cached read inside a batch job bills at five percent of base input. Thinking is billed as output at full rate regardless of whether the response displays it, per the adaptive thinking documentation, and on the newest models, Claude Fable 5 first among them, thinking cannot be disabled, only steered with the effort parameter.

Two quieter line items deserve a place in any cost model. The tokenizer that arrived with Opus 4.7 and carried forward since can produce up to 1.35x as many tokens for the same text as earlier models, so a cost baseline built before that change can understate current per-request cost for affected text, even at identical rates; I walked through the practical impact in the effort-ladder post. And on current-generation models (Opus 4.6, Sonnet 4.6, and later), regional inference adds a 10 percent surcharge across every token category when you pin US-only serving, which matters for exactly the regulated workloads that also carry zero-data-retention constraints. A routing layer that respects the model-by-effort matrix should encode retention and residency alongside price, as I argued in the model-effort matrix post. All of these numbers live on continuously maintained vendor pages and change without ceremony; this page is verified against them rather than against a snapshot, and the sources note says when. The reason to internalize the mechanics rather than bookmark them is the thesis itself: the ten-to-one tier spread, tenth-rate cache reads, and half-rate batches are the prices attached to the five levers, which makes the next question the practical one. In what order do you pull them?

Which lever do you pull first: effort, model, cache, context, or fan-out?

Anthropic's model-selection guidance names the ordering principle directly:

Tuning effort is often a better lever than switching models.

I order the five levers by risk to output quality, cheapest risk first. Treat the order as a default for tool-heavy agent loops, not a law: a high-volume classification service tunes model tier first, a runaway parallel pipeline reins in fan-out first, and a session whose prefix churns may never reach cache breakeven at all. Effort comes first because it changes how much the model deliberates without changing which model deliberates. The effort documentation is unusually direct about the top end: on most workloads, max effort adds significant cost for relatively small quality gains, and lower effort levels also act differently in agent loops, combining operations into fewer tool calls and proceeding to action with less preamble. Fewer tool calls means fewer re-fed tool results, so effort tuning compounds through the loop mechanics from the first section in a way single-call analysis never shows.

Cache discipline is second, and it's a contract rather than a feature flag. A cache hit requires a byte-identical prefix, so the engineering work is immutability: stable tool definitions, a stable system prompt, dynamic content pushed to the end of the request. When a workload's prefix genuinely must change mid-session, the honest answer is to cache the layers that hold still and re-check whether the breakeven still clears. Done well, the payoff is measured and large; a peer-reviewed evaluation of prompt caching for long-horizon agent tasks found API cost reductions of 41 to 80 percent across three providers, alongside the caveats that naive full-context caching can backfire and that short, shapeshifting sessions see little benefit. My assessment worker keeps its system prompt, archetype module, and tool block on a one-hour cache precisely because those bytes never change between visitors; that one decision is the difference between re-paying for the full prefix twenty times per assessment and paying 10 percent of base input for it nineteen times out of twenty.

Context shape is third. The measured runs above say 40 to 60 percent of tool-result tokens carry nothing the task still needs, so pruning, summarizing, and compacting are pure savings until they remove a fact the task still needs for tool choice, correctness, or auditability. The discipline of knowing where that line sits, and proving a compacted trace still passes your task-class checks, is context engineering, which has its own guide.

Model tier is fourth, not first, because it's the lever that changes capability. The vendor's published matrix points Haiku-tier models at sub-agent tasks and real-time work while reserving Opus-tier and Fable-tier models for long-horizon, high-autonomy work, and that asymmetry, cheap workers under a capable top tier, is the architecture my own assessment worker runs: Haiku classifies the first turn, Sonnet carries nineteen conversational turns, and one Opus call at xhigh effort synthesizes the result. And be more skeptical of automated routing than its literature suggests. A January 2026 meta-benchmark across 400,000 instances and 33 models found that many routing methods, commercial offerings included, fail to reliably outperform a simple baseline, with a persistent gap to optimal routing driven by recall failures: the router not knowing what the stronger model would have done.

Fan-out width comes last because it is the most expensive lever and the easiest to pull reflexively. The multiplier evidence says a multi-agent shape starts around 15 times chat-level spend before it has done anything a single loop could not, so the honest sequencing is the one I laid out in the orchestration-patterns post: reach for fan-out when a single flat loop has measurably failed, not when the architecture diagram looks better with more boxes. Parallel dispatch buys wall-clock speed, roughly threefold in my own banked probes, and multiplies token spend in the same stroke. Whether that trade pays depends entirely on whether the task values latency over budget, which is a product decision, not a model one.

How do you attribute spend you cannot see?

Everything above sets cost; this section is about knowing what it was, and the trap is subtle enough that Anthropic's own field naming warns about it. The cost figures your tooling shows you in-session are estimates. Claude Code's monitoring documentation labels the per-request cost field in its telemetry events an estimated cost in USD, and nothing in the telemetry pipeline reads your invoice back from the billing system. The metered record lives one layer down: the Usage and Cost Admin API reports token counts at minute, hour, or day granularity, filterable and groupable by model, workspace, API key, and service tier, plus a daily cost report in actual dollars. Two access notes before you plan around it: the endpoints need an organization admin key, not a standard API key, and they meter the Claude API directly, so a deployment running through a cloud marketplace reconciles against the cloud provider's bill instead. Telemetry is for fast relative signal, the cost report is for accounting, and a monthly reconciliation between the two is the cheapest audit you will ever run. The mechanics fit in two sentences: pull the daily cost report grouped by workspace (model arrives as a parsed field on the report's description rows, not as a direct grouping), then compare it against your telemetry's cost totals summed over the same window. After normalizing for credits, settlement lag, and reporting-window boundaries, whatever divergence remains is where your attribution story has a hole. I treat this split as one instance of a general rule I wrote up in the observability post: an agent's account of what it did, including what it spent, is a claim, and observability means instrumenting the layer that cannot be wrong about itself.

The attribution half is less mature than the measurement half. The OpenTelemetry GenAI semantic conventions define the token-usage attributes a cross-vendor pipeline wants, input, output, cache-read, cache-creation, and reasoning token counts per span, but the conventions remain in development stability as of June 2026 and define no cost attributes at all: the standard carries token counts and leaves dollar attribution to you. Claude Code's telemetry goes further than the standard, tagging each cost and token event with the session, user, model, and a query-source attribute that distinguishes the main loop from subagent and auxiliary calls, which is exactly the dimension you need to learn that a third of your spend is one subagent re-reading a large file. For the organizational layer, the FinOps Foundation's AI cost guidance is the right vocabulary: unit economics framed as cost per business outcome rather than cost per token, showback before chargeback (report each team's spend before you bill them for it), and an honest acknowledgment that attributing one model's output across multiple consuming agents is a problem the discipline has not standardized yet.

What I hold as the working standard: every agentic system in production should be able to answer, within a day, what does one completed task cost, broken down by model tier and by main-loop versus fan-out spend. Not because the number must be small, but because every lever in the previous section is invisible without it. The teams in the worst position this year are not the ones spending the most; they are the ones who cannot say which workload class the spend belongs to.

If token prices fall 50-fold a year, why design for cost at all?

This is the strongest objection to everything above, and it deserves its full weight. Epoch AI's price-trend measurement found LLM inference prices for constant capability falling at a median of about 50-fold per year, accelerating to roughly 200-fold per year after early 2024. On that curve, engineering effort spent shaving token spend is effort spent optimizing a depreciating input, while the routing logic, cache plumbing, and pruning heuristics you built to do it stay in your codebase as standing complexity. Nvidia's Jensen Huang gave the position its sharpest form in March 2026, telling engineers they should consume tokens worth half their salary: tokens as capital input, not cost to minimize. And the productivity data gives the position real teeth. In TechCrunch's June 2026 reporting on enterprise token spend, workforce-analytics firm Jellyfish found the heaviest token users were about twice as productive as their peers while spending ten times the tokens. If the marginal token kept paying at ten-times consumption, frugality was the expensive choice.

Here is why I design for cost anyway, and the first reason is in the same reporting. Consumption is outrunning the price curve. The TechCrunch investigation documents per-developer token consumption rising roughly 18.6-fold in nine months, Uber exhausting its 2026 AI coding budget by April, and Microsoft revoking its developers' Claude Code seats months after enabling them. Anthropic doubling its own per-day estimate inside a quarter is the same signal from the vendor side. In the enterprise coding rollouts that reporting covers, falling unit prices have not produced falling bills; they have produced more ambitious architectures, longer sessions, and wider fan-outs, the multipliers from the first section compounding faster than the rates decay. A fixed-volume workload can ride the price curve down; an agent fleet whose ambitions grow with the curve does not. Optimizing a depreciating input would be irrational. Attributing an exploding one is not.

The second reason is that the objection attacks a position this guide does not hold. Cost-aware design here is attribution plus deliberate lever-setting, and the attribution half appreciates as spend grows, regardless of where unit prices go. The Jellyfish finding cuts both ways: the spend of those twice-as-productive engineers is only defensible because someone could attribute it and measure the doubling. Even the levers mostly survive the objection, because effort defaults, cache-stable prefixes, and pruned tool results are hygiene decisions made once at design time, not standing router infrastructure that needs a maintenance rota. The piece of the objection I accept is the threshold: below material spend, building bespoke routing machinery is premature optimization, and the cost-controlled-evaluation literature makes the deeper version of the point, that accuracy claims without a cost axis are unfalsifiable, since repeated sampling can buy accuracy almost without bound. Design hygiene from day one, attribution as soon as spend is real, routers only when the attributed numbers say so. That ordering survives every version of the price curve I have seen.

What does ROI measurement look like once the denominator is real?

Attribution gives you cost per task; ROI needs the other half, what the task was worth, and the measurement evidence says intuition is not a substitute. METR's randomized controlled trial remains the sharpest result in the literature: 16 experienced open-source developers across 246 real tasks on their own repositories took 19 percent longer with AI assistance, while estimating afterward that they had been about 20 percent faster. The perception gap is the finding. Developers cannot feel their own productivity delta at the magnitudes in play, which means a rollout measured by satisfaction surveys is measuring sentiment, not return. METR's February 2026 follow-up complicates the headline honestly. The late-2025 re-run still pointed at a slowdown for the original cohort, with a confidence interval that now crosses zero, and found new recruits near break-even. METR is redesigning the experiment because developers who benefit most from AI increasingly decline to participate without it, a selection effect it says biases the estimates downward, and it now labels the early-2025 result outdated. The right reading is not that AI slows everyone down; it's that the error bars on unmeasured productivity claims are wider than the claims. That also closes the steel-man's open thread from the previous section: the Jellyfish heavy spenders' doubled output is only knowable, and only defensible as an observed association, because someone attributed both sides of the ratio; turning it into a causal ROI claim still takes a controlled comparison. Without even the measurement, "the marginal token keeps paying" is a feeling, and METR is the canonical demonstration of how far off that feeling runs.

So measure outcomes against attributed cost, and keep both sides honest. The numerator should be outcome-shaped: tasks resolved, changes shipped and retained, review load carried, the rollout-health signals I detailed in the month-three metrics post. The denominator is the attributed cost-per-task from the previous section, reconciled against the metered invoice rather than the telemetry estimate. Activity metrics, lines generated, suggestions accepted, sessions run, belong in neither position; the productivity-paradox data shows usage rising several times faster than delivery throughput, which is exactly what you would expect when activity is mistaken for output. Cost-per-task needs a denominator discipline of its own, because tasks vary wildly in size: hold comparisons within a task class, or track a rolling distribution rather than a single average, and accept wide error bars early. Attribution is the floor, not the ceiling. A team that can say one resolved support ticket costs 30 cents of attributed model spend, down from 45 last quarter, with resolution quality flat, is making an ROI claim that survives an auditor. A team that says the engineers love it and the bill seems fine is not making a claim at all. This measurement posture is the same one I bring to team adoption, where the unit of success is changed delivery behavior rather than activated seats.

How does this guide stay current?

A guide arguing that cost is a design property has to treat its own numbers the same way: as inputs that drift. The sources under this page move on two clocks and I treat them differently. The pricing and mechanics layer, per-model rates, cache multipliers, batch discounts, thinking billing, telemetry fields, and the Admin API surface, lives on continuously maintained vendor documentation roots, and I verify this page against the live pages rather than against snapshots; when a rate or mechanic changes, the body changes. The evidence layer is dated by nature: the agent token multipliers are a June 2025 vendor measurement nothing has superseded, the price-decline curve is Epoch AI's March 2025 analysis, the enterprise-spend reporting is a June 2026 snapshot, and the METR results carry their own 2025 and 2026 dates in the text. Those are framed as what the record showed when it was written, and the sources sibling carries the search trail for the rows that exceed the standing freshness caps, including why nothing fresher replaced them. When the measured picture changes materially, the body text changes with it, because a cost argument standing on stale rates is the same failure this guide exists to prevent: a number nobody chose.

The bill is the readout

An agent's cost-per-task is decided where its architecture is decided: which tier reasons, how hard it thinks, what stays cached, what gets re-read, and how wide the work fans out. The pricing surface rewards teams that treat those five levers as design inputs, with a ten-to-one spread across tiers, reads at a tenth of base input, and batch pricing at half rate, and it quietly punishes teams that treat cost as a finance-quarter surprise, with quadratic context growth, billed invisible thinking, and multipliers that compound through every loop. The discipline is not frugality. It is knowing the cost anatomy, setting the levers on purpose, attributing spend against the metered record instead of the in-session estimate, and measuring ROI as attributed cost per real outcome. Teams that do this spend more confidently, not less.

If you want a second set of eyes on your own cost anatomy, the lever order for your specific workload mix, or the attribution wiring between your telemetry and your invoice, this is work I do with engineering teams: a short session walking your highest-spend agent loop usually finds one lever set by accident, and it's rarely the one anyone suspected.