Every guide to spending less in Claude Code hands you the same list. Switch to a cheaper model. Clear your context. Write shorter prompts. The advice is fine. It's also why you can follow all of it and still burn your weekly Opus limit by Wednesday, with worse code to show for the trouble.

Here is the part the lists miss. For most of the work you do, the moves that cut your token spend are the same moves that sharpen the output. So this is not a tradeoff you split down the middle. Three levers move both, and the goal is not minimum spend. It is the cheapest setup that still clears your quality bar.

One scoping note, because it changes the math. By agentic development I mean the autonomous loop: Claude Code plans, reads files, runs commands, and edits code while you supervise, instead of you driving each step. That loop is where token spend grows fast, and it is who this post is for, not people typing one question at a time.

GitHub tested the "spend and quality move together" idea on its own work. It tuned a dozen of its agentic workflows, the kind that run in GitHub Actions and call a model in a loop; the published results cover five with enough before-and-after runs. Token counts fell by up to 62 percent. Where GitHub tracked process signals like completion, they stayed steady. It did not directly grade output quality, so read that as "cheaper, not worse," not "cheaper and better." The waste was structural. Unused tool registrations padded the context, and pruning them cut per-call context by 8 to 12 kilobytes. Routine data-gathering ran through the model instead of a plain script. None of it made the output better; it just cost more.

So this is not a frugality lecture. The GitHub case proves the cheap half: you can strip structural waste without the work getting worse. The quality half comes from two mechanisms you will meet below, overthinking on routine tasks and fading recall in a bloated context. Three levers carry both: the model and effort you run, the shape of your context, and how wide you fan out. I'll take them one at a time. First, the idea that ties them together.

Tip

Start here, then read why. A default that keeps spend low without much thought: run /model opusplan for ambiguous implementation work, or Sonnet at /effort medium for scoped coding. Escalate to Opus at /effort xhigh only for architecture, hard debugging, or final review. Run /clear between unrelated tasks and /compact only when one task has grown long. Fan out to subagents only when the pieces are truly independent, and check /usage and /context before you change anything else. The three lever sections below are the why.

The thing that costs you is the session, not the prompt

It's natural to picture token spend like a taxi meter on each message. You send a prompt, you pay for that prompt, the meter resets. That's not how an agentic session works. Not even close.

Claude Code re-reads the whole context on every turn. Context is everything the model can see right now: the system prompt, your CLAUDE.md rules, the files you have opened, the commands you have run, and the conversation so far. Each new turn the model processes all of it again. So the cost of turn forty is not the cost of your fortieth instruction. It's the cost of carrying everything that came before it, one more time.

This is why a long session gets expensive even when your prompts stay short. A 2026 study of real coding-agent runs, a withdrawn ICLR submission analyzing OpenHands agents on SWE-bench, found that input tokens, the stuff being re-read, dominate the cost of agentic coding. The same study found something humbling: how hard a task looks is a weak predictor of its token use. Harder tasks trend higher, but the variance is large, and the same task can use ten times more tokens on one run than another. The last prompt is rarely the whole driver. What dominates is the accumulated shape of the session, and prompts matter most when they change that shape: which tools run, how much output you keep, when the loop stops. You cannot control one run. You can control the shape every run starts from, and that is what the three levers do.

Look at what is loaded before you type a single word. In one documented Claude Code setup, the startup context, system prompt plus memory plus your CLAUDE.md and tool names, came to roughly 7,850 tokens. Then a single file read can add a couple thousand more. A noisy command can dump tens of thousands.

0 tokens loaded at startup, before your first prompt (documented example)
0 tokens one file read can add to your context
0 default token cap on a single MCP (Model Context Protocol) tool's output

Now, what does "spend" even mean here? It splits two ways, and the split matters.

If you are on Pro or Max, you do not pay per token. Your tokens show up as limits, not invoices. As of June 2026, Pro has a five-hour usage window plus a weekly cap. Max raises the per-session allowance by 5x or 20x but still has weekly caps on top. Hit a cap and Claude Code slows or stops until the window rolls over. That is the wall most agentic users feel: not a bill, a locked door on a Wednesday afternoon.

If you run Claude Code on an API key instead, you pay in dollars per token. Anthropic's own enterprise numbers put the average around $13 per developer per active day, and under $30 a day for ninety percent of users.

Either way, you are spending the session's shape, not the prompts.

Lever one: stop reaching for Opus at full effort by default

The single biggest line item is which model runs and how hard it thinks. You rarely set this on purpose. You take the default, and the default is often the most expensive thing in the room.

Two numbers set the price, and the ratio is the part to remember: Opus runs about 1.7 times Sonnet per token, and Haiku is cheaper than both. On a subscription, Opus also draws down your cap far faster. The current June 2026 rates, per million tokens: Opus 4.5 through 4.8 at $5 in and $25 out, Sonnet 4.6 at $3 and $15, Haiku 4.5 at $1 and $5.

Effort is the second number. It is a dial for how hard the model works on a turn: how much it thinks, how many tools it calls, how much it writes. It is not a hard cap. Higher effort usually means more thinking, more tool calls, and more output, so it moves the whole token bill, not just the reasoning. Opus 4.8 and 4.7 offer five levels, from low up through xhigh and max. Sonnet 4.6 goes up to max, skipping only xhigh.

ModelPrice in / out (per Mtok)Effort rangeDefault effort in Claude Code
Opus 4.8$5 / $25low to max (incl. xhigh)high
Sonnet 4.6$3 / $15low to max (no xhigh)high
Haiku 4.5$1 / $5no effort parameternot applicable

Those rates are current as of June 2026, from Anthropic's pricing and model configuration docs. Prices and defaults move, so check the live page before you wire a rule around a number.

Here is where the quality-and-cost link shows up in the open. You might assume max effort always means a better answer. Anthropic's own docs say the opposite for routine work: on most workloads, max "adds significant cost for relatively small quality gains," and on structured tasks it "can lead to overthinking." The setting Anthropic recommends as the starting point for coding and agentic work is xhigh, not max. For Sonnet 4.6, the recommended agentic default is medium, even though it ships at high.

On routine work, the cheaper setting is often the better one too, because max effort can overthink. It is not a universal law. But it is the common case, and it is the opposite of what most cost advice assumes. Max effort on a rote file edit doesn't just cost more. It wanders.

So what do you do? Route by task, not by habit.

  • For mechanical work, classification, formatting, simple edits, reach for Haiku or low-effort Sonnet.
  • For most coding in an agentic loop, Sonnet 4.6 at medium or high is the workhorse.
  • For genuinely hard reasoning, Opus, and set /effort xhigh by hand. Note that Opus 4.8 ships at high in Claude Code, not xhigh, so the hardest tasks need you to turn it up.

There is a built-in setting that does some of this for you. opusplan (set it with /model opusplan) runs Opus while the agent plans, then drops to Sonnet to execute. You keep the stronger model on the part that needs judgment and hand the bulk typing to the cheaper one.

One caveat to state carefully, because it is easy to misquote. Opus 4.7 and later use a new tokenizer that can use up to 35 percent more tokens for the same text than earlier models. The per-token price did not change, so your effective Opus cost runs higher than the sticker rate suggests. That 35 percent figure is specific to Opus 4.7 and up. It does not apply to Sonnet 4.6 or Haiku 4.5.

I lean on this routing in production. The AI Readiness Assessment behind this site runs a roughly twenty-turn agent, and it does not use one model for the whole thing. Haiku 4.5 handles the turn-zero classification, the first routing decision before the conversation starts. Sonnet 4.6 carries the nineteen conversational turns. Opus 4.7, at xhigh effort, runs only the final synthesis, the one turn where the reasoning depth earns its price. I run it this way because the cheap turns are low-risk and the one expensive turn is where synthesis quality actually matters, so the routing saves spend without putting the hard reasoning on a weaker model.

If you want the full grid of model against effort, I mapped it in the model and effort matrix, and the question of when xhigh earns its premium gets its own treatment in the effort ladder.

Lever two: the session is a multiplier, so keep it clean

Recall the first idea. Every turn re-reads everything in context. That makes context the multiplier on your whole session. A bloated context doesn't cost you once. It costs you on every turn that follows, forever, until you clear it.

It also makes the model worse, and that is the part the cost-cutting lists miss. As a session fills up, recall and accuracy drop. A model carrying 180,000 tokens of half-relevant history is not just pricier than one carrying 20,000 tokens of focused context. It is dumber. Long-context coding benchmarks have measured accuracy falling sharply as the window fills, on models a generation old, so read it as directional rather than exact. Practitioner reports on long Claude Code sessions match the pattern: the model loses track of decisions it made an hour ago, and the effective working window tops out well below the advertised maximum. So trimming context is not a tax you pay to save money. It is how you keep the answers sharp. The nuance is what you trim: clear unrelated history and bulky tool output, but keep the requirements, the decisions you have made, and the error traces the agent still needs. Cut those to save tokens and a clean session turns into a forgetful one.

The levers here are a handful of commands and one habit.

  • /clear starts a fresh, empty session. Use it between unrelated tasks. The prior session is still there if you need it. Leaving three unrelated jobs in one window taxes every later turn for no benefit.
  • /compact summarizes the conversation so far and drops the raw history. You keep the gist, you shed the bulk. It is the move for a long single task that has accumulated noise.
  • /context shows you what is loaded right now, broken down by system prompt, memory, skills, MCP tools, and messages. Run it when a session feels heavy. You can't fix what you can't see.
Before

A session you never clean

  • Three unrelated tasks share one window
  • Every file you opened is still loaded
  • The model re-reads all of it each turn
  • Slower, pricier, and it starts to drift
After

A session you shape

  • /clear between unrelated tasks
  • Big reads pushed into a subagent
  • Only what the task needs is in context
  • Cheaper, and the answers stay sharp

A few structural habits matter as much as the commands. Keep your CLAUDE.md tight; Anthropic recommends under 200 lines, because every line loads on every session. Watch MCP tool output, which can dump up to 25,000 tokens from a single call by default. And scope your file reads instead of letting the agent open a directory to "look around."

Claude Code will also compact for you when a session gets near full. As of mid-2026 that auto-compaction kicks in before the window is completely full, and you can lower that trigger but not raise it. The exact percentage has shifted across versions, so trust the behavior, not a number: it fires before the brim, and you cannot push it later. Useful to know, but don't lean on it. By the time auto-compaction fires, you have already paid to carry a near-full context for many turns. Clearing on purpose is cheaper than waiting for the floor to give way.

Lever three: fanning out cuts both ways

The third lever is the one that scales fastest in both directions, so it earns the most care. Subagents are separate Claude instances that Claude Code spins up to handle a piece of work in their own context window. Fan out across several and you get subagent orchestration: parallel agents working at once.

Used well, a subagent saves you context. It can read a giant log or run a noisy test suite in its own window and hand back a short summary, so the 6,000 tokens of raw output never land in your main session. That trims your main session for sure, but not always your total spend: the subagent still pays its own startup, reads, and reasoning. It nets out ahead only when the re-reading you avoid in the parent outweighs the subagent's own cost.

Used by reflex, fan-out is the most expensive thing you can do. Each agent carries its own full context and runs as its own model instance. Claude Code's docs put agent teams, where several teammates run together in plan mode, at about seven times the tokens of a standard session. Heavier builds cost more still: Anthropic's own internal research system, a larger multi-agent setup, runs about fifteen times a single chat. Those two numbers use different baselines, a session versus a single chat, so do not read them as one scale. The direction is what holds: more agents, more tokens, fast.

0 x the tokens an agent team burns versus a standard session in plan mode, because each teammate carries its own context window (Claude Code docs)

So the question is never "should I use subagents." It is whether the work splits into independent pieces that each run on a bounded context and return something compact you can audit. Independence alone is not enough: a subagent whose summary hides the evidence you need to check costs more than it saves. When the split is clean, fan-out buys real speed. I measured a three-agent parallel dispatch at 14.87 seconds against 43.8 seconds run one after another, roughly three times faster. That speed is real, and so is the token bill that comes with running three agents instead of one. I break that decision into a fuller checklist, when orchestration is worth it and when it is not, in when to orchestrate subagents.

Two more habits keep fan-out honest. Give subagents enough room to do their job. An under-resourced agent fails quietly: it returns a weaker answer instead of an error. That is the risk when a token-saving cut goes too far. And on any autonomous loop, set a stop condition and a token ceiling, so a run that goes sideways cannot burn your whole window before you notice. I wrote about designing those stop conditions and verification contracts separately; for spend, the point is simpler. A loop with no ceiling is a blank check.

You cannot tune what you cannot see

Everything above assumes you know where your tokens go. If you are guessing instead of measuring, you are flying blind. Given that the same task can swing ten times in token use from one run to the next, guessing is a bad plan. Measure instead.

The session view is /usage. If /cost is in your muscle memory, it still works: /cost and /stats are documented aliases that open the same screen. /usage shows your token use for the session, an estimated dollar figure for API users, plan bars for subscribers, and a breakdown across skills, subagents, plugins, and MCP servers, with 24-hour and 7-day views. That breakdown is the gold. It tells you which MCP server or which subagent is eating your budget, so you fix the thing that matters instead of trimming at random.

For a team, turn on telemetry. Set CLAUDE_CODE_ENABLE_TELEMETRY=1 and Claude Code exports OpenTelemetry metrics, including claude_code.cost.usage in dollars and claude_code.token.usage in tokens, tagged by whether the work came from the main thread, a subagent, or an auxiliary call. That is enough to see cost per developer, per model, per agent type, instead of one number at the end of the month. If you want the broader case for watching what an agent does rather than what it says it did, I made it in the post on agent observability.

Then there are the guardrails that stop a runaway before it spends. For an account ceiling: workspace spend limits and the /usage-credits monthly cap. For a single run: --max-turns on the CLI and max_budget_usd in the Agent SDK. Put the run-level caps on anything unattended.

Warning

The most common surprise bill is not a runaway loop. It is the ANTHROPIC_API_KEY in your environment. If that key is set, Claude Code bills your API account at per-token rates and ignores your Pro or Max subscription entirely. It can be inherited from a .zshrc, a .env file, or a CI secret you forgot about. Run unset ANTHROPIC_API_KEY and confirm your login with /status if you meant to be on your subscription.

One change worth a note before it lands. Starting June 15, 2026, Agent SDK and headless claude -p usage on a subscription will draw from a separate monthly credit pool instead of your interactive limits. If you run automation through the SDK, that changes your budget math. I worked through what it means per workload in the piece on the Agent SDK credit pool. For day-to-day interactive use in the terminal, it changes nothing.

The plan you are on sets the ceiling these three levers work within, and it is the choice I get asked about most. My rough rule, from provisioning this across a team: a daily-driver engineer wants Max 5x at $100 a month, and a true power user, someone running agentic loops most of the day, is better off on Max 20x at $200. The entry tier is a different product, not a discount. Match the plan to the usage shape, then tune the three levers so the plan you have goes further.

When the right move is to spend more

I've spent this whole post arguing that spending less and getting better output are the same job. True, for the work you do most days. But not everywhere. The boundary is worth naming, because optimizing past it is its own kind of waste.

Some tasks sit at the frontier of what the model can do. For those, Anthropic's guidance is explicit: max effort is the right call, the one setting that buys capability xhigh cannot reach. Dropping to high to save tokens on a genuinely hard architectural problem can hand you a worse answer. The trouble is that you cannot always tell in advance which bucket a task is in. When the stakes are high and the problem is hard, spend the tokens.

Model choice has the same edge. Swapping Opus for Sonnet on routine work is free quality. Doing it on a task that needs Opus's reasoning is not. Research on shrinking model capability to save cost found real-world accuracy dropping ten to fifteen percent in 2025 tests, once the model was squeezed too hard. The lever is safe in the middle of its range and dangerous at the end.

And here is the honest scope. An Anthropic engineer put the share of users hit by peak-hour limits at around seven percent. If you are not in that slice, and you are not paying per token through an API key, the cheapest move is to not spend your attention on this at all. Tune the defaults once and get back to work.

This is also where the levers meet the bigger picture. The seat license is the cheapest part of an AI coding tool, a point I argued in the post on true cost. These levers are not about pinching the seat. They are about not degrading the output by dragging a bloated context or the wrong model through your work. The cost saving is the byproduct. The sharper output is the point.

FAQ

Does Claude Code use my subscription or my API key? If ANTHROPIC_API_KEY is set in your shell, Claude Code uses it and bills your API account at per-token rates, ignoring your Pro or Max plan. The key can come from a .zshrc, a .env file, or a CI secret. Run unset ANTHROPIC_API_KEY and confirm with /status. This only affects terminal sessions; Claude Desktop and the web sandbox use your subscription login.

Is it /usage or /cost for token spend? Either. /usage is the primary command, and /cost and /stats are documented aliases that open the same screen. It shows session token use, an estimated dollar cost for API users, plan bars for subscribers, and a breakdown across skills, subagents, plugins, and MCP servers. The session figure is a local estimate and does not include other devices or claude.ai; your authoritative bill is in the Console.

Is Max 5x or Max 20x worth it for daily agentic work? Max 5x ($100/month) gives five times Pro's per-session usage; Max 20x ($200/month) gives twenty. If you run agentic loops most of the day and keep hitting the cap on 5x, the jump to 20x usually costs less than the time you lose waiting for the window to reset. If you only hit the wall now and then, tune the three levers first.

What is opusplan mode and does it save tokens? opusplan (set with /model opusplan) runs Opus while the agent plans, then switches to Sonnet to execute. Opus costs about 1.7x Sonnet per token, so handing code generation to Sonnet cuts spend while keeping planning on the stronger model. One caveat: the Opus planning phase uses the 200K context window, not the 1M window.

Are subagents worth the extra tokens? Sometimes. A subagent runs in its own context window, so it can read a large file or run a noisy command and return a short summary, keeping your main session clean. But agent teams cost about seven times a standard session in plan mode. Fan out when the subtasks are truly independent and the saved context beats the multiplied spend, not by default.

The knob in three places

Token spend in Claude Code is not a meter on your prompts. It's the shape of your session, and three structural choices set that shape: the model and effort you run, how clean you keep your context, and how wide you fan out. Tune those and your weekly limit goes further. The reason it's worth doing isn't the money. It's that every one of those choices also decides whether the output is any good. For the day-to-day, you are not trading quality for cost. You are finding the cheapest setup that still holds the quality bar, and it usually sits below your default.

If you want the routing half in more depth, start with the model and effort matrix. And if you would rather have these levers wired into your team's workflow than tune them by hand, that is the kind of thing I build in an agentic workflow engagement. A fifteen-minute call is enough to scope it.

Glossary terms used