What is context engineering for a coding agent?

A coding agent on a long task in a large codebase fails or succeeds less because of how cleverly the prompt was worded than because of what was in its context window at the moment it acted. The window holds the system prompt, the instruction files loaded at startup, every file the agent has read, every command output, the tool definitions it can call, the memory it carried in from a previous session, and the entire running conversation. All of it competes for the same finite budget, and each part is a lever you set rather than a fixed given. Context engineering is the discipline of setting those levers deliberately.

Anthropic's engineering team gives the practice its working definition, and it is worth stating in their words because the framing is the whole argument:

"Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts." (Anthropic, "Effective context engineering for AI agents")

The goal that follows from that definition is not "fit more in." It is to find the smallest set of high-signal tokens that gets the desired outcome. That reframing matters most in a large codebase, because a large codebase is the environment where the naive instinct ("load everything the agent might need") is both possible and ruinous. A modern coding agent can be pointed at a repository with hundreds of thousands of files, and its window can hold a million tokens. The temptation is to treat the window as a bucket and the repository as something to pour in. The rest of this guide is about why that fails and what to do instead.

This cornerstone is the coding-agent companion to the context engineering glossary entry, which defines the practice in general terms. Here I narrow it to the case I work in most: an autonomous agent making changes across a real, large codebase, where the context window is the binding constraint and the failure modes are concrete. It sits alongside the agent reliability in production cornerstone, which asks whether an agent's output holds up; this guide asks the upstream question of whether the agent could even see what it needed to produce good output in the first place. A reliable model fed a polluted window still ships its worst work. The window is the part you engineer. It is not the whole unit of delivery; work lands and is checked in files, memory, tests, and validators outside it. It is the working set that governs what the agent can attend to at the moment it acts, and the part you most directly control.

Why doesn't a bigger context window fix this?

The marketing answer is that a million-token window means you no longer have to choose what the agent sees. The measured answer is the opposite, and the gap between the two is the reason this discipline exists.

Start with a study built to isolate one variable. The 2025 paper "Context length alone hurts LLM performance despite perfect retrieval" kept the answer present in the context and grew the amount of surrounding text around it, padding the input so that length itself was the thing under test. Accuracy fell as the input grew. In one condition, a frontier model dropped from about 82 percent on a knowledge benchmark to roughly 15 percent as input scaled toward 30K tokens, and a smaller model on a coding benchmark fell by a comparable margin under the same scaling. The information was there; the model still could not use it once it was buried in a longer input. The mechanism behind this has been named since 2023, when "Lost in the Middle" documented the U-shaped curve: models attend well to the start and end of a long context and poorly to the middle, to the point where information placed in the middle of a long retrieved set provided little measurable benefit over providing no documents at all.

On coding specifically the numbers are starker, because code is unforgiving of a half-remembered detail. LongCodeBench, which evaluates coding models at context scales from 32K up to 1M tokens, recorded a leading model falling from 29 percent to 3 percent on its long software-engineering split as the context scaled from 32K to 256K tokens, and described long context as a weakness shared by every model tested. The takeaway is not that a long window is useless but that filling it is not free: past a point, more tokens buy less usable recall, and code is where that tax bites hardest.

A separate 2026 study of long-context reasoning in automated bug fixing found that the agentic trajectories that succeeded typically stayed under 20K to 30K tokens. One model that reached 31 percent in a lean agentic configuration collapsed to zero under a 64K-token long-context condition. The pattern that the practitioner community named "context rot", and that a controlled study of eighteen frontier models reproduced across all of them, is the same observation from the model's side: as the token count climbs, the ability to recall any specific fact from the window declines, and it begins declining well before the advertised limit.

There is a cost dimension layered on top of the accuracy one. Every token in the window is paid for on every turn, which is why prompt caching exists: Anthropic's caching documentation prices a cache read at roughly one-tenth the cost of a normal input token, a ninety percent reduction that only pays off if the cached prefix is stable and deliberately constructed. A window you fill carelessly is a window you pay full price to reprocess every turn while it actively degrades your results. The larger window did not free you from the choice of what to include. It raised the stakes on getting that choice right, because now you can afford to be wrong at a much larger scale.
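The arithmetic behind that claim can be sketched in a few lines. This is a rough model under stated assumptions, not a pricing calculator: the price per million tokens is a hypothetical placeholder, the 0.1 multiplier is the roughly one-tenth cache-read figure cited above, and the model ignores cache-write surcharges, output tokens, and the way conversation history grows from turn to turn.

```python
def input_cost(prefix_tokens, fresh_tokens_per_turn, turns,
               price_per_mtok=3.0, cache_read_multiplier=0.1, cached=True):
    """Rough input-token cost of a session, in the currency of price_per_mtok.

    price_per_mtok is a hypothetical placeholder, not a quoted price.
    Simplification: ignores cache-write surcharges, output tokens,
    and accumulating conversation history.
    """
    per_token = price_per_mtok / 1_000_000
    total = 0.0
    for turn in range(turns):
        # The first turn always processes the prefix at full price;
        # later turns read it from cache only if caching is enabled.
        if cached and turn > 0:
            prefix_rate = per_token * cache_read_multiplier
        else:
            prefix_rate = per_token
        total += prefix_tokens * prefix_rate + fresh_tokens_per_turn * per_token
    return total


# A 50K-token stable prefix, 2K fresh tokens per turn, 40 turns.
uncached = input_cost(50_000, 2_000, turns=40, cached=False)
with_cache = input_cost(50_000, 2_000, turns=40, cached=True)
```

With these placeholder numbers the uncached session costs about 6.24 in input tokens and the cached one about 0.98. The saving depends entirely on the prefix staying byte-for-byte stable, which is the part that has to be constructed deliberately.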

What goes in a coding agent's context, and who decides?

If the window is the unit of work, the first job is knowing what is in it. A practical inventory for a coding agent is five layers, and I mapped this same set when I built a tool to audit them: the instruction files loaded at the project root and in nested directories, the agent and subagent definitions, the settings and hook scripts, the skill files that carry domain knowledge, and the application code where the model is called. That five-layer audit is documented in a case study, and the reason it was worth building a tool for is that every one of those layers is loaded on some schedule, and anything loaded automatically is paid for on every turn.

The highest-leverage layer to curate is the persistent instruction file, the CLAUDE.md that an agent reads at the start of every conversation. Anthropic's own best-practices guidance is unusually blunt about how to treat it: include only what the agent cannot infer from the code itself, keep it short, check it into version control so the team contributes to it, and prune it ruthlessly. In their words, bloated instruction files cause the model to ignore the actual instructions. The guidance recommends treating the file like code, reviewing it when behavior drifts and testing changes by observing whether the agent's behavior shifts. This is the heart of what practitioners mean by context as code: the file that shapes the agent is a versioned, reviewed, pruned artifact, not a scratchpad. I have written separately about where each kind of rule should live across instruction files, settings, skills, and hooks, because the routing decision is itself a context-budget decision: every rule you put in the always-loaded file is a rule you pay for on every turn whether or not it is relevant.
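One way to treat the file like code is to give it a check that runs with the rest of the test suite. The sketch below is an illustration, not something Anthropic's guidance prescribes: the function name and both size limits are hypothetical, and a team would set its own budget.

```python
def audit_instruction_file(text, max_lines=150, max_chars=8_000):
    """Return a list of findings; an empty list means the file passes.

    Both limits are hypothetical placeholders, not recommended values.
    """
    findings = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) > max_lines:
        findings.append(
            f"{len(lines)} non-blank lines exceeds budget of {max_lines}")
    if len(text) > max_chars:
        findings.append(
            f"{len(text)} characters exceeds budget of {max_chars}")
    # Flag rules that appear more than once (case-insensitive):
    # a repeated rule costs tokens on every turn and adds no information.
    seen = set()
    for ln in lines:
        key = ln.lower()
        if key in seen:
            findings.append(f"duplicate rule: {ln!r}")
        seen.add(key)
    return findings
```

A check like this only catches bloat and repetition. Whether a rule is something the agent could have inferred from the code still takes a human reviewer.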

The evidence that curation is a real lever, and that careless curation backfires, is now empirical. A 2026 study of AGENTS.md instruction files measured a reduction of about 29 percent in wall-clock completion time and about 17 percent in output token consumption across more than a hundred real pull requests, though it is worth being precise that the study measured efficiency, not task-success rate. A second 2026 study from ETH Zurich evaluating repository-level context files is the one that should change how you write them: human-written files focused on specific, non-inferable constraints improved success by about 4 percent, while machine-generated files that led with codebase overviews actually reduced it, by about half a percent on SWE-bench Lite and two percent on AgentBench, and raised inference cost by twenty to twenty-three percent across those benchmarks. The lesson is not that instruction files do not work. It is that an instruction file repeating what the agent can already read is negative-value context, and that the discipline is curation, not accumulation. The same logic governs agent memory, the durable notes an agent carries between sessions: memory is a context layer you govern, and I have written about reading the files an agent actually wrote about a project precisely because what persists into the next session's window deserves the same scrutiny as what you write by hand. Knowledge that is only sometimes relevant does not belong in the always-loaded file at all; it belongs in on-demand skill files that the agent loads only when their description matches the task, which is how you give an agent a hundred capabilities without paying for a hundred capabilities on every turn.

There is a hard limit to what curation alone buys you, and it is the honest seam in this section. Instruction files are advisory. The agent reads them and decides, every turn, how literally to apply them, and under context pressure it often stops applying them. This is where context engineering hands off to enforcement, a boundary the governing agentic development discussion turns on, and which I take up in the failure-mode and evaluation sections below.

How does the agent get the right code into context?

If you cannot pour the whole repository into the window, the agent has to fetch what it needs on demand. This is just-in-time retrieval, and it is the move that lets a coding agent operate on a codebase far larger than its window. Anthropic describes the pattern its own coding agent uses as a hybrid: the instruction file is loaded up front, while primitives like glob and grep let the agent navigate the file system and pull files into context just in time, the way an engineer greps a repository rather than memorizing it. The win is structural. The agent holds a small, task-shaped working set instead of a large, mostly-irrelevant one.
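The glob-then-grep pattern can be sketched with the standard library alone. This is a minimal illustration of the idea, not how any particular agent implements it: `grep_repo`, its parameters, and the character budget standing in for a token budget are all assumptions.

```python
import re
from pathlib import Path


def grep_repo(root, pattern, glob="**/*.py", max_chars=4_000, context=1):
    """Return (relative_path, line_no, snippet) hits under a size budget.

    max_chars is a crude stand-in for a token budget (an assumption).
    """
    regex = re.compile(pattern)
    hits, used = [], 0
    for path in sorted(Path(root).glob(glob)):
        if not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        for i, line in enumerate(lines):
            if not regex.search(line):
                continue
            # Keep a little surrounding context, not the whole file.
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            snippet = "\n".join(lines[lo:hi])
            if used + len(snippet) > max_chars:
                return hits  # budget exhausted: stop pulling context
            hits.append((str(path.relative_to(root)), i + 1, snippet))
            used += len(snippet)
    return hits
```

The design choice that matters is the early return: the working set is capped by the budget, not by how many matches the repository happens to contain.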

The most consequential recent development on this front is search over tools rather than over files. A coding agent in a real environment may have access to dozens of external tools through the Model Context Protocol, and naively every one of those tool definitions sits in the window before any work begins. Anthropic's advanced tool use work reports that a handful of connected servers can consume tens of thousands of tokens in definitions alone. Loading tool definitions on demand instead, searching for the few relevant tools rather than presenting all of them, cut token consumption by more than eighty percent while raising tool-selection accuracy from about 80 percent to 88 percent on one model. The same principle that governs file context governs tool context: present the smallest relevant set, retrieved on demand, not the whole catalog up front. The deeper treatment of this protocol layer lives in the MCP servers in production cornerstone; here it is one more surface where just-in-time beats load-everything.

The benchmarks that test this on real codebases are blunt about both the upside and the limits. Sourcegraph's vendor-produced CodeScaleBench, run across more than forty repositories including Kubernetes and Django, found that retrieval-augmented agents roughly doubled file recall and more than tripled retrieval precision over a baseline. In one case from that benchmark, a baseline agent that timed out after two hours on a large monorepo was replaced by a retrieval-equipped run that finished the same task in 89 seconds. A 2026 study on "the navigation paradox" found graph-structured code navigation reaching 99 percent task success on hidden-dependency problems against 78 percent for keyword search, but it also found the real bottleneck is adoption: agents that invoked the navigation tool succeeded almost always, and the 58 percent of trials where the agent skipped the available tool fell back to baseline performance. That is a context-engineering finding disguised as a retrieval finding. Giving the agent a good retrieval path is necessary; getting the agent to use it, rather than dumping files into the window, is the other half.

Even with the right retrieval path, the gains have a ceiling. The sober counterweight comes from ContextBench, which found that current coding agents favor recall over precision, that they leave a substantial gap between the context they explore and the context they put to use, and that sophisticated scaffolding produced only marginal gains over simpler baselines. Retrieval is a lever, not a cure. The most striking result in this space points the same way. A 2026 preprint titled "Coding Agents are Effective Long-Context Processors" found that organizing long-context work in the file system beat the published state of the art by about 17 percent across a set of long-context benchmarks. The durable answer is to keep the window small and the external store large, not to make the window enormous.

How does context engineering fail?

Curated context still fails, and the failures have names. The most useful taxonomy comes from Drew Breunig, who sorted the ways long contexts fail into four modes that any production team will recognize. Context poisoning is when a hallucination or error enters the window and then gets referenced again and again, compounding because the agent treats its own earlier mistake as established fact. Context distraction is when the window grows so long that the model over-focuses on its accumulated history and neglects what it learned in training. Context confusion is when superfluous content in the window drags the response toward irrelevance. Context clash is when new information or tools contradict what is already in the window, and the model has no principled way to reconcile the two. These are not abstractions; they are the daily failure surface of an agent on a long task, and the glossary entry's phrase for the slow version, silent context rot, captures the worst property: a window that has quietly filled with stale output still returns an answer, just a worse-grounded one, and nothing surfaces the degradation unless you built something to surface it.

The quantitative backing for these modes has arrived fast. A widely cited 2025 study found that when a task was split across multiple conversational turns rather than delivered all at once, performance dropped by an average of 39 percent across fifteen models. The models made early assumptions and then clung to them as the context accumulated, a direct mechanical account of context poisoning and clash. A 2025 study of why multi-agent systems fail annotated more than 1,600 execution traces and catalogued fourteen distinct failure modes. Several of them (loss of conversation history, step repetition, reasoning that no longer matches the accumulated context) are squarely context-management failures rather than reasoning failures. And on long iterative coding sessions specifically, SlopCodeBench found that agent-written code grew more than twice as verbose and structurally eroded over a long horizon compared to reference code, with structural erosion rising across more than three-quarters of trajectories, which is what context distraction looks like when it accumulates over hours.

The most instructive failure I have on record is one I documented myself. When a new model release could not be trusted about its own tool state, the agent's belief about what its tools had returned diverged from what they had actually returned, and the fix was a context-engineering fix: have the agent redirect command output to a file and read it back, rather than trusting the in-context summary of what happened. That is context poisoning caught in the wild, and the response was to move the source of truth out of the window and onto disk where it could not be silently corrupted. The pattern recurs at scale: when you run many subagents at once, the thing that protects the work is not the agents' shared context but the verification contract you wrote before they started, because a shared, accumulating window across many workers is a poisoning surface waiting to happen. The mitigation that recurs across all four failure modes is the same one the glossary entry names: isolate work in a subagent with its own window so its intermediate tokens never enter the main one, and the main conversation receives only the distilled result.
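The redirect-and-read-back fix is simple enough to sketch. This is a minimal illustration using Python's standard library, with hypothetical helper names; it shows the principle of making the file on disk the source of truth, not the exact mechanism used in the incident described above.

```python
import subprocess
from pathlib import Path


def run_and_record(cmd, log_path):
    """Run cmd, sending stdout and stderr to log_path. Returns the exit code."""
    log_path = Path(log_path)
    with log_path.open("w", encoding="utf-8") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def read_back(log_path, tail_lines=50):
    """Return the last tail_lines lines of the recorded output.

    Reading from disk means the answer comes from what the command
    actually wrote, not from an in-context recollection of it.
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-tail_lines:]
```

Reading only the tail keeps a noisy command from flooding the window, while the full log stays on disk for anything that needs to check it later.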

How do you know your context strategy is working?

A failure mode you cannot measure is one you will ship. The uncomfortable state of practice is that most teams cannot measure this one well yet. LangChain's State of Agent Engineering survey of more than 1,300 practitioners found that while nearly all production deployments had some observability and most had full tracing, evaluation still leaned on people: close to 60 percent used human review and about half used a model as a judge. Quality remained the top deployment blocker. The instrumentation the discipline needs is not yet standard, which means the teams that build it have an edge.

What you instrument is specific. Agent observability work argues that the spans worth capturing are not the ones a traditional uptime dashboard shows; a service can be up and an agent can still pick the wrong tool, loop silently, or return a confidently wrong answer with a 200 status code. The spans that matter for context quality are tool calls, reasoning steps, the state of working memory before and after each step, and every memory read or write, because those are where the window changes and therefore where it can rot. The benchmarks point the same way: a 2026 context-learning benchmark for coding found that accurately summarized and retrieved prior context improved both resolution rate and cost on hard tasks, while unfiltered or wrongly selected context delivered limited or even negative benefit. Measuring context quality means measuring not how much the agent gathered but how much of what it gathered it used well.
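A hand-rolled version of that instrumentation might look like the sketch below. It is not tied to any tracing library: `Window`, `Span`, and `ContextTrace` are hypothetical names, and the window is modeled as nothing more than a running token count.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Window:
    """Hypothetical stand-in for the agent's context window."""
    tokens: int = 0

    def add(self, n):
        self.tokens += n


@dataclass
class Span:
    kind: str            # e.g. "tool_call", "reasoning", "memory_read"
    name: str
    tokens_before: int
    tokens_after: int = 0
    duration_s: float = 0.0


@dataclass
class ContextTrace:
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, kind, name, window):
        """Record the window size before and after one step."""
        s = Span(kind, name, tokens_before=window.tokens)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.duration_s = time.perf_counter() - start
            s.tokens_after = window.tokens
            self.spans.append(s)

    def growth_by_kind(self):
        """Total tokens added to the window, grouped by span kind."""
        totals = {}
        for s in self.spans:
            delta = s.tokens_after - s.tokens_before
            totals[s.kind] = totals.get(s.kind, 0) + delta
        return totals
```

Wrapping each tool call or memory read in `trace.span(...)` yields a per-kind account of where the window grew, which is the raw material for asking how much of what was gathered was later used.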

The honest complication, and the reason this section does not promise a clean metric, is that the most careful study I found cuts against the simplest measurement you might reach for. An analysis of more than 9,000 agent trajectories titled "beyond resolution rates" found that the obvious correlation, longer trajectories fail more, is largely a confound of task difficulty: harder tasks produce both longer contexts and more failures, and once you control for difficulty the naive correlation weakens or reverses. The study's finding was that the agent's context-gathering strategy before it started editing predicted success better than raw session length did. So the metric is not "keep the window short." It is "gather the right context before acting, and instrument whether you did." That is harder to measure, and it is the right thing to measure. The enforcement layer that makes any of this stick, the deterministic validators and verification loop, is treated in full in the agentic AI governance in production cornerstone; context engineering decides what the agent sees, and those gates decide what is allowed to ship regardless.
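The difficulty confound is easy to reproduce on toy data. The counts below are synthetic, invented purely to illustrate the statistical effect; they are not figures from the study. The sketch compares failure rates for long and short trajectories overall and then within each difficulty level.

```python
from collections import defaultdict


def failure_rates(runs):
    """runs: iterable of (difficulty, is_long, failed) tuples.

    Returns {(difficulty, is_long): failure_rate}, plus pooled rates
    under the key ("all", is_long).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for difficulty, is_long, failed in runs:
        for key in ((difficulty, is_long), ("all", is_long)):
            counts[key][0] += int(failed)
            counts[key][1] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}


def make_runs(difficulty, is_long, total, failures):
    """Build synthetic runs: the first `failures` of `total` are failures."""
    return [(difficulty, is_long, i < failures) for i in range(total)]


# Synthetic data: hard tasks are mostly long, easy tasks mostly short,
# and within each difficulty level length makes no difference at all.
runs = (make_runs("easy", False, 80, 8) + make_runs("easy", True, 20, 2)
        + make_runs("hard", False, 20, 10) + make_runs("hard", True, 80, 40))
rates = failure_rates(runs)
```

Pooled, long trajectories fail at 42 percent against 18 percent for short ones. Within each difficulty level the two rates are identical (10 percent on easy tasks, 50 percent on hard ones), so the pooled gap comes entirely from which tasks tend to produce long trajectories.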

Isn't this just going away as context windows get bigger and cheaper?

This is the strongest objection, and it deserves the strongest version rather than a strawman. The argument runs: long-context models keep improving, windows keep growing and getting cheaper, and the whole apparatus of retrieval, compaction, and curation is scaffolding that the next model generation will render unnecessary. There is real evidence behind it. A 2025 result showed that a simple retrieve-then-read baseline with a long-context model consistently matched or beat more elaborate multi-stage retrieval pipelines, which is a genuine argument that complexity for its own sake is not paying off. The "context length alone hurts" study itself found that a lightweight prompting intervention recovered a few percentage points of the loss, suggesting some of the degradation is addressable without abandoning large windows. And the strongest single counter-result, that organizing long-context work in the file system beat the published state of the art by about 17 percent, can be read as saying the answer is architecture, not eviction policy.

I take the objection seriously, and I think it sharpens the thesis rather than refuting it. Notice what each of those results says. The retrieve-then-read baseline is still retrieval; it just argues for the simplest retrieval that fits the window, which is a context-engineering choice, not the absence of one. The recovery from a prompting intervention is the discipline working, not the discipline being unnecessary. And the file-system externalization result is the purest statement of the thesis available: a recurring production pattern is to keep the window small and put the durable state outside it, loading the relevant subset back in when the work needs it. One version of the objection, that bigger windows will erode the advantage of light-touch curation faster than anyone can refine it, runs into the one finding curation skeptics most need to explain away: instruction files, the lightest-touch form of curation, are advisory, and a 2026 study of enforcing executable constraints from instruction files measured passive instruction files being honored on only about two-thirds of constraint checks, with executable enforcement closing most of the remaining gap. A larger window has not been shown to raise that compliance number; for constraints that must hold every time, capacity is not the control, executable enforcement is.

I should concede the strongest form of the objection rather than absorb it, because folding every counter-result into "that is still context engineering" can slide into being unfalsifiable. The strongest version is not about instruction files at all. It is that a high-recall architecture, a large cheap window deliberately filled and then checked by retrieval, tests, and validators, may beat tight curation in environments where a retrieval miss costs more than a little distraction does. I think that is sometimes true, and it sharpens the thesis into a testable claim rather than refuting it. The durable claim is not that smaller windows always win. It is that reliability depends on a tested policy for what evidence reaches the model and its tools at the moment it acts, and that any such policy, curated or high-recall, should be measured against the other on task success, retrieval misses, cost, and validator failures. If the big-window baseline wins on those, it has earned the window.

So the resolution is not "curation versus big windows." It is that reliability comes from a stack: curated instruction files for what the agent should know, just-in-time retrieval for what it needs now, durable external memory for what must survive, named failure-mode monitoring for what goes wrong, and deterministic enforcement for what must happen every time. The window getting bigger changes the budget. It does not retire the job of spending it well.

How does this page stay current?

This cornerstone is the coding-agent deep companion to the context engineering glossary entry and a peer of the agent reliability in production cornerstone. The primitives it names connect outward: the context window is the constraint, agent memory is the durable store outside it, prompt caching is the economics of a stable prefix, the CLAUDE.md instruction file is the curated always-loaded layer, the Model Context Protocol is where tool context is retrieved on demand, and a subagent is the isolation boundary that keeps one task's tokens out of another's window. The enforcement half of the argument lives in the agentic AI governance in production and MCP servers in production cornerstones, and the foundational workflow it assumes is the plan, audit, implement, verify cycle.

The anchor for this page is a first-party operational record, kept next to the body and updated when a context-budgeting practice, a retrieval pattern, or a failure-mode mitigation changes in how I run agents day to day. The Sources roster tracks the freshness of each external anchor under this site's caps: three months for AI and tool statistics, six months for tool-capability claims. A row past its cap is held only when a documented search trail shows what was looked at and why nothing fresher qualified, which for a fast-moving research field means the landmark papers that named a phenomenon stay even as fresher work extends them.

Building the context layer for a team's own codebase, the curated instruction files, the retrieval setup, the memory governance, and the deterministic gates that keep curation honest, is what a consultation, workshop, or implementation engagement around agentic development is for. The window is the part you engineer, and engineering it well is the difference between an agent that helps on the easy first turns and one that stays reliable deep into a large codebase.