The bounded span of tokens a model can read and reason over in a single request, holding the system prompt, prior turns, supplied material, the current input, and the reasoning and output the model produces.
How it works
Every request is assembled into one sequence of tokens that must fit inside the window: the system prompt, the running conversation, any retrieved or pasted material, the current message, and the tokens the model emits in reply all draw from the same budget. When the running total approaches the limit, something has to give, so a long session may truncate its oldest turns or compact them into a summary that costs fewer tokens than the original. The model attends across whatever sits in the window, but recall across it is not uniform: information near the start and end of a long context is often retrieved more reliably than information in the middle, though the strength of the effect varies by model and task. The window is the working surface for a single turn, and nothing outside it exists for the model until it is read back in.
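The shared budget and the truncation fallback described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas a real tokenizer would produce different numbers, and `truncate_oldest` and `reply_reserve` are names invented here.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    # A real tokenizer would give different counts.
    return len(text.split())


def truncate_oldest(system_prompt, turns, reply_reserve, window):
    """Drop the oldest turns until prompt + turns + reserved reply fit the window."""
    kept = deque(turns)
    # The system prompt and the space reserved for the reply are never dropped.
    fixed = count_tokens(system_prompt) + reply_reserve
    used = fixed + sum(count_tokens(t) for t in kept)
    while kept and used > window:
        used -= count_tokens(kept.popleft())  # the oldest turn goes first
    if used > window:
        raise ValueError("system prompt and reply reserve alone exceed the window")
    return list(kept), used


kept, used = truncate_oldest(
    "you are helpful", ["a b c d", "e f", "g h i"], reply_reserve=2, window=11
)
print(kept, used)  # ['e f', 'g h i'] 10
```

Reserving room for the reply up front reflects the point that output tokens draw from the same budget as input.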
Why it matters
Treating the window as free capacity is the common mistake: filling it to the brim does not make a model better informed, because recall can degrade as the window fills and the signal you need competes with everything else packed alongside it. The trade-off cuts both ways. A window too small for the task forces lossy truncation and the model answers without context it needed; a window stuffed to capacity dilutes attention and the model answers around the context rather than from it. The discipline that matters is not maximizing what fits but curating what belongs, so the tokens in the window are the ones the answer actually depends on.
In practice
A long debugging session accumulates dozens of turns until the running history crowds the window; rather than carry every turn verbatim, the session compacts the older exchanges into a summary and keeps the recent, load-bearing detail at full fidelity. The model continues from the compacted history, so the budget is spent on what the current step depends on instead of on a transcript the step does not need.
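A compaction step of this kind might look like the sketch below. It assumes the session holds its history as a list of strings and that some summarizer is available. `compact_history` and `naive_summary` are hypothetical names, and `naive_summary` is a placeholder that keeps only the first five words of each turn, where a real session would more plausibly ask a model for the summary.

```python
from typing import Callable, List


def compact_history(
    turns: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the last `keep_recent` turns with a single summary entry."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)  # nothing old enough to compact
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    # Older turns collapse into one entry; recent turns stay verbatim.
    return ["[summary] " + summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    # Placeholder summarizer (assumption): first five words of each turn.
    return " / ".join(" ".join(t.split()[:5]) for t in older)


history = [
    "one two three four five six seven",
    "alpha beta",
    "recent a",
    "recent b",
]
print(compact_history(history, keep_recent=2, summarize=naive_summary))
```

Passing the summarizer in as a parameter keeps the lossy step explicit: whatever it discards is gone for later turns unless it is read back in.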
Practical considerations
Context windows vary by model, and a larger maximum window is a different capability axis from how well a model uses the far end of it: the published limit is a ceiling, not a promise of uniform recall up to that ceiling. Within a single request, any reasoning tokens a model produces count against the window alongside input and output, so a turn that asks for deep internal reasoning has less room for everything else than the raw window size suggests. Two common management strategies trade off differently: truncation is cheap and lossy, dropping the oldest tokens outright, while compaction spends a summarization pass to keep a lossy summary of what it drops, paying tokens and latency now to save budget later. The order of material inside the window is itself a lever: tokens an answer depends on tend to be recalled more reliably when placed near the boundaries of a long context than when buried in the center, where attention tends to drop.
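Two of these considerations reduce to small calculations, sketched here under stated assumptions. `input_budget` shows the arithmetic of reserving output and reasoning tokens out of the window, and `order_for_boundaries` shows one possible way to push the most relevant items toward the edges of the context. Both function names are hypothetical, the figures in the example are illustrative and not the limits of any particular model, and the ordering heuristic is one option among several rather than an established recipe.

```python
from typing import List, Sequence


def input_budget(window: int, max_output: int, reasoning_reserve: int) -> int:
    """Tokens left for prompt, history, and material after reserving output and reasoning."""
    remaining = window - max_output - reasoning_reserve
    if remaining < 0:
        raise ValueError("reserves exceed the window")
    return remaining


def order_for_boundaries(ranked: Sequence[str]) -> List[str]:
    """Take items ranked most- to least-relevant and alternate them between the
    front and the back, so the least relevant items end up in the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# Illustrative numbers only (assumption): a 200,000-token window with
# 8,000 tokens reserved for output and 32,000 for reasoning.
print(input_budget(200_000, 8_000, 32_000))             # 160000
print(order_for_boundaries(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

In the second example the top-ranked item `A` opens the context, the second-ranked `B` closes it, and the lowest-ranked `E` sits in the center, where recall tends to be weakest.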
Related standards and prior art
- Anthropic: Context windows · continuously updated · defines the context window as the model's working span and names context rot as the recall degradation that occurs as the window fills
- Chroma: Context Rot · 2025-07-14 · (origin of the context-rot framing) independent multi-model study measuring how recall degrades as input length grows
- Lost in the Middle (arXiv 2307.03172) · 2023-07-06 · (seminal prior art) study showing models attend to the start and end of a long context more reliably than the middle
Defined by Ready Solutions AI