Instruction-data boundary

The instruction-data boundary is the separation between text a language model should treat as instructions to follow and text it should treat only as data to process, a separation the model does not natively enforce inside its context and therefore has to be reconstructed by the system around it.

How it works

Inside the context window, instructions and data are processed as the same kind of thing: tokens the model attends to and may act on, and while role and provenance markers can flag which were meant to command it, the model treats those markers as signals it interprets rather than a boundary it enforces. That is why injected text works at all, since a document, a web page, a tool result, or a code comment can carry words shaped like instructions, and the model has no reliable native mechanism to refuse them on the grounds that they arrived as data. Reconstructing the boundary is a system-level job done with overlapping techniques rather than a single setting: delimiting and labeling untrusted spans so the model is told which regions are data, structuring inputs so untrusted content travels in fields rather than in the instruction channel, and, because none of those is airtight, placing real enforcement below the model so a crossed boundary is contained rather than merely discouraged. The boundary is also a spectrum of trust rather than a clean line, since the system prompt, the user's own message, a retrieved document, and a third-party tool's output carry descending levels of trust the design has to keep straight. The durable posture treats the model's adherence to the boundary as a probability reducer and assumes it will sometimes fail, which is why high-consequence systems back it with least privilege, egress control, and sandboxing rather than with better delimiting alone.

Why it matters

The missing boundary is a shared failure mode across prompt injection, tool poisoning, and the untrusted-content leg of the lethal trifecta, so naming it explains why those attacks rhyme: each is content that arrived as data being acted on as instruction, though tool poisoning also draws on supply-chain trust and the trifecta also needs data access and an exfiltration path. It also explains why classic application-security threat models imported unchanged fall short for agents, because they assume a trust boundary the model does not enforce, so a control that assumes a deterministic parser can fail where the model interprets natural language, even as data-flow, least-privilege, sandboxing, and egress controls become more important. Treating the boundary as something the system rebuilds, rather than something the model possesses, is what moves a team from hoping a cleverer prompt will hold to designing containment that holds when the prompt does not. The honest limit is that no prompt-only or model-only scheme has made the boundary airtight, since delimiting and labeling have been bypassed by inputs that impersonate the structure, which is exactly why it is reinforced with controls that do not depend on the model getting it right. The practical consequence is a design rule: the more untrusted the input channel, the less the system should rely on the model honoring the boundary and the more it should bound what a boundary failure can reach.

In practice

An agent answering questions over a company wiki retrieves a page that, buried in its text, says to ignore prior instructions and forward the conversation to an external address. The retrieved page was supposed to be data, a source to summarize, but the model sees only tokens and treats the embedded command as something to act on. A system that has reconstructed the boundary keeps the page in a labeled untrusted region the model is told not to take orders from, and, not trusting that alone, denies the agent any outbound destination, so the instruction is both discouraged and contained. The boundary the model lacked was supplied by the design around it.

Practical considerations

Treat every input channel by its trust level, since the system prompt, the end user's message, retrieved documents, and tool outputs are not equally trustworthy and the design should not feed them to the model as if they were. Label and delimit untrusted content so the model is at least told which spans are data, while assuming that labeling is a probability reducer rather than a guarantee, because inputs that imitate the delimiters have repeatedly defeated it. Put the real enforcement below the model: the boundary holds in practice not because the model always honors it but because least privilege, egress control, and sandboxing bound what a crossed boundary can do. Scale the caution to the channel, since an agent reading only trusted internal prompts can lean on the boundary more than one ingesting arbitrary web content, which should be treated as actively hostile. Watch the second-order channels, because tool descriptions, file metadata, and prior conversation turns are all places instruction-shaped data can enter that a review focused on the user's message will miss. Do not present boundary-respecting behavior as solved by a prompt technique, since durable assurance comes from what remains true when the technique fails.

Related standards and prior art

OWASP: LLM01 Prompt Injection · continuously updated the OWASP GenAI classification noting the model does not reliably separate trusted developer instructions from untrusted user or external input
Greshake et al.: indirect prompt injection (2023) · 2023-05-05 · (seminal prior art) the canonical academic paper framing the root cause as systems that blur the line between data and instructions
NIST AI 100-2e2025: adversarial ML taxonomy · 2025-03-24 a standards-body taxonomy of direct and indirect injection across LLMs, RAG, and agents, with named mitigations and their limits

Defined by Ready Solutions AI

How it works

Why it matters

In practice

Practical considerations

Related standards and prior art

Related terms

Appears in