Prompt injection

Prompt injection is an attack in which attacker-controlled instructions, supplied directly by a user or embedded in content the model reads, exploit a model's inability to reliably separate trusted instructions from untrusted data, so that a user message, retrieved text, or a tool result can redirect the model's behavior.

How it works

A language model takes its developer instructions, the conversation, and any external material it reads into one working context, and even when an interface separates system, developer, user, and tool messages into roles, the model still interprets all of their text together, so those role labels do not reliably make it treat untrusted content as data rather than as instructions. Prompt injection abuses that: an attacker places instructions where the model will read them and follow them as if they came from the developer. In the direct form, the instructions arrive in the user's own message, the familiar attempt to override a system prompt. In the indirect form, which matters more for agents, the instructions are planted in content the model retrieves or a tool returns, a web page, a document, an email, a code comment, so the model is subverted by data it was asked to process rather than by anything the user typed. Because the model decides what to do by reading context rather than by enforcing a boundary, no amount of instructing it to ignore injected commands closes the hole reliably. The defenses that help are structural: filtering inputs and outputs, isolating and labeling untrusted content, granting least privilege, and requiring human approval before a consequential action.

Why it matters

Prompt injection matters most the moment a model stops only talking and starts acting, because an injected instruction that reaches an agent with tools can read data it should not, call a tool it should not, or take an irreversible action on the attacker's behalf. The reason it is hard is not a missing patch but the architecture: the same property that makes a model useful, following instructions written in natural language, is the property the attack exploits, so it cannot be fully eliminated the way a memory-safety bug can. That shifts the goal from prevention alone to prevention plus containment: reduce the chance an injection succeeds, and assume one sometimes will, so its blast radius is small by design. A model given broad tool access and untrusted input is a different risk surface than one that only emits text, which is why least privilege, isolation, and a human gate on consequential actions do more than any instruction to the model. The honest limit is that filtering and detection reduce the rate but do not reach zero, so a system that depends on catching every injection is one that has not planned for the one it misses.

In practice

An agent is asked to summarize a web page, and the page contains a line of hidden text instructing it to ignore its task and instead send the contents of a private file to an external address. A model that treats the page as pure data is fine; a model that reads the planted line as an instruction and has both file access and a way to make the request is not. The defense is not a cleverer instruction telling the agent to be careful, but the structure around it: the agent runs without the credentials or the network path the injected command would need, so even a successful injection reaches nothing worth reaching. What contains the attack is the boundary the agent operates inside, not the agent's judgment about the text it read.

Practical considerations

Indirect injection is the form that surprises teams, because the threat arrives through trusted-looking channels, a document, a search result, a tool response, rather than as obviously hostile user input. The defenses layer rather than substitute: input and output filtering catches some known patterns, isolating untrusted content limits its reach, least-privilege scoping shrinks what a compromised step can touch, and a human approval gate stops an irreversible action before it lands. None of these is sufficient alone, and an over-aggressive content filter blocks legitimate work while still missing novel phrasings, so the durable posture treats the model as an untrusted component and gates what it can do rather than trying to perfect what it reads. Agents that combine access to private data, exposure to untrusted content, and a way to communicate outward are the highest-risk shape, because those three together are what turn an injection into data exfiltration. Red-teaming with adversarial inputs belongs in the same test suite as functional tests, since an injection path that is not tested is one discovered in production. The defenses also cost something, latency on every filtered call and friction on every approval gate, so they are matched to the reversibility and reach of the action rather than applied uniformly.

Related standards and prior art

OWASP: LLM01 Prompt Injection · continuously updated the OWASP GenAI classification defining direct and indirect prompt injection and the named defense categories (filtering, content segregation, least privilege, human-in-the-loop)
Greshake et al.: indirect prompt injection (2023) · 2023-05-05 the canonical academic paper on indirect injection, framing the root cause as systems that blur the line between data and instructions
NIST AI 100-2e2025: adversarial ML taxonomy · 2025-03-24 a standards-body taxonomy of direct and indirect injection across LLMs, RAG, and agents, with named mitigations and their limits

Defined by Ready Solutions AI

How it works

Why it matters

In practice

Practical considerations

Related standards and prior art

Related terms

Appears in