What makes an agent a security boundary you did not have before?

A model that only emits text is a contained thing. The worst it can do is be wrong on screen. The moment you give that model tools, the ability to read a file, call an API, open a URL, run a command, you have built something categorically different: a system that converts language into action, and that takes some of its language from sources you do not control.

That's the whole security story in one sentence. A tool-using agent reads untrusted content (a web page, a pull-request comment, a returned API payload, a tool's own description) and then decides what to do next. The model has no reliable way to treat that content as inert data rather than as a fresh instruction. When attacker-controlled text reaches the reasoning loop, the agent can be steered, and unlike a chatbot, a steered agent can act. The classic name for this shape is the confused deputy: a privileged party (your agent, holding your credentials) is induced by a less-privileged party (whoever wrote the web page) to misuse its authority.

This guide is about that adversarial failure mode and how to contain it. It is deliberately not two adjacent things. It is not agent reliability, which asks whether the agent's output is correct on its own merits. And it is not agentic AI governance, which asks who owns the verification bar a shipped artifact must clear and puts deny-by-default gates in the path so a policy is enforced rather than merely written. Reliability contains the agent's honest mistakes, and governance contains its worst unsupervised day. Security is the one that assumes an adversary, not the agent's own fallibility, has turned its granted capabilities into a weapon. The three share primitives, and I'll cross-link where they meet, but the question each answers is different.

The organizing idea for the rest of this page is a pattern Simon Willison named the lethal trifecta: an agent becomes a reliable data-theft tool when it has access to private data, exposure to untrusted content, and a way to communicate externally, all at once. Any two legs are survivable. All three together hand an attacker a path to read your secrets and send them out, using your agent as the courier. I use the trifecta as the lens across every surface below, because it turns an unbounded worry ("the model might get tricked") into an actionable one ("which leg can I remove for this agent").

Why isn't this just application security?

The strongest objection to everything that follows is that none of it is new. Confused deputies, least privilege, input validation, supply-chain hygiene: software security has known these for decades. A skeptical engineer can fairly say that an agent which cannot read a credentials file and cannot make an outbound request is not a novel threat class, it is an under-applied old one, and the right response is to scope capabilities the way we already scope every other process.

That objection is half right, and the half it gets right is the half this guide is built on. Capability scoping is exactly the move. But it misses why the model itself is a new kind of component, and the UK National Cyber Security Centre states the reason more precisely than I can paraphrase:

"Under the hood of an LLM, there's no distinction made between 'data' or 'instructions'; there is only ever 'next token'."
Dave Chismon, NCSC, "Prompt injection is not SQL injection (it may be worse)"

Read that against how SQL injection was actually solved. Parameterized queries fixed SQL injection by separating the instruction channel from the data channel at the protocol level: the database is told, structurally, that this part is a command and that part is a value, and no amount of cleverness in the value can promote it to a command. The equivalent boundary does not exist inside a language model. There is one stream of tokens, and the model predicts the next one. So the appsec instinct to "sanitize the input" runs into a wall: there is no parser stage where you can mark the untrusted span as data, because the model does not consume a parsed grammar, it consumes text. The NCSC's blunt conclusion is that prompt injection may never be fully mitigated the way SQL injection was.
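The contrast can be made concrete with a minimal sketch using Python's standard `sqlite3` module. The table, the hostile string, and the `build_prompt` helper are illustrative assumptions, not anything from a real system. The point is that the database offers a placeholder that keeps a value in the data channel, and prompt assembly offers nothing comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Illustrative attacker-controlled value.
hostile = "nobody' OR '1'='1"

# Vulnerable shape: the value is spliced into the command text,
# so part of it is parsed as SQL and the WHERE clause matches every row.
spliced = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Parameterized shape: the placeholder keeps the value in the data channel,
# so it is compared as a literal string and matches nothing.
bound = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()


def build_prompt(system_rules: str, retrieved_text: str) -> str:
    """Hypothetical prompt assembly: there is no placeholder to bind to.

    Both arguments end up in one string, which the model consumes
    as a single token stream.
    """
    return f"{system_rules}\n\nRetrieved content:\n{retrieved_text}"
```

The spliced query returns every user and the bound query returns none. `build_prompt` has only the first shape available to it, whatever delimiters are placed around the retrieved text.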

So the honest position is the one that makes this discipline worth a guide. You can't count on fixing the model's susceptibility to being talked into things. What you can do, with mature and well-understood tools, is bound what a talked-into agent is able to do. The appsec skeptic is right that the toolbox is old. The agentic twist is that you are now applying it to a component whose input channel cannot be cleaned, which means scoping the downstream capability stops being good hygiene and becomes the primary control. That is the difference between managing the susceptibility, which you cannot do reliably by prompting or filtering, and managing the capability, which you can. The published threat catalogs agree on where to aim: the OWASP Top 10 for LLM Applications ranks prompt injection as LLM01, its number-one risk, and the OWASP Top 10 for Agentic Applications, finalized in December 2025, names agent goal hijacking, the fusion of injection with too much autonomy, as the top agentic risk.

Prompt injection: the agent does what the page told it to

Prompt injection comes in two shapes, and the second is the one that should worry a production team. Direct injection is a user typing adversarial instructions into the agent. Indirect injection is the dangerous one: an attacker plants instructions in content the agent will later retrieve, a document, a code comment, an issue thread, a returned search result, and the agent reads those instructions as if they came from you. The user never sees it. The agent just quietly starts following someone else's plan.

The reason this is not hypothetical is that the full exploit has already shipped in production tools. Abi Raghuram at CodeIntegrity documented a complete lethal-trifecta exploit against a Claude-powered Notion agent in September 2025. Hidden white-on-white text inside a PDF instructed the agent, which had both private-workspace access and a web-search tool, to encode sensitive client data into a URL and fetch it. The data left through a perfectly ordinary tool call, because every leg of the trifecta was present and the injection had somewhere to go. The systematic picture is just as sobering. A January 2026 meta-analysis of prompt-injection attacks on agentic coding assistants covers Claude Code, GitHub Copilot, and Cursor across 78 studies. It reports that attack success against state-of-the-art defenses exceeds 85 percent when adaptive strategies are used. Of the 18 defense mechanisms it analyzes, most achieve under 50 percent mitigation, and it counts 30 or more CVEs filed against major coding assistants. I cite that 85 percent figure as what it is, a meta-analytic upper bound aggregated across many papers rather than one controlled experiment, but the direction is unambiguous: detection-style defenses do not hold against an attacker who adapts.

If filtering does not hold, what does? The mitigation shapes that work are the ones that assume the injection lands and constrain the aftermath. Treat every tool output as untrusted by default, the same posture you would take toward any input crossing a trust boundary. Scope the agent's capabilities so the trifecta cannot complete: an agent exposed to untrusted content should not also hold both private-data access and an open egress path. Put a human in the loop on the side-effecting calls, the ones that send, delete, deploy, or pay. And where the stakes justify the engineering, separate the privileged reasoning from the untrusted content entirely. The CaMeL design from Google, Google DeepMind, and ETH Zurich is the rigorous version of that last idea: a privileged model plans and a quarantined model handles untrusted content, with data flows enforced at the tool-call layer so the untrusted side can never drive a privileged action. It isn't free. The published benchmark shows roughly 77 percent task completion with the architectural guarantees against about 84 percent without them, a real cost for provable containment. But it demonstrates that the problem yields to architecture even though it doesn't yield to filtering. The blast radius also scales with fan-out: as I noted in the post on sub-agent orchestration patterns, every additional agent that ingests untrusted content is another independent place the trifecta can complete, which is why the security constraint, not the workload shape, should decide whether you spawn more agents.
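To illustrate "data flows enforced at the tool-call layer", here is a deliberately reduced sketch of taint tracking. It is not the CaMeL implementation. The `Value` wrapper, the tool names, and the `call_tool` dispatcher are all hypothetical, and a real system would have to track provenance through every transformation the agent performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A piece of data plus a flag recording whether untrusted content produced it."""
    data: str
    tainted: bool = False


def from_tool_output(text: str) -> Value:
    # Everything a tool returns is treated as untrusted by default.
    return Value(text, tainted=True)


def combine(*parts: Value) -> Value:
    # Taint propagates: one untrusted input taints the whole result.
    return Value(
        "".join(p.data for p in parts),
        tainted=any(p.tainted for p in parts),
    )


# Hypothetical names for tools that send, delete, or otherwise act on the world.
SIDE_EFFECTING = frozenset({"send_email", "http_post", "delete_file"})


class TaintedArgumentError(Exception):
    """Raised when untrusted data would drive a side-effecting call."""


def call_tool(name: str, **args: Value) -> str:
    # The check runs outside the model: the model proposes, this layer decides.
    if name in SIDE_EFFECTING:
        bad = sorted(k for k, v in args.items() if v.tainted)
        if bad:
            raise TaintedArgumentError(f"{name}: untrusted data in {bad}")
    return f"{name} executed"
```

Under these assumptions, summarizing a fetched page still works, while a `send_email` call whose recipient was derived from that page is refused before it runs. The model can still be steered; the steered call just has nowhere to go.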

The tool supply chain: your agent trusts metadata you did not write

There is a second untrusted-input channel that appsec intuition tends to miss entirely, because it does not look like input at all. When an agent connects to a tool, especially through the Model Context Protocol, the tool's own description is loaded into the model's context as trusted text. The model reads "this tool adds two numbers" and believes it. Invariant Labs named the resulting attack class, tool poisoning, in April 2025: hidden instructions embedded in a tool's description field, invisible to the user reviewing the tool list, that tell the model to read an SSH key or redirect an email while returning an innocent-looking result. The tool description is part of the prompt, and you did not write it.

That single insight branches into the supply-chain failure modes worth naming. A server can change its behavior after you approved it, a "rug pull," because approval happened once and the description can mutate later. One connected server can shadow or intercept calls meant for another. And the credential surface underneath all of it is weak in ways the tool list never reveals. An October 2025 study of more than five thousand open-source MCP servers found that 53 percent rely on long-lived static secrets and only 8.5 percent use OAuth, which leaves a poisoned or compromised server one step from a durable credential. The empirical poisoning rates are not reassuring either. The MCPTox benchmark, run against 45 real-world MCP servers across 20 models, found tool-poisoning attack success rates of roughly 36 percent on average and up to 72.8 percent at the peak on o1-mini, with safety-refusal rates under 3 percent. The layer can also carry classic remote-code-execution bugs: JFrog disclosed CVE-2025-6514, a critical (CVSS 9.6) command-injection flaw in the widely used mcp-remote proxy, where a malicious server could execute commands on the client during the OAuth flow.

I treat this surface as its own threat-model lane rather than re-deriving it here, because the deployment-level detail, the protocol's ownership inversion, what the spec leaves to you, and the specific supply-chain CVEs, already lives in the MCP servers in production cornerstone and the MCP server audit playbook. The threat-model framing that belongs here is the trust decision, not the wiring. The mitigation shapes are the supply-chain ones you already know, re-pointed at a privileged dependency class: vet a server before adoption the way you would read a new package, pin versions, prefer first-party servers, scope the credentials the server can reach, and require human approval when a tool's definition changes. The one agentic-specific rule on top is the trifecta rule again: do not connect an untrusted server to an agent that also holds private data and an egress path, and treat "which servers may share an agent's context" as a first-class design decision rather than a default. If your instinct is that this is only a community-server problem, the disclosed CVEs, the mcp-remote RCE among them, have already reached registry-published packages that wire MCP into mainstream tools, so "first-party only" narrows the surface without removing it. The discipline I run on a long-lived MCP integration against an enterprise Jira deployment is exactly this: I read a server's requested scopes before wiring it, and I re-audit the tool catalog on a fixed cadence, because approval is a moment and trust is a calendar.
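Requiring approval when a tool's definition changes presupposes a way to notice the change. One possible shape, sketched below, is to fingerprint each approved definition and compare on every connection. The dictionary layout of a tool and the function names are assumptions made for illustration, not any client's real interface.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable SHA-256 over a canonical JSON form of one tool definition."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pin_catalog(tools: list[dict]) -> dict[str, str]:
    """Record what was approved: tool name -> fingerprint at approval time."""
    return {t["name"]: fingerprint(t) for t in tools}


def changed_tools(pinned: dict[str, str], current: list[dict]) -> list[str]:
    """Names that are new, removed, or differ from the approved fingerprint."""
    now = pin_catalog(current)
    return sorted(
        name
        for name in pinned.keys() | now.keys()
        if pinned.get(name) != now.get(name)
    )
```

Any non-empty result from `changed_tools` would be the trigger for a human review before the agent sees the new catalog. The comparison catches a mutated description; it says nothing about whether the originally approved description was honest.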

Containment: bound what a hijacked agent can reach

If you accept that the model's susceptibility cannot be eliminated, containment is where the actual security lives. The goal is narrow and achievable: make the agent's worst case bounded even when the injection works. Five shapes carry most of the weight.

Least privilege, evaluated deny-first, is the load-bearing one. An agent should hold the minimum set of tools, permissions, and reachable systems its task requires, and the authorization for that should sit in a layer outside the model, where a deny rule wins and the model gets no vote. This is the same primitive agentic AI governance uses to enforce policy; here it is doing security work, shrinking the trifecta. A Claude Code hook that returns a deny decision is one concrete form of that boundary, and the hooks cornerstone covers the gate mechanism itself; my point in this guide is upstream of the mechanism, which is that the boundary has to be a deterministic validator in the execution path and not a sentence in an instruction file the model may or may not honor, a distinction I worked through in the decision tree for where rules belong.
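As a sketch of what "a deny rule wins and the model gets no vote" can mean in code, the evaluator below collects every matching rule, lets any deny override any allow, and denies when nothing matches. The `Rule` shape and the glob patterns are hypothetical, and this is not the Claude Code hook interface, only the decision logic such a boundary might apply.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    effect: str  # "allow" or "deny"
    tool: str    # glob over the tool name
    target: str  # glob over the resource the call touches


def decide(rules: list[Rule], tool: str, target: str) -> str:
    """Deny-first evaluation: rule order never matters."""
    matching = [
        r for r in rules
        if fnmatchcase(tool, r.tool) and fnmatchcase(target, r.target)
    ]
    if any(r.effect == "deny" for r in matching):
        return "deny"   # a single deny beats any number of allows
    if any(r.effect == "allow" for r in matching):
        return "allow"
    return "deny"       # nothing matched: deny by default


# Hypothetical policy for illustration.
RULES = [
    Rule("allow", "read_file", "/repo/*"),
    Rule("deny", "*", "*.env"),
]
```

With this policy, reading `/repo/src/main.py` is allowed, reading `/repo/.env` is denied even though the allow rule also matches, and an unlisted tool is denied outright.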

Sandboxed execution is the second shape, and it is the one most likely to be over-trusted. A sandbox bounds what the agent's code can touch: filesystem, network egress, host resources. NVIDIA's practical guidance on sandboxing agentic workflows is direct that ordinary containers share the host kernel and are weaker isolation than they appear, and that microVMs or user-space kernels plus blocked egress and scoped secret injection are the stronger shapes. But here is where my conclusion diverges from the one the governance cornerstone draws from the same evidence. The SandboxEscapeBench results show that a frontier model (Claude Opus 4.5, in the tested configuration) escaped at an overall rate of roughly 0.49 across the benchmark. Success was highest on the easiest misconfigured-container cases and fell to zero only on the hardest scenarios (difficulty 4 and 5), with escape capability rising as the compute budget grew. The governance reading of that fact is "gates are fallible, so layer them," and that is correct. The security reading is sharper and more uncomfortable: do not let the sandbox be the thing you are trusting. Where governance stacks another gate, security removes the asset the gate was protecting. Containment is necessary and leaky at the same time, so the design move is to assume the sandbox can be breached and remove a trifecta leg anyway, so that even a fully escaped agent reaches nothing of value and cannot phone home. The sandbox buys you margin; it does not buy you the right to skip capability scoping.
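Blocked egress is ultimately a network-layer control that lives outside the agent's process, but the decision it encodes is simple enough to sketch. The check below is a minimal illustration under assumed names: the allowlisted host is hypothetical, and the function only shows exact-host matching, not a complete egress proxy.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only destination this agent's task needs.
ALLOWED_HOSTS = frozenset({"api.internal.example"})


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny egress decision based on scheme and exact hostname."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # hostname strips userinfo and port, so "allowed@evil.test" resolves to evil.test.
    host = (parts.hostname or "").lower()
    # Exact match only: suffix or substring checks let look-alike hosts through.
    return host in allowed_hosts
```

Matching the parsed hostname exactly is the design choice that matters here. Substring or prefix checks are the usual way an allowlist ends up permitting `api.internal.example.evil.test`.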

Scoped, short-lived credentials are the third shape, and they shrink the trifecta's data-access and egress legs at once. The cleanest version I run in CI uses short-lived, audience-bound tokens minted per job rather than long-lived secrets sitting in the environment; GitHub's security hardening guidance for Actions describes the read-only-by-default token posture and the OIDC pattern that makes this practical. A credential that expires at the end of the job, scoped to exactly one resource, is worth far less to an attacker who briefly steers the agent than a static key with broad reach. The containment checklist for CI agents is the operational form of this for the pipeline case, including why an external gate the agent cannot weaken, backed by a named human, is what actually holds when injection turns the agent's permissions against you.
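The properties that make such a credential cheap to lose are expiry, audience, and scope. The sketch below models only those three checks with hypothetical names. It is not the GitHub OIDC flow and omits signing entirely; in a real deployment the issuer signs the token and the resource server performs the validation.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobToken:
    value: str
    audience: str      # the one service this token is valid for
    scope: str         # the one permission it carries
    expires_at: float  # Unix timestamp


def mint(audience: str, scope: str, ttl_seconds: float = 300.0,
         now: Optional[float] = None) -> JobToken:
    """Mint a per-job token; `now` is injectable so the sketch is testable."""
    issued = time.time() if now is None else now
    return JobToken(secrets.token_urlsafe(32), audience, scope, issued + ttl_seconds)


def authorize(token: JobToken, audience: str, scope: str,
              now: Optional[float] = None) -> bool:
    """All three checks must pass: not expired, right audience, exact scope."""
    current = time.time() if now is None else now
    return (
        current < token.expires_at
        and token.audience == audience
        and token.scope == scope
    )
```

A token minted for `read:build-123` on one service fails against any other service, any other scope, and any request after its lifetime ends, which is what shrinks its value to whoever steers the agent.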

The fourth shape is the human gate on irreversible actions. Not every action needs review, and gating everything collapses the throughput that made agents worth adopting. The discipline is to classify by reversibility and blast radius. The OWASP AI Agent Security Cheat Sheet maps actions onto risk tiers and reserves explicit approval for the high-impact and irreversible ones. Put a person in the loop on the calls that send, spend, delete, or deploy, and on the quieter ones an attacker wants just as much: changing a scope, a tool definition, or what gets logged. Keep a human on that class precisely because it is the one an attacker most wants.
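Classifying by reversibility and blast radius can be reduced to a small predicate. The sketch below is one possible encoding with hypothetical field names. It does not reproduce the OWASP cheat sheet's tiers, and the two-level `Blast` scale is a simplification chosen to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Blast(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast: Blast
    # True for the quiet class: scopes, tool definitions, logging settings.
    changes_controls: bool = False


def needs_human(action: Action) -> bool:
    """Gate irreversible, high-blast-radius, and control-changing actions."""
    if action.changes_controls:
        return True
    return (not action.reversible) or action.blast is Blast.HIGH
```

Under this rule, formatting a file in a working branch runs unattended, while sending a payment, deploying to production, or turning off audit logging all wait for a person.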

The fifth shape is detection and response, and it is the one the other four quietly assume. Every containment layer can fail in a way you did not predict, so a production agent needs structured security telemetry the agent cannot reach or disable: a tamper-resistant, out-of-band record of prompts, tool calls and their parameters, policy decisions, approvals, memory writes, and egress attempts. On top of that record go the responses: an anomaly or data-loss alert, a cost-and-tool-call-rate monitor that trips on a runaway loop, a kill switch, and a way to revoke a credential and quarantine poisoned memory after the fact. Prevention and containment decide what an agent can do; detection is how you find out when one of those controls silently stopped doing its job.
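One common way to make such a record tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is a minimal in-memory illustration with hypothetical event fields. It detects edits to recorded entries, but it does not detect truncation from the tail unless the latest hash is also anchored somewhere else, and it does nothing unless the log itself is stored where the agent cannot write.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list[dict], event: dict) -> None:
    """Append an event whose hash covers both the event and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Rewriting an earlier tool-call record changes its digest, so `verify` fails from that entry onward. That is what lets a responder trust the record when deciding which credential to revoke.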

A note from running this myself, because it is the lesson I trust most. On this site's own infrastructure I shipped a deny-gate that was supposed to block a class of command, with passing unit tests, and it was bypassed silently across several sessions because a higher-precedence allow rule short-circuited the gate's invocation entirely. The gate never ran. The fix was structural (a tracked deny rule that forced the gate to fire), but the durable lesson is the one this whole section turns on: a containment layer you have not verified end-to-end, under realistic conditions, is not a containment layer you have. The same truth shows up in the literature as "the sandbox can be escaped" and "a tool annotation is a hint a server can lie about." Verify the gate fires before you rely on it.
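The failure shape described above can be reproduced in a few lines. This is a hypothetical reconstruction, not the actual configuration: the gate, the allow prefix, and the runner functions are invented for illustration. It shows why a passing unit test on the gate says nothing about whether the gate is ever invoked.

```python
calls: list[str] = []  # records every command the gate actually inspected

# Hypothetical allow rule meant to permit cleaning a build directory.
ALLOW_PREFIXES = ("rm -rf ./build",)


def deny_gate(command: str) -> bool:
    """Return True when the command must be blocked (here: path traversal)."""
    calls.append(command)
    return ".." in command


def run_first_match(command: str, allow_prefixes: tuple = ALLOW_PREFIXES) -> str:
    # Buggy shape: an allow match returns before the gate is consulted.
    if any(command.startswith(p) for p in allow_prefixes):
        return "allowed"
    return "denied" if deny_gate(command) else "allowed"


def run_gate_first(command: str, allow_prefixes: tuple = ALLOW_PREFIXES) -> str:
    # Fixed shape: the gate always fires, and only then is the allowlist read.
    if deny_gate(command):
        return "denied"
    return "allowed" if any(command.startswith(p) for p in allow_prefixes) else "denied"
```

An end-to-end check here asserts two things: that `rm -rf ./build/../..` comes back denied, and that `calls` shows the gate saw it. Under `run_first_match` both assertions fail, even though `deny_gate` on its own behaves correctly.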

How do you threat-model an agent before you ship it?

The surfaces above turn into a short, repeatable pass you can run on any agent before it goes near production. This is ordinary threat modeling with one agent-specific lens added: the trifecta tells you which assets and paths carry the exfiltration risk, so the familiar questions land on the legs that matter rather than on a generic asset inventory. It is not a checklist to file; it is five questions that decide the design.

Start by naming the legs. For this specific agent: what private data can it reach, what untrusted content can enter its context, and what egress does it have? If all three are present, you are building a data-theft primitive and you should stop and remove a leg before anything else.

Then scope each leg to the minimum. Narrow the data the agent can read, narrow the servers and content sources it ingests, and narrow the egress to the specific destinations the task needs. Every narrowing is a partial fix even if you cannot remove a leg entirely.

Gate the side effects next. Identify the actions that are irreversible or high-blast-radius and require a human or a deterministic check on those, while leaving the reversible work to run unattended.

Contain the execution. Sandbox the code-running surface, block egress by default, and inject only the credentials the current task needs, scoped and short-lived.

Finally, assume breach. Pick the layer you are most tempted to trust, the sandbox, the filter, the permission rule, design as if it fails, and instrument so you would notice when it does, because the evidence says each one sometimes does. Defense in depth is not a hedge against this method; it is the method.
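The first question, naming the legs, lends itself to a mechanical check that can run in review or CI. The sketch below assumes a hypothetical `AgentProfile` filled in by whoever designs the agent. It only reports which legs are present; deciding which one to remove stays a design decision.

```python
from dataclasses import dataclass, field


@dataclass
class AgentProfile:
    name: str
    private_data: set[str] = field(default_factory=set)      # what it can read
    untrusted_inputs: set[str] = field(default_factory=set)  # what can enter its context
    egress: set[str] = field(default_factory=set)            # where it can send


def legs_present(agent: AgentProfile) -> list[str]:
    """Names of the trifecta legs this agent currently holds."""
    legs = {
        "private_data": agent.private_data,
        "untrusted_inputs": agent.untrusted_inputs,
        "egress": agent.egress,
    }
    return [name for name, items in legs.items() if items]


def trifecta_complete(agent: AgentProfile) -> bool:
    """True when all three legs are present and a leg should be removed."""
    return len(legs_present(agent)) == 3
```

For example, a support agent that reads a CRM, ingests inbound email, and can issue web requests reports all three legs. Dropping the web-request tool leaves two, which is the survivable configuration.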

The OWASP Top 10 for Agentic Applications is the most complete external checklist to run alongside these questions, since each of its named risks pairs with a structural mitigation. The table below is the compression I use in practice: the threat surface, the containment shape that addresses it, and the honest note on what that shape does not cover.

| Threat surface | Primary containment shape | What it does not cover |
| --- | --- | --- |
| Indirect prompt injection | Treat tool output as untrusted; remove a trifecta leg | Does not stop the model from being steered; it removes what steering can accomplish |
| Tool / metadata poisoning | Vet and pin servers; scope credentials; gate tool-definition changes | Does not catch a first-party server that is later compromised upstream |
| Excessive capability | Least-privilege scopes, deny-first, outside the model | Only as good as the scope design and the channel coverage |
| Untrusted code execution | Sandbox with blocked egress and scoped secrets | Sandboxes are escapable under misconfiguration; pair with leg removal |
| Irreversible side effects | Human-in-the-loop gate by reversibility and blast radius | Adds latency; gate only the high-stakes class or it gets rubber-stamped |
| Undetected control failure | Out-of-band security telemetry + anomaly, egress, and cost alerts + kill switch | Logging the agent can read or disable is not a control; the record must sit beyond its reach |

The method doesn't promise an un-hijackable agent. It promises a hijacked agent with nothing worth taking through the obvious channels and nowhere easy to send it, which is the achievable core of secure.

One honest limit: the trifecta and these five questions are built around data exfiltration, the most common and best-studied agentic attack, and they are not the whole threat model. Integrity attacks (poisoned memory, a tainted retrieval corpus, a model upgrade that regresses), availability and cost attacks (denial-of-wallet through a runaway tool loop), agent identity and multi-agent trust, and insecure handling of the agent's output by a downstream system are real classes the trifecta does not capture. The OWASP Top 10 for Agentic Applications is the checklist for the full set; this guide owns the exfiltration-and-containment core that every one of those classes still has to stand on.

How is this page kept current?

This cornerstone has the role: cornerstone posture rather than role: freshness-reference, so the build does not hard-fail it on a fixed window; the cadence is editorial. I revisit it when a major threat catalog ships a revision (the OWASP agentic lists are the ones I track), when a new named attack class is demonstrated against production agents, or when a containment shape, sandboxing, credential scoping, or gate design, materially changes. The Sources roster below carries a publication date per anchor under this site's caps, three months for AI and tool statistics and six months for tool-capability claims; many of the anchors here are point-in-time incident disclosures or foundational framings, and a row past its cap is held only with a documented search trail showing nothing fresher qualified.

This page is the threat-model-and-containment companion to the prompt injection, agent sandboxing, and tool use glossary entries, and a peer of the MCP servers, Claude Code hooks, agentic AI governance, and agent reliability cornerstones. Governance owns who holds the bar; reliability owns whether the output is right; this page owns what an adversary can make the agent do, and how you bound it. The internal evidence anchor is a first-party operational record of the deny-by-default control plane this repository runs, including the bypassed-gate finding above, kept next to the body and cited as experience rather than as an externally-auditable metric.

Designing that containment layer, the trifecta analysis, the least-privilege scopes, the sandbox posture, the scoped credentials, and the human gates, is what a consultation, workshop, or implementation engagement around agentic development is for. The agents are worth adopting; the containment is what decides whether adopting them is an asset or an unsupervised liability.