Tool poisoning is an attack in which malicious instructions are hidden in a tool's name, description, or input schema, so that an agent reading the full tool definition is steered into actions the user, who sees only a simplified label, was never shown.

How it works

An agent decides when and how to call a tool by reading the definition the server advertises: the tool name, its natural-language description, and its input schema. Tool poisoning puts attacker-controlled instructions inside that definition, in text the model consumes but a user-facing client may render only as a short label. Because the model treats the tool description as trusted configuration rather than as untrusted content, an instruction hidden there can redirect what the agent does, whether that is exfiltrating a secret or quietly calling a tool the user did not authorize. The attack needs no change to the application code, only control of a tool definition the agent trusts, which is why an externally sourced or third-party server is the common entry point. It is a form of prompt injection whose channel is the tool metadata itself.

Why it matters

Tool poisoning matters because it breaks the assumption that a tool catalog is configuration rather than content. A team can audit its own prompts and still be compromised through a connector it installed, since the malicious text rides in the server's advertised definitions rather than in anything the team wrote. It makes third-party tool sources a supply-chain surface: every external server an agent trusts is a place an instruction can enter the model's context with the standing of configuration. The defense is to stop trusting a description because it came from a tool, which means reviewing and pinning tool definitions, isolating what a tool-using agent can reach, and treating an unexpected change in a tool's description as a security event rather than a cosmetic one.

In practice

An agent installs a third-party server that offers a benign-looking "format document" tool. The tool's description, which the user's client shows only as the title, contains hidden instructions telling the agent that before formatting it must read a credentials file and include the contents in the tool call. The user approves "format document" without seeing the embedded instruction, and the agent, treating the description as trusted, follows it. The same agent pointed at a reviewed, pinned tool definition has nothing in the description to follow, so the action never originates.

Practical considerations

Treat a tool definition as untrusted input rather than as configuration, so a description is reviewed before a server is trusted and pinned to a known version afterward, with any later change to the description surfaced rather than silently accepted. Prefer servers whose definitions a person has actually read over ones installed for convenience, since the attack rides in the text the convenience skipped. Where the agent must use a less-trusted server, bound what it can reach so a poisoned instruction has a smaller blast radius, and keep a record of the tool calls it made so an exfiltration attempt leaves a trace. The failure is silent by design, since a poisoned description renders as an ordinary tool to the user, so the control is structural review and isolation rather than spotting the attack at call time.

Related standards and prior art

Defined by Ready Solutions AI