Model refusal

A model refusal is an API outcome in which the model, or a safety classifier wrapping it, declines to produce the requested output. It is often delivered as a successful response carrying an explicit refusal signal rather than as a transport error, so the integration must recognize and handle it as a result; it is not a failure that infrastructure will catch.

How it works

A refusal happens when the model itself declines or when a safety classifier screening the exchange fires; the classifiers are separately trained filters that run at inference time over inputs and outputs, descended from published work on classifier-based jailbreak defenses. On Claude, the response completes as a normal success with a dedicated stop reason of refusal and, when one applies, a typed category naming the domain that fired, such as offensive cyber capability, biology, or attempts to extract the model's raw reasoning. The decline can land before any output exists or mid-stream after tokens have already arrived; on Claude a refusal before output is not billed, while a mid-stream one bills the input and the already-streamed output, which the caller is told to discard. Other vendors express the same outcome through their own signals (a content-filter finish reason or a safety block reason), with semantics that differ in detail, including whether a flagged request surfaces as an error or a success. Nothing retries automatically unless the caller opts into a fallback mechanism. What happens next (surfacing it, rerouting it, or alerting on it) is the integration's decision, not the platform's.
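As a rough illustration of recognizing a success-shaped refusal, the sketch below inspects an already-parsed response represented as a plain dict. The stop reason value refusal comes from the description above; the field names stop_reason, content, and refusal_category, and the RefusalInfo type, are assumptions made for this example and are not confirmed API names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefusalInfo:
    category: Optional[str]  # screened domain, when the provider supplies one
    partial_output: bool     # True when text had already streamed before the decline


def detect_refusal(response: dict) -> Optional[RefusalInfo]:
    """Return RefusalInfo for a success-shaped refusal, or None otherwise.

    Assumes a hypothetical response dict with "stop_reason", a list of
    "content" blocks, and an optional "refusal_category" field.
    """
    if response.get("stop_reason") != "refusal":
        return None
    streamed_text = "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )
    return RefusalInfo(
        category=response.get("refusal_category"),
        # Per the text above, mid-stream output should be discarded, not used.
        partial_output=bool(streamed_text),
    )
```

Separating the two cases with partial_output keeps the billing and recovery difference visible to whatever code decides the next step.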

Why it matters

Making refusal a first-class outcome rewrites a quiet assumption in most integrations: that anything which is not an error is usable output. A stop-reason handler that treats unknown values as a no-op converts every refusal into a silent drop, which is the worst failure mode because nothing pages. The classifiers behind refusals are also imperfect by design: independent research measures models declining legitimate requests, and false positives concentrate on work that merely resembles the screened domains, so a security team's refusal rate has little to do with a launch announcement's fleet-wide average. The honest framing is a trade: the same screening that makes a frontier model deployable adds an outcome class whose handling is on the caller. A refusal rate, measured per workload, is also a routing signal, part of the evidence that decides whether a route belongs on that model at all.
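A per-workload refusal rate can be computed from whatever request log the integration already keeps. The following is a minimal sketch, assuming each logged event is a (workload, stop_reason) pair; the workload names and the event shape are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def refusal_rate_by_workload(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the share of responses per workload whose stop reason is 'refusal'."""
    totals: Counter = Counter()
    refused: Counter = Counter()
    for workload, stop_reason in events:
        totals[workload] += 1
        if stop_reason == "refusal":
            refused[workload] += 1
    return {workload: refused[workload] / totals[workload] for workload in totals}


# Hypothetical log: the security workload refuses far more often than the average.
events = [
    ("security-review", "refusal"),
    ("security-review", "end_turn"),
    ("marketing-copy", "end_turn"),
    ("marketing-copy", "end_turn"),
]
rates = refusal_rate_by_workload(events)
```

In this made-up log the fleet-wide rate is 25 percent, while the security workload sits at 50 percent and the marketing workload at zero, which is the gap the paragraph above describes.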

In practice

An automated security review asks a model to reason about an authentication weakness, and a cyber classifier reads the phrasing as offensive tooling and declines. The response is a clean success with no usable output, the error dashboard stays green, and the pipeline records a no-finding instead of a no-answer. The contract-level fix is small: log the refusal stop reason and its category as a distinct metric, alert on the handler's default branch, and decide per category whether the request reroutes to an approved fallback model or surfaces to a person. The expensive version of the lesson is discovering the silent-drop branch months later in an audit.
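The contract-level fix might look something like the sketch below: a stop-reason handler that counts refusals by category, applies a per-category policy, and treats any unrecognized value as an alert. The category labels, the policy table, the returned action names, and the set of known completion values are all illustrative assumptions; a real integration would substitute its own metrics client and its provider's documented values.

```python
import logging
from collections import Counter
from typing import Optional

log = logging.getLogger("refusals")
metrics: Counter = Counter()  # stand-in for a real metrics client

# Hypothetical per-category policy; category names are illustrative only.
CATEGORY_POLICY = {"cyber": "reroute_to_fallback", "bio": "escalate"}

# Illustrative set of stop reasons the integration treats as usable output.
KNOWN_COMPLETIONS = {"end_turn", "stop_sequence"}


def handle_stop_reason(stop_reason: str, category: Optional[str] = None) -> str:
    """Map a stop reason to an action; never silently drop an unknown value."""
    if stop_reason in KNOWN_COMPLETIONS:
        return "use_output"
    if stop_reason == "refusal":
        # A distinct metric, so refusals show up even when error dashboards are green.
        metrics[("refusal", category)] += 1
        log.warning("model refusal, category=%s", category)
        return CATEGORY_POLICY.get(category, "surface_to_person")
    # Default branch: loud on purpose, since new values arrive with new models.
    metrics[("unknown_stop_reason", stop_reason)] += 1
    log.error("unrecognized stop reason: %s", stop_reason)
    return "alert"
```

With this shape, the security-review pipeline would record a refusal as its own outcome and would no longer write it down as a no-finding.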

Practical considerations

Instrument refusals as their own signal with their own alert, because monitoring built on error rates or transport failures never sees a success-shaped decline. Make the unknown stop reason loud: the default branch of a stop-reason handler should alert, not no-op, since new outcome values arrive with new model generations. A refusal before any output and one mid-stream have different billing and recovery shapes, so exercise both paths in testing rather than assuming the cheap one. Where the provider exposes a category field, it names the screened domain and makes per-category policy possible: reroute one category to an approved fallback, surface another to the user, escalate a third. Workloads adjacent to screened domains, security tooling and cryptography among them, should sample their own refusal rate before trusting any fleet-wide figure. In a cross-vendor integration, normalize each provider's refusal signal into one internal event type, because the shapes differ enough that per-vendor handling drifts.
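For the cross-vendor case, one possible shape for the normalization layer is sketched below. It maps two success-shaped signals mentioned in this entry (a refusal stop reason and a content-filter finish reason) onto a single internal event. The vendor labels, the response field names, and the RefusalEvent type are assumptions for illustration. Flagged requests that a vendor surfaces as errors would need a separate path in the error handler and are not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefusalEvent:
    vendor: str
    signal: str                     # the vendor-specific signal that was observed
    category: Optional[str] = None  # screened domain, when the vendor exposes one


def normalize(vendor: str, response: dict) -> Optional[RefusalEvent]:
    """Translate a vendor response dict into one internal refusal event, or None.

    Field names below are assumed shapes, not confirmed API contracts.
    """
    if vendor == "anthropic":
        if response.get("stop_reason") == "refusal":
            return RefusalEvent(vendor, "stop_reason=refusal",
                                response.get("refusal_category"))
    elif vendor == "azure_openai":
        for choice in response.get("choices", []):
            if choice.get("finish_reason") == "content_filter":
                return RefusalEvent(vendor, "finish_reason=content_filter")
    return None
```

Downstream alerting and routing then depend only on RefusalEvent, so adding a vendor means adding one branch here and leaves the refusal handling elsewhere unchanged.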

Related standards and prior art

Anthropic: refusals and fallback · 2026-06-09 · Documents refusal as a successful response with a dedicated stop reason and typed category, and the guidance to instrument refusals as their own signal.
Constitutional Classifiers++ (arXiv preprint) · 2026-01-08 · Production-grade prior art for the inference-time classifier mechanisms behind classifier-driven refusals.
Microsoft: content filtering for Azure OpenAI · continuously updated · Cross-vendor expression of the same outcome via a content-filter finish reason, with semantics that differ in detail.
Blind Refusal (arXiv preprint) · 2026-04-03 · Independent measurement of models declining legitimate requests, the over-refusal failure mode.

Defined by Ready Solutions AI
