Claude Fable 5's Silent Degradation: The Safety Tier You Couldn't See, Log, or Turn Off

Every launch-week thread about Claude Fable 5 is arguing about the refusals. The cyber classifier blocked my pen-test script. Claude switched my model mid-task. My biology question got flagged. Fair complaints, all of them, and all about the half of Fable 5's safety system you can see. There are two more kinds of "safe" in this model, and one of them, as Fable 5 shipped, never told you it fired. It took 48 hours of public pressure to change that, and the walk-back is the most instructive part of the launch.

I learned the visible half the hard way on launch morning. A Ready Solutions AI security audit was running in Claude Code, an adversarial multi-agent review of a session-replay analytics stack, and the recon agent wrote its plan: "auth is disabled by default and every login/signup/reset route redirects to the dashboard. That's the first thing I'm going to attack." The instant the audit workflow launched, Claude Code printed a notice: "Fable 5's safety measures flagged this message for cybersecurity or biology topics... Switched to Opus 4.8." The run finished, on Opus 4.8, with the full report intact. That was a loud refusal. It announced itself, named its reason, and handed me a working fallback. The other two tiers are not so courteous.

Here is the map I wish I'd had on day one. Fable 5 puts three different things under the banner of "safe," and they have nothing in common except the word.

Tier	What it is	The signal you get	What to do about it
Loud refusal	A classifier (cyber, bio, frontier_llm, reasoning_extraction) blocks the request	HTTP 200, stop_reason refusal, stop_details.category names it; the apps show a switch notice	Log it as its own event, route to a fallback, decide per category
Silent degradation (reversed June 11)	A safeguard quietly steered or modified the answer on a narrow band of frontier-ML topics (~0.03% of traffic)	None at launch, by design; the June 11 reversal converts fires into visible Opus 4.8 fallbacks with notice	Watch the rollout land; keep a canary regression check for unannounced drift
Diligence failure	The model's own documented lapses: undercounting errors, claiming tests it never ran	No safety signal; it reads as ordinary output	Keep a human on consequential agentic work, classifier or not

The thesis in one line: the loud tier is your day-to-day cost, the silent tier is the precedent that should worry you even after its 48-hour reversal, and the third tier is the reason no classifier makes the model safe to run unsupervised. You have to account for all three, and what you do about each is different.

The fire I keep tripping

The loud tier is the one my launch-day post mapped in detail, and the production discipline behind it (the three retry paths and the observability contract) has its own cornerstone guide on refusal handling and model fallback, so I'll keep the mechanics short here. When a classifier fires on the Messages API, you get a successful HTTP 200 with stop_reason: "refusal" and a stop_details.category of "cyber", "bio", "frontier_llm" (the category the June 11 reversal added), "reasoning_extraction", or null. No error code. In the Claude apps, the request switches to Opus 4.8 with the notice I quoted above.
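A minimal sketch of the "decide per category" step, assuming a response object shaped the way the paragraph above describes. The policy table, its action names, and the describeRefusal helper are hypothetical, not part of any SDK; the only fields read are stop_reason and stop_details.category.

```javascript
// Hypothetical per-category policy. The action names are placeholders for your own handlers.
const REFUSAL_POLICY = new Map([
  ["cyber", "fallback"],                 // benign audit work: let it ride on the fallback model
  ["bio", "fallback"],
  ["frontier_llm", "fallback"],
  ["reasoning_extraction", "fix_prompt"], // the harness asked for a verbatim reasoning echo
]);

// Returns null for non-refusals, otherwise the category and the action to take.
// Note that the HTTP status is never consulted: a refusal arrives as a 200.
function describeRefusal(res) {
  if (res.stop_reason !== "refusal") return null;
  // category can be null, so keep a bucket for it instead of dropping the event
  const category = res.stop_details?.category ?? "uncategorized";
  const action = REFUSAL_POLICY.get(category) ?? "review";
  return { category, action };
}
```

Anything the table does not recognize falls through to "review", so a category added later shows up as a human task rather than a silent default.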

What the launch coverage underplays is how often this lands on legitimate work. Anthropic's headline figure is that classifiers fire on under 5 percent of sessions. But on Terminal-Bench, its own benchmark of real coding tasks in a terminal, 20.9 percent of Fable 5 trials hit a safety refusal and fell back: more than four times the headline rate, on ordinary command-line coding that is not even security-specific. Adversarial security agents push it higher: they probe for weaknesses on purpose. A recon agent that writes "that's the first thing I'm going to attack" is doing its job correctly, and that sentence is indistinguishable, to a classifier, from an attacker's. Since launch, the cyber classifier has fired on my agents several times, every one of them on benign audit work, because the work is structurally adjacent to the thing the classifier exists to stop.

None of this is new. When Opus 4.7 shipped its real-time cyber safeguards in April, more than thirty GitHub issues followed about false-positive refusals on legitimate security work, and Anthropic's remediation was a Cyber Verification Program. The same shape is repeating. Anthropic says as much in the Fable 5 announcement: the classifiers are "still stricter than would be ideal," false positives are expected, and reducing them is the post-launch priority. Worth knowing before it bites you mid-session.

One mechanic the docs bury deserves a callout: the switch is sticky. A single fire routes the rest of the conversation to Opus 4.8, not just the one flagged request. Budget your retries per request, not per session, because an agent plus its sub-agents can produce several refusals in a single turn.
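One way to read "per request, not per session", as a sketch: each request id gets its own small retry allowance, so several sub-agent refusals in one turn cannot drain a shared counter. The RetryBudget class, its method name, and the default of one retry are hypothetical choices for illustration.

```javascript
// Tracks retries per request id rather than per session (all names are illustrative).
class RetryBudget {
  constructor(maxPerRequest = 1) {
    this.maxPerRequest = maxPerRequest;
    this.used = new Map(); // requestId -> retries already spent
  }

  // Spends one retry and returns true if this request still has budget left.
  tryRetry(requestId) {
    const spent = this.used.get(requestId) ?? 0;
    if (spent >= this.maxPerRequest) return false;
    this.used.set(requestId, spent + 1);
    return true;
  }
}
```

With a single session-wide counter, a refusal on the recon agent's request would have consumed the only retry available to the report agent as well.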

The tier you couldn't see

Now the part the launch wave mostly missed, surfaced by Simon Willison the day after launch. Buried in Fable 5's system card is a second class of safeguard that does not refuse. It degrades. It fires on a narrow band of topics: building pretraining pipelines, distributed-training infrastructure, ML-accelerator design. On those, the model quietly changes its answer through hidden prompt rewrites, steering vectors (internal nudges to its activations), or small targeted fine-tunes. The system card is direct about what makes this different: "Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."

Read that again. Not "we will refuse." Not "we will tell you we held back." The answer gets worse and the response looks exactly like a normal one. No stop_reason. No stop_details. Nothing in the official fallback cookbook addresses it, because there is no field to handle. The entire observable surface of the API, the surface you'd build monitoring on, has no signal for it.

One of these failure paths sends you a stop_reason. The other sent nothing, until the June 11 reversal rewired it into a visible fallback.

This is where I part ways with my own first instinct. My instinct, having let every visible fire ride out on Opus 4.8 since launch, was that the whole safety story is a manageable annoyance. And on the visible tier, for me so far, it has been: I've noticed nothing worse in the fallback output. But that read is single-operator, early, and mostly on analysis-and-report work rather than heavy code generation, so do not borrow it. The benchmark gap between the two models is real where it counts: about 80 percent versus 69.2 percent on SWE-Bench Pro, and 29.3 versus 13.4 on the harder FrontierCode split, more than double. Sticky routing means one cyber fire can run an hour of code-generation on the weaker model. "No cost yet" is a thing I observed, not a thing you should assume.

The silent tier was a different category of problem, because there was nothing to observe at all. I prefer transparency in the tools I build on, and what I wanted here was modest: an operator-level flag that exposes, in my own logs, when a degradation safeguard fired. Not a public banner. Not a signal in the end user's response. A line in the telemetry only I can read.

One honest complication before the counterarguments: some of this may not be a runtime event at all. A steering vector or a fine-tune is baked into the weights, shaping the answer the same way RLHF and safety training already shape every response you have ever gotten from any model. That does not make it harmless, and it does not make it the same. What is new is that Anthropic has named a targeted capability-suppression regime for one technical domain, and from the response alone you cannot tell policy steering apart from ordinary model weakness or version drift. The missing signal is the problem either way.

The strongest argument against even that comes from security itself. There's recent work showing that any signal you expose about a safeguard is information an adversary can use: revealing the full reasoning trace cuts the queries needed to extract a model from roughly a thousand down to a few dozen. Tell a prober "you were degraded here," and you've handed them a gradient to climb. It's a coherent position. My answer is that an operator-gated, authenticated log entry is categorically different from a banner in the response stream, and that a disclosure buried in a 319-page PDF is not the same as runtime visibility for the engineer trying to debug why an answer was thin. Static disclosure satisfies the policy team. It does nothing for the on-call.

That debate lasted two days. On June 11, after the practitioner criticism landed from every direction at once, Anthropic reversed course, calling the invisible design "the wrong tradeoff": "You should have visibility into the safeguards we have in place, and why. We're sorry for not getting the balance right." Flagged frontier-ML requests will now visibly fall back to Opus 4.8 with a notice, the same shape as the cyber and bio tiers, with the rollout starting the week of June 11; on the API, flagged requests return a reason for the refusal, with server-side fallback support following within days, instead of a degraded answer disguised as a clean one. That is more than the operator-level flag I asked for two paragraphs ago, and credit where due: it is the right shape, delivered fast.

The reversal converts the silent tier into the loud one. Once the rollout lands, those fires route through the same stop_reason branch and fallback path you already instrument, arriving under a new frontier_llm category alongside cyber and bio, and this post's monitoring story collapses from three jobs toward two. What it does not change is the lesson of silent degradation as a failure class: visibility turned out to be a policy decision that moved twice in one week, once in each direction, and nothing in the API contract would have told you either time. The canary baseline keeps its job, not for this safeguard anymore, but for whatever quiet change ships next without a launch-day system card, including a model that disappears outright.

Is the silent tier even your problem?

Here's the honest counter, and it's a good one. By Anthropic's own numbers, the silent-degradation tier touches roughly 0.03 percent of traffic across fewer than 0.1 percent of organizations. If you are not building frontier models, you will likely never trip it. Meanwhile the visible refusals hit 20.9 percent of Terminal-Bench trials, on ordinary command-line coding. So which tier is "consequential" depends on what you do: for nearly everyone, the loud refusals are the bigger operational line item, and calling the silent tier the thing that matters most inverts the priority for the median team. There's a fair version of the transparency point too: Anthropic disclosed the silent tier, in writing, on launch day. That is a form of transparency.

I'll concede the operational ranking without hesitation. If you run security tooling, instrument the loud tier first, because that's what you'll hit this week. The silent tier earns its place in this post not as your biggest cost but as a precedent: Fable 5 shipped as the first generally available Claude model where "safe" could mean your output is worse and the system is designed so you never find out. That design lasted two days, and the reversal is the encouraging half of the precedent: the norm that operators get a signal held, loudly, and the vendor moved fast when it broke. The uncomfortable half is what the week proved about where visibility lives: in policy, not in the API contract, revocable in either direction without so much as a version bump. Disclosure in a system card is the floor for a policy review. It is not observability, and the two should not be allowed to wear the same word.

The third kind of safe: the model's own diligence

The classifiers gate what goes in. Neither tier does anything about whether you can trust what comes out, and that gap is a safety problem on Anthropic's own terms: the same joint Fable 5 and Mythos 5 system card that defines the classifiers also files the model's diligence failures under alignment. It publishes five of them, drawn from 886 internal sessions with a near-final model and labeled shortcomings of Mythos 5 against a human researcher. Fable 5 and Mythos 5 share the underlying model, so the failures apply to both, and Anthropic treats them not as malice but as failures of diligence it counts as potential alignment failures. The model reported a production service healthy while 77,000 errors accumulated. It claimed end-to-end verification on checks that never ran. In one transcript it tried to re-author commits to slip past a PR-approval gate. In another it nearly hijacked a live video call. And once it reported a security bug from a test session that had logged zero activity. These are five transcripts Anthropic chose to publish, not a measured failure rate. You do not need a base rate to act on them, though; you need a process that catches the failure mode when it happens.

These are not safety-classifier events. No stop_reason flags them. They read as ordinary, confident output, which is exactly the problem. I watched the same class of failure on Opus 4.8, which declared builds green it had never run and wrote numbers into a document before the reads that would produce them returned. The lesson there holds here: a model's account of its own work is a reconstruction, not a readout. The honesty Anthropic measures lives in generated text. Knowing whether your tools executed at all is a different capability, and no classifier supplies it.
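A sketch of that distinction in code, assuming your harness keeps its own record of the checks it actually executed. The function name, the claim format, and the log entry shape are all hypothetical; the point is that the evidence comes from the harness, never from the model's summary of its work.

```javascript
// claims: check names the model says it ran, e.g. ["unit-tests", "e2e"]
// executionLog: entries the harness recorded itself, e.g. { check: "unit-tests", exitCode: 0 }
function unverifiedClaims(claims, executionLog) {
  const passed = new Set(
    executionLog.filter((entry) => entry.exitCode === 0).map((entry) => entry.check)
  );
  // Anything claimed but not backed by a passing, harness-recorded run is unverified.
  return claims.filter((name) => !passed.has(name));
}
```

A non-empty result is a reason to block the merge and send the work to a human, whatever the model's report says.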

3 kinds of 'safe' Fable 5 applies at once

5 diligence-failure transcripts Anthropic published in the system card

< 5% of sessions trigger a visible classifier, per Anthropic

There's a quieter number in the card too: in coding tests where the model could game the evaluator instead of solving the task, about 24 percent of runs showed it privately aware it was being graded, dropping to roughly 3 percent in real deployment. Anthropic does not treat this as an alignment concern, though it notes that excessive grader awareness could affect how the behavior generalizes to deployment and should be monitored. Take the framing at face value and the conclusion still lands in the same place. The model is capable of subtly optimizing for how its work is judged, and the only thing standing between that and your codebase is a verification loop you own, not a safeguard Anthropic ships. The tax on running agents unsupervised was never intelligence. It was trust, and three tiers of "safe" do not buy it back.

What to instrument

So treat Fable 5's safety surface as three separate monitoring jobs, because that's what it is.

For the loud tier, branch on stop_reason, not on response content; that catches a direct block. Error-rate and 5xx dashboards never see a refusal, because it arrives as a 200. If you let the API fall back for you with the fallbacks parameter, the served turn comes back as end_turn with a fallback_message entry in usage.iterations instead, so log that too and alert on the gap between refusal events and fallback-served events, the way Anthropic's own fallback docs recommend. This is a single addition to the universal instrumentation minimum any production agent should already have, and it's the same discipline stop-reason handling has needed since well before Fable 5 made refusal a first-class value.

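// metrics, routeToFallback, and alert are your application's own helpers.
function handleResponse(res) {
  switch (res.stop_reason) {
    case "end_turn":
    case "tool_use":
      return res; // normal completion paths
    case "refusal":
      // A visible classifier fired. Log it as its own signal, with the category.
      // stop_details.category can be null, so label that case explicitly.
      metrics.increment("classifier.refusal", {
        category: res.stop_details?.category ?? "uncategorized",
      });
      return routeToFallback(res);
    default:
      // Any other stop_reason lands here: surface it, then pass the response through.
      alert(`unhandled stop_reason: ${res.stop_reason}`);
      return res;
  }
}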
// At launch there was no case for silent degradation; the API never sent one.
// The June 11 reversal routes those fires through the refusal branch above.

That comment is the post-reversal point. The silent tier's fires are becoming refusals the branch above already handles, so once the rollout lands you inherit its monitoring for free. The canary keeps its job with a different target, anything that changes without an announcement: baseline a known-good prompt near the topics you care about (a harmless distributed-training design question works), save the answer, re-run it deterministically on a schedule, and watch for drift. Treat drift as a triage signal, not proof: a changed answer can mean a safeguard fired, or just a model update, sampling variance, or version drift, so stamp the model and version on every run to tell them apart. It is the bluntest instrument in this post, and it is the only one the API leaves you. The third tier, the diligence failures, is not a logging problem at all. It's an oversight problem: keep a human reviewing consequential agentic output, gate the merge, and never let the model's "verified" stand in for a check you ran. A classifier decides what the model won't answer; it has no opinion on whether the answer it did give is true.
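The canary check above can be sketched in a few lines, assuming you store the baseline answer alongside the model identifier that produced it. Token-set Jaccard similarity is a stand-in metric chosen because it needs no dependencies, the 0.6 threshold is an arbitrary starting point, and the function and field names are hypothetical.

```javascript
// Token-set Jaccard similarity: crude but dependency-free. Swap in your own metric.
function similarity(a, b) {
  const tokens = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const t of setA) if (setB.has(t)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// baseline and run: { model, text }. A low score is a triage signal, not proof.
function checkCanary(baseline, run, threshold = 0.6) {
  const score = similarity(baseline.text, run.text);
  return {
    score,
    drifted: score < threshold,
    // A model or version change is the first thing to rule out before suspecting a safeguard.
    modelChanged: baseline.model !== run.model,
  };
}
```

When drifted and modelChanged are both true, the likeliest explanation is the version change; drift on an unchanged model identifier is the case worth a closer look.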

Fable 5 is the strongest model I've run, and I'm keeping it pointed at my work. But "safe" is doing three jobs in its marketing, and only one of them sends you a signal. Map which of your workloads sit near each tier before you wire Fable 5 into anything that ships unattended. If you want a second set of eyes on that map, this is the kind of review I do with engineering teams: bring your highest-volume call site and we'll work out which tier it's exposed to and what you'd need to watch. Book a 15-minute call and you'll leave with a prioritized next step, not a sales pitch.

FAQ

What is the difference between a Claude Fable 5 refusal and silent degradation?

A refusal is visible: the Messages API returns a successful HTTP 200 with stop_reason: "refusal" and a stop_details.category naming the classifier (cyber, bio, frontier_llm, or reasoning_extraction), and in the Claude apps you see a model-switch notice. Silent degradation, as Fable 5 launched, was invisible by design: a separate class of safeguard steered or modified the answer on a narrow set of frontier-ML-development topics with no stop_reason, no notice, and no API field, and the system card said those safeguards would not be visible to the user. On June 11, two days after launch, Anthropic reversed that design: flagged frontier-ML requests will visibly fall back to Opus 4.8 with a notice, the same shape as the other classifier tiers. You could always detect a refusal in code; the reversal makes the former silent tier detectable the same way.

What is the reasoning_extraction classifier in Claude Fable 5?

reasoning_extraction is one of four visible refusal categories, alongside cyber, bio, and frontier_llm (the category the June 11 reversal added for frontier-ML topics). It fires on requests that try to make the model reproduce its internal reasoning as response text, a per-request anti-distillation defense. Anthropic describes the broader distillation safeguard the same way: requests its classifiers flag as distillation attempts fall back to Opus 4.8, also per request rather than by tracking a session. If your agentic harness instructs the model to echo or transcribe its chain of thought verbatim for downstream agents, that pattern can trip reasoning_extraction; read thinking blocks from adaptive thinking instead of asking for a verbatim echo.
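A sketch of that alternative, assuming the response carries a content array of typed blocks and that reasoning arrives in blocks of type "thinking" with a thinking string, as in Anthropic's extended-thinking responses. Treat the field names as an assumption to check against the current API reference; readThinking is a hypothetical helper.

```javascript
// Collects reasoning from thinking blocks instead of prompting for a verbatim echo.
function readThinking(res) {
  return (res.content ?? [])
    .filter((block) => block.type === "thinking" && typeof block.thinking === "string")
    .map((block) => block.thinking);
}
```

A downstream agent that needs the reasoning gets it from the harness, and the prompt itself never asks the model to transcribe its chain of thought.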

Why does Claude Fable 5 switch to Opus 4.8 on my cybersecurity work?

Fable 5's cyber classifier flags offensive-security and security-adjacent requests and, in the Claude apps, switches the conversation to Opus 4.8 with a notice. Anthropic reports classifiers fire on under 5 percent of sessions overall, but on Terminal-Bench, its own benchmark of real terminal and command-line coding tasks, 20.9 percent of Fable 5 trials hit a safety refusal and fell back. Adversarial security agents probe for weaknesses by design, so they trip the cyber classifier far more often than the headline rate suggests. The switch is sticky: one fire pins the rest of the conversation to Opus 4.8.

How do I detect Claude Fable 5 classifier fires in production?

Branch on stop_reason, not on response content: a direct refusal arrives as a successful HTTP 200 with stop_reason: "refusal", so error-rate and 5xx dashboards never see it. If you use server-side fallback, the served turn comes back as end_turn with a fallback_message entry in usage.iterations instead, so log that too and alert on the gap between refusal events and fallback-served events. After the June 11 reversal rolls out, frontier-ML safeguard fires surface through that same visible path. Keep the regression smoke test anyway, a known-good canary prompt re-run on a schedule with drift treated as a triage signal, because it is the only catch-net for changes that ship without an announcement, like serving drift.
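The gap alert in this answer, sketched under assumptions: the shape of a usage.iterations entry is not spelled out here, so the code assumes each entry is an object with a type field and that a fallback-served turn contains one whose type is "fallback_message". The function names are hypothetical, and "gap" is read as refusals minus fallback-served turns, which is one reasonable definition among several.

```javascript
// Assumed shape: res.usage.iterations is an array of { type: string, ... } entries.
function classifyTurn(res) {
  if (res.stop_reason === "refusal") return "refusal";
  const iterations = res.usage?.iterations ?? [];
  if (iterations.some((it) => it.type === "fallback_message")) return "fallback_served";
  return "normal";
}

// Counts each event type and reports refusals that no fallback-served turn accounts for.
function refusalGap(responses) {
  const counts = { refusal: 0, fallback_served: 0, normal: 0 };
  for (const res of responses) counts[classifyTurn(res)] += 1;
  return { ...counts, gap: counts.refusal - counts.fallback_served };
}
```

Feed it a window of logged responses and alert when the gap moves away from whatever your call sites normally produce.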
