A teammate ran the automated code review on my hotfix. It came back with a "critical blocker" and several "high-priority" issues. None of them were issues. They were nits. Add an instructive comment here. Cover an extreme edge case there. The kind of feedback you bank for the next refactor, not the kind that holds a production push.

The pipeline did not know the difference. It saw a single context window, three things to grade, and a label vocabulary of Critical / High / Medium / Low. It distributed across the spectrum it had. The "critical blocker" was simply the most-prominent-looking nit. Inside that one run, the label was locally accurate. As an escalation signal to the merge gate, it was completely wrong.

That gap, between locally-accurate-within-one-context and absolutely-meaningful-across-runs, is where most review-automation pipelines I see today have a bug they have not diagnosed. The bug is that Claude grades severity on a curve, and the curve is whatever currently fills its context window. If you wire your automation to treat those labels as portable signals, a context full of nits will still produce a "Critical." And your hotfix will stall on a phantom blocker until a human refutes it.

Anthropic has documented the behavior in two of their own product surfaces. The literature explains why it happens, and the mitigations are concrete.

The hotfix that almost stopped on three nits

Set the scene. Priority issue. Customers feeling it. I triaged, wrote the fix, tested it, checked the blast radius. Tight change. Ready to ship.

A teammate, doing the right thing, kicked off the automated code review before merging. The output came back fast: one Critical, several Highs. Nothing flagged in the actual diff logic. Every finding was about whether a code comment could have been more instructive, or whether an unlikely edge case was worth a unit test in a future PR. Across the whole report, there was nothing the human reviewers or the test suite had missed in the substance of the fix.

I refuted the findings in the thread, in language that turned out to be the whole thesis of this post: "looks like these are pretty minor requests for commenting rather than any actual code issues/changes, right? do you feel that these are worth implementing or holding up merging for?" We pushed. The fix landed. Customers stopped feeling it. The detour cost minutes.

It cost minutes because a human was in the loop and the human was paying attention. The same review output, going into a fully automated merge gate wired to block on Critical, would have stalled the hotfix indefinitely. That is the shape of the problem. The labels are the failure mode that surfaces. The wiring of automation around those labels is what turns the failure mode into a production incident.

Same week, my team and I reworked our code-review skills to reference external severity taxonomies and started encoding our own rubrics in the repo. Cheap mitigation; documented why later. This post is the why and the what to copy.

Why Claude grades on a curve: three forces and one complication

The behavior is not a defect specific to one prompt or one model version. Three forces converge on it, with a fourth that complicates the easy fix.

One: completion pressure. You give Claude a job. "Find any bugs or issues in this diff." Saying "I found nothing" reads, to a goal-pursuit-tuned model, like failing the job. The structural pull is to produce something. Anthropic's own research has named the progression directly. Mrinank Sharma and a team of nineteen Anthropic researchers, in the sycophancy paper from 2023 that landed at ICLR 2024, showed that five state-of-the-art assistants consistently produce sycophantic responses across free-form generation tasks, and that human raters and preference models prefer sycophantic answers over correct ones a non-trivial fraction of the time. The follow-on Anthropic Alignment Science paper, Sycophancy to Subterfuge, demonstrated that the same training pressure generalizes zero-shot to checklist alteration (so incomplete tasks appear complete) and even to reward tampering with track-covering. The behavior is not "Claude is dishonest." The behavior is "RLHF (reinforcement learning from human feedback) optimizes for what the user will approve of, and admitting task failure is rarely what gets approved."

Two: label inflation when the rubric is absent. Liu, Zaharia, and Datta at Databricks published a grading-notes paper showing that baseline GPT-4 rated 96.9% of Llama outputs as effective; humans, on the same outputs, accepted only 71.9%. Injecting grading notes, which are domain-specific rubric annotations, raised human alignment from 74.7% to 96.3% for Llama3-70B and from 78.8% to 93.1% for GPT-4o. Without an external anchor, judges approve and inflate. With a domain anchor, the gap closes. Code-review pipelines that ship without an anchored severity rubric are sitting on the high end of that 25-point gap by default.

Three: distribution completion across the available label vocabulary. Give the model "Critical / High / Medium / Low" as the output schema and it will use the schema. Shi and colleagues found in Judging the Judges, across 15 LLM judges and over 150,000 instances, that position bias is "strongly affected by the quality gap between solutions." Translation: position consistency drops when the candidates are similar. Three nits look most prone, not least, to getting split across Critical / Major / Minor. The smaller the real gap between the things you are grading, the more the model will manufacture a gap to fit the vocabulary you handed it.

Tip

The curve in one sentence. Completion pressure tells the model to produce findings; alignment training inflates labels when no anchor is present; the label vocabulary you provide gets fully populated regardless of absolute severity. Three forces. One curve.

Anthropic has now documented this in its own prompting guide. The Claude prompting best practices doc contains a Code review harnesses section warning that qualitative filtering language ("only report high-severity issues," "be conservative," "do not nitpick") in a prompt may cause Claude Opus 4.7 to follow it more literally than prior models. The model silently drops lower-severity findings. Measured recall falls even when bug-finding ability improved underneath. Anthropic's prescription is the opposite of what the default prompting instinct produces: ask for coverage, defer confidence-filtering to a separate downstream step. I will come back to that exact prompt pattern below.

The complication. A common instinct, when severity grading misfires, is to make the review prompt more elaborate. Add explanations. Ask for suggested corrections. Pile on context. Jin and Chen at the University of Sydney published a 2026 paper on what they call systematic overcorrection in requirement-conformance judgment that warns this fix produces the opposite outcome. They found that asking the model for explanations and suggested corrections causes GPT-4o's false-negative rate on HumanEval to jump from 26.2% to 73.2%; Claude's FNR on MBPP rose from 58.5% to 62.3% under the richer prompt. Their summary line: "detailed prompts may inadvertently introduce biases toward excessive fault finding." Fancier prompts feed the curve. The fix is not more prompt; it is structural anchoring, which is the next half of the post.

What "critical" actually means inside one run

Here is where the mechanism turns into something concrete you can audit. Anthropic ships an official Claude Code Review tool (research preview, Team and Enterprise plans). The documented severity scale has three tiers: Important (red, a bug to fix before merging), Nit (yellow, minor and not blocking), and Pre-existing (gray, a bug in the codebase not introduced by this PR). Anthropic explains, in the same docs, that you are expected to redefine what "Important" means for your repo because "the default calibration targets production code; a docs repo, a config repo, or a prototype might want a much narrower definition."

Read that twice. The official tool ships with a three-tier scale and tells you, in the official documentation, that the default calibration is not portable across repos. The vendor of the model whose severity labels you are wiring automation around is on the record that those labels are context-relative. This is not a hidden quirk you have to surface through experimentation. It is a published design property.

Academic work backs it. Zeng's group at Peking University ran a 2025 benchmark of LLM-based code review across five independent runs of the same model against the same code. They found only 27 overlapping change-points across runs and reported that inter-rater agreement on severity scores was, in their own phrasing, "notably lower across all evaluator pairs, including human-human comparisons." They excluded severity from their primary metrics. The signal was too unstable to anchor a measurement on. If the researchers building the benchmark had to drop severity to get clean numbers, your CI/CD wiring that escalates on a severity label string is making a stronger claim about that label than the researchers will.

So what does "Critical" mean inside one Claude review run? It means the model produced findings, opened the label vocabulary it was given, and applied the strongest available label to the strongest finding it produced. It is an ordinal rank inside that run's distribution. It is not a cardinal label against an external standard. The two look identical in the output. They behave nothing alike when you consume them.

Here is the shape of the distinction at a glance:

PropertySingle-pass severity (default)Externally-anchored severity
Label scopeLocal to one context windowTied to an external rubric (CWE, CVSS, OWASP, in-repo SEVERITY.md)
Portable across runsNoYes, when the rubric is referenced consistently
Survives sparse runs"Critical" still appears even if findings are minorSparse run produces sparse report; no inflation
Survives dense runs"Critical" still appears even if everything is seriousThreshold-defined labels; multiple Criticals are possible
Safe to wire to merge gatesNo, without a human-in-the-loop refutation stepYes, when the rubric is enforced upstream

Both columns look like the same product feature. They are not.

The steel-man: when calibration actually holds

I want to be honest about the strongest counter-argument. Anthropic has stated, in the Claude Code Review launch coverage and product docs, that engineers mark fewer than 1% of findings as incorrect. That number is real. It is also measuring something different from what this post is arguing. "Incorrect" measures whether the finding describes a real artifact in the code. "Severity label is portable" measures whether the Critical / Important / Nit tier the model attached to that finding means the same thing run-to-run. A finding can be a correct observation about the code and still have a context-relative severity label. The two claims are independent.

Three places where the curve genuinely flattens.

Cloudflare runs the first. Ryan Skidmore wrote up Orchestrating AI Code Review at scale in April 2026, covering 131,246 PR review runs that achieve roughly 1.2 findings per review. The architecture is the interesting part. They pre-define hard severity tiers per specialized agent, encode "What NOT to Flag" constraints inside each agent (one constraint reads, roughly, "do not flag theoretical risks that require unlikely preconditions"), and add a coordinator agent that re-reads the source code to apply a reasonableness filter to the findings before they ship. The system works because the calibration is engineered upstream, in the agent prompts, with hard guardrails. The curve still exists at the model layer. The pipeline does not let it propagate downstream. This is the same shape I argued for in the AI Diligence Operating System post: trust comes from the artifact and schema layer, not from the agents producing the output.

Locked-rubric frameworks are the next case. The RULERS paper from January 2026 compiles natural-language rubrics into "versioned, immutable bundles" with structured decoding, deterministic evidence verification, and post-hoc Wasserstein calibration for scale alignment. Hong and colleagues show that this combination outperforms baseline judges in human agreement and lets smaller models rival proprietary ones. The trick is that the model cannot reinterpret the rubric mid-run because the rubric is bound by structure, not by free text. Pathak and colleagues at ACM ICER 2025 demonstrated the simpler version of the same idea: question-specific rubrics raise grading correlation with human judges from r=0.717 (no rubric) to r=0.912 on object-oriented programming code evaluation. Both papers show that the curve is not inevitable. It is the default behavior of a default-prompted pipeline.

Then there is Anthropic's own marketplace plugin skills and the Superpowers ecosystem. I still love Superpowers. The skills that ship through that ecosystem are tooled, calibrated, and battle-tested in ways that an ad-hoc CLAUDE.md prompt is not. I still manually review the findings rather than blind-accepting any of them, but the curve inside those skills is meaningfully tighter than the curve inside a default review prompt. The lesson is the same as Cloudflare's: when the skill brings an external rubric and tooling discipline, the labels behave better. The author of the skill did the engineering work for you.

So the thesis is not "Claude cannot grade severity." The thesis is "default-prompted review pipelines grade on a curve unless you do the engineering work to anchor the labels externally, and Anthropic itself is on the record about this in two product surfaces." If your pipeline includes the engineering work, you are reading this for the diagnostic, not for the diagnosis.

Five patterns that flatten the curve

I work with engineering teams on this exact problem under the Agentic Workflow / Automation Development umbrella. The patterns below are what I reach for when a team's review pipeline is producing labels their wiring takes too seriously. Pick the ones that match where your pipeline breaks.

1. Coverage-first prompting. This is the pattern Anthropic itself prescribes in the prompting best-practices doc. Instead of writing "only report high-severity issues" or "be conservative, do not nitpick," write the inverse:

Report every issue you find, including ones you are uncertain about or consider low-severity. Your goal here is coverage. Severity labeling and confidence filtering happen in a separate downstream step.

The model produces the long list, you filter the list with a second pass or a deterministic rule. You are decoupling discovery from grading. Anthropic frames this as a hedge against Opus 4.7's improved literalism, which causes the model to drop findings rather than surface-and-grade them when the prompt asks for selectivity.

2. Reference an external severity taxonomy, and narrow the scope. Point the review prompt at a Tier-1 anchor. For security work, that is CWE or CVSS or the OWASP Top 10. For your own codebase, a SEVERITY.md at the repo root that defines what Critical means here. Pair the rubric reference with a narrow, scoped diff and a concrete anchor example in the prompt. A free-form "find any bugs across the repo" gives the model the whole spectrum to populate; a scoped "review this 40-line diff against the rubric below, with this past example labelled Critical for reference" gives it almost no spectrum at all. The opus-compatibility-scanner I shipped does this strictly: every Critical finding has to cite an Anthropic-authored URL by source. Practitioner observations cap at Warning. When the scanner runs against a sparse project, the report comes back sparse. There is no inflation pressure because the label tier is gated on whether an external source can be cited at all, not on what else was in the scan. The scanner case study writes up the numbers: three Critical and four Warnings surfaced on my own most-curated Claude Code project, zero or one Critical on the others. Sparse scans produced sparse reports because the tier-1-citation rule cannot inflate.

3. Treat severity as ordinal ranks, not cardinal labels. Inside your automation wiring, read the model output as (rank position) and ground it against (external rubric match). A "Critical" without a matching rubric clause is a "Top finding of this run, please review." It is not a merge-gate-blocking signal. This shifts the question your automation answers from "is this label Critical?" to "does this finding match a rule in our enforced rubric?" The label becomes a pointer to where the human or the deterministic rule should look first.

4. Two-pass coordinator with a reasonableness filter. This is the Cloudflare pattern translated for smaller teams. First pass produces findings under coverage-first prompting (pattern 1). Second pass takes the findings list and the original source code, and asks: "For each finding, given the actual source code, is this severity defensible? If not, downgrade or drop." The second pass operates with a different framing and tool access, so it is less subject to the completion pressure that drove the first pass. The sub-agent orchestration patterns post walks through this shape in more detail; the sequential review chain pattern is the one that maps cleanest here.

5. Encode rubrics where they enforce, not where they advise. This is the most common configuration mistake I see. Teams put their severity rubric inside CLAUDE.md and watch the model ignore it half the time. CLAUDE.md is advisory by design. Anthropic's memory documentation says so directly: it is delivered as a user message after the system prompt, not as the system prompt, and Claude reads it but is not strictly bound by it. If your severity rubric has to hold every time the review runs, it does not belong in CLAUDE.md. It belongs in a pre-tool-use hook, a Claude Code skill body that the review subagent loads on every invocation, or a structured output_config schema with severity enums. The rule-routing decision tree post maps which directive types belong where. The shorthand is: rules that have to hold every time need a hook or a skill, not a markdown file.

If you are reaching for a single highest-impact change, it is pattern 5 followed by pattern 1. Get your rubric onto the right layer of the configuration stack, then change your review prompt from selective to coverage-first. The other patterns sit on top of that foundation.

A two-run diagnostic you can run today

A falsifiable test you can put on your own pipeline this week. It needs nothing you do not already have.

Take a real codebase. Submit it to your review pipeline twice.

Run 1 contains only known trivial nits. Style suggestions. Comment phrasing. The kind of feedback you would accept in a refactor but ignore on a hotfix.

Run 2 is the same codebase with a single genuine bug or vulnerability added. Pick something unambiguous. A SQL injection. A missing auth check. A null-dereference where the input is user-controlled.

Compare the severity labels run-to-run. If your pipeline's top label on run 1 is the same tier as your pipeline's top label on run 2, your pipeline is grading on a curve. The Critical label survived a context that should not have produced a Critical. Your wiring downstream is reading a local rank as if it were a portable signal. The fix is upstream, in one of the five patterns above. It is not in tuning the downstream filter, because the downstream filter is consuming a label that does not mean what the wiring thinks it means.

This is the test I encourage teams to run before they trust their review automation in production. If you want a structured conversation about what the diagnostic surfaced, a 15-minute call is the cheapest next step. The AI Readiness Assessment at /assess/ surfaces architecture gaps that review pipelines miss, including this one.

The single sentence to take away. Severity labels coming out of a Claude review are ordinal ranks inside one context, not portable signals across runs, and any merge gate that treats them otherwise is one nit-filled diff away from blocking a hotfix on a phantom blocker. The mechanism is documented. The fixes are cheap. The diagnostic is two runs.