Why can't you test an agent the way you test code?

The first instinct of every engineer who ships an AI agent is the right instinct: write a test. Capture an input, capture the expected output, assert they match. It works for the function next to the agent and it falls apart for the agent itself, immediately, for a reason that is structural rather than fixable. The agent is non-deterministic. Run the same prompt twice and you can get two different answers, and the gap between them is not a bug you can configure away.

The reflex defense is to pin the temperature to 0 and call it deterministic. It is not. Anthropic's own API documentation says so directly, on the live page that governs the model I run on this codebase:

"Note that even with temperature of 0.0, the results will not be fully deterministic." (Anthropic, Messages API documentation)

That is not a vendor hedge. It is a description of how the inference stack behaves. A NeurIPS 2025 study of the numerical sources of non-determinism in LLM inference held temperature at 0, used greedy decoding, and varied only the runtime setup: across twelve combinations of GPU type, GPU count, and batch size, more than ninety percent of examples diverged in the low-precision bfloat16 arithmetic that production inference normally runs in, and the accuracy of a reasoning model swung by as much as nine percentage points from one configuration to the next. The same examples stayed almost stable, around two percent divergence, only under far slower full-precision math. The cause is not randomness in the usual sense. It is non-associative floating-point arithmetic compounding through a long reasoning chain, so the answer depends on the batch your request happened to land in. A cloud API consumer controls none of that.
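The non-associativity is easy to reproduce even in ordinary double-precision Python, which is far more precise than bfloat16. The sketch below only illustrates the arithmetic, not any real inference stack; `accumulate` and the two orderings are made-up names and values chosen to make the effect visible.

```python
def accumulate(values):
    """Sum left to right with a plain loop, the way a naive reduction would."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same three numbers in two reduction orders (illustrative values only).
order_a = [1e17, 1.0, -1e17]
order_b = [1e17, -1e17, 1.0]

print(accumulate(order_a))  # 0.0: the 1.0 is absorbed by 1e17 before it cancels
print(accumulate(order_b))  # 1.0: the large terms cancel first, so the 1.0 survives

# The textbook case: grouping alone changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

A batched GPU kernel makes the same kind of ordering choice at far larger scale, and the order can change with batch size and hardware, which is the part a cloud API consumer cannot pin.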

The variance is not confined to trivia, either. A December 2025 study of refusal behavior under seed and temperature variation found that between eighteen and twenty-eight percent of safety-relevant prompts flipped decision, the same model refusing a request on one seed and complying on another, and concluded that a single run is not enough to characterize a model's behavior on a property you care about. If a one-shot test cannot reliably tell you whether the model will refuse a harmful request, it cannot reliably tell you much.
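One way to act on that finding is to measure a decision over many runs instead of asserting on one. This is a minimal sketch under stated assumptions: `stub_agent` is a hypothetical stand-in for a real model call, it assumes the run can be varied by a seed (a hosted API may not expose one, in which case plain repetition plays the same role), and its 75 percent refusal rate is invented.

```python
import random
from collections import Counter
from typing import Callable


def decision_stability(agent: Callable[[str, int], str], prompt: str, seeds) -> dict:
    """Run the same prompt once per seed and summarize how often the decision flips."""
    decisions = Counter(agent(prompt, seed) for seed in seeds)
    majority, majority_count = decisions.most_common(1)[0]
    total = sum(decisions.values())
    return {
        "majority": majority,
        # Share of runs that disagreed with the most common decision.
        "flip_rate": 1 - majority_count / total,
        "counts": dict(decisions),
    }


def stub_agent(prompt: str, seed: int) -> str:
    """Hypothetical agent: refuses roughly three runs in four (invented rate)."""
    rng = random.Random(seed)
    return "refuse" if rng.random() < 0.75 else "comply"


report = decision_stability(stub_agent, "borderline request", range(20))
print(report)
```

The useful output is the distribution, not a pass or fail: a property with a nonzero flip rate needs a threshold decided in advance, not a single assertion.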

There is a deeper problem underneath the variance, and it is older than LLMs. It is the oracle problem: to write assertEquals(actual, expected) you need to know the expected value, and for most things you ask an agent to do there is no single correct answer to put on the right-hand side. A 2026 treatment of metamorphic testing in the age of LLMs, accepted to IEEE Computer, frames the response cleanly: when you cannot specify the expected output, you assert relations between outputs instead. Paraphrasing the question should not change the answer; reordering independent options should not change the ranking. You stop testing for the right answer and start testing for the absence of wrong behavior. That shift is the whole game, and it is the same diagnostic I have written about as the dividing line between vibe coding and agentic development: the trustworthy version is the one where structural guarantees do the work, not the one where you ask the model to be careful and hope.
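The two relations named above can be written as checks that never mention an expected answer. This is a sketch, not the method from the cited paper: both rankers are hypothetical stand-ins for an agent, and a real check would compare rankings with whatever tolerance the task allows.

```python
from typing import Callable, Sequence

Ranker = Callable[[str, Sequence[str]], list]


def reorder_invariant(rank: Ranker, question: str, options: Sequence[str]) -> bool:
    """Relation: reordering independent options should not change the ranking."""
    return rank(question, list(options)) == rank(question, list(reversed(options)))


def paraphrase_invariant(rank: Ranker, question: str, paraphrase: str,
                         options: Sequence[str]) -> bool:
    """Relation: paraphrasing the question should not change the answer."""
    return rank(question, list(options)) == rank(paraphrase, list(options))


def stable_ranker(question, options):
    """Hypothetical well-behaved agent: the ranking ignores input order."""
    return sorted(options, key=lambda option: (len(option), option))


def position_biased_ranker(question, options):
    """Hypothetical flawed agent: echoes the options in the order it saw them."""
    return list(options)


options = ["cache", "index", "shard"]
print(reorder_invariant(stable_ranker, "Which fix first?", options))           # True
print(reorder_invariant(position_biased_ranker, "Which fix first?", options))  # False
```

Neither check knows which ranking is correct. Each only detects a behavior that cannot be right, which is the sense in which the assertion moves from the answer to the relation.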

So reliability for an agent is not a test you pass once at launch. It is a property you re-establish continuously, the way production agentic delivery treats every shipped artifact as something to gate rather than something to trust. The rest of this guide is how that loop is built, and why it has to keep running.

Reliability starts with your known failure modes

If you cannot assert correctness directly, you assert the absence of the specific ways this agent is known to fail. That inversion is the core move. You stop trying to prove the output is right and start proving it is not wrong in any of the ways you have already seen it go wrong. Which means the first artifact of a reliable agent is not a test suite. It is a failure-mode catalog.

The good news is that the failure modes are not infinite and they are not mysterious. A NeurIPS 2025 paper, Why Do Multi-Agent LLM Systems Fail?, built a taxonomy of them: human annotators analyzed 150 traces and agreed on the categories at a Cohen's kappa of 0.88, which matters because it means independent people sort failures into the same buckets rather than inventing their own. That work produced fourteen named failure modes under three structural causes, system design, inter-agent misalignment, and task verification, which the authors then scaled to a dataset of more than 1,600 traces across seven multi-agent frameworks. A separate step-level study, AgentProcessBench, labeled 8,509 individual agent steps and found that even the strongest model it tested judged those steps correctly only about eighty-two percent of the time, with a systematic bias toward calling a flawed step acceptable. The failure modes are common, they are categorizable, and even good models miss them when grading casually.
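For readers who have not met the statistic, Cohen's kappa is raw agreement corrected for the agreement two annotators would reach by chance: kappa = (p_observed - p_expected) / (1 - p_expected). A minimal sketch follows; the labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same category by chance.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels using the three structural causes as categories.
annotator_1 = ["design", "design", "misalignment", "verification"]
annotator_2 = ["design", "design", "misalignment", "verification"]
print(cohens_kappa(annotator_1, annotator_2))  # 1.0

# Agreement on half the items, exactly what chance predicts here.
print(cohens_kappa(["x", "x", "y", "y"], ["x", "y", "x", "y"]))  # 0.0
```

The second call is why the statistic matters: fifty percent raw agreement sounds like something, and kappa reports it as no better than chance.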

The discipline, then, is to catalog the failure modes that bite your agent, on your task distribution, and stand up a check for each one. You don't need a complete taxonomy to start. You start with the modes you have already watched the agent hit in production, and the catalog grows every time a new one surfaces. Tool misuse and hallucinated tool calls get a check. Context loss across a long run gets a check. Incorrect termination, the agent declaring victory early, gets a check. The catalog of tool use errors alone is worth its own layer, because a tool-calling agent that selects the wrong tool or fills the wrong argument fails silently and convincingly.
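A catalog of this kind can start as nothing more than a mapping from a named failure mode to a check over a recorded run. The sketch below is an assumption about shape, not a description of any particular pipeline: the `Trace` fields, the tool names, and both checks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """Hypothetical record of one agent run."""
    tool_calls: list
    declared_done: bool
    steps_completed: int
    steps_required: int


KNOWN_TOOLS = {"search", "read_file", "run_tests"}  # assumed tool registry


def hallucinated_tool_call(trace: Trace) -> bool:
    """Fires when the agent called a tool that does not exist."""
    return any(name not in KNOWN_TOOLS for name in trace.tool_calls)


def early_termination(trace: Trace) -> bool:
    """Fires when the agent declared victory before finishing the required steps."""
    return trace.declared_done and trace.steps_completed < trace.steps_required


# The catalog: one named failure mode, one check. It grows as new modes surface.
CATALOG: dict = {
    "hallucinated_tool_call": hallucinated_tool_call,
    "early_termination": early_termination,
}


def failures(trace: Trace) -> list:
    """Return the names of every cataloged failure mode this trace exhibits."""
    return [name for name, check in CATALOG.items() if check(trace)]


bad = Trace(["search", "deploy_prod"], declared_done=True,
            steps_completed=2, steps_required=3)
print(failures(bad))  # ['hallucinated_tool_call', 'early_termination']
```

Adding a newly observed mode is one function and one dictionary entry, which keeps the cost of growing the catalog low enough that it actually grows.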

This is the shape the authoring pipeline behind this site runs on, and I describe it at the level of the shape rather than the recipe. It indexes seven recurring failure modes of AI-assisted writing, each with a dedicated verification layer, so that "is this draft reliable" breaks into seven narrower questions that each have a concrete check. I can't prove that internal count to a reader outside a private repository, which is why the external studies above carry the load-bearing claims here; what the pipeline gives me is the shape. That is the same structure as the opus-compatibility-scanner, which is built around a catalog of named patterns rather than a general "is this config good" judgment, and it is what lets a finding be specific enough to act on. The failure-mode index is also where one of my own posts on what "critical" actually means in an AI code review lands: a severity label is only meaningful when it maps to a named failure mode, not when it is a vibe the model emitted. Reliability, in this framing, is the inverse of a catalog you can name; the modes you have not cataloged yet are exactly what the continuously re-running loop in the sections that follow is built to surface. The chain of checks that enforces the catalog is the AI authoring trust chain.

What does a deterministic validator do that a model judge can't?

Once you have a failure-mode catalog, the obvious next move is to let the model check itself. Ask a second model, or the same model in a reviewer role, to grade the output against the catalog. This is LLM-as-judge, and it is genuinely useful. The founding benchmark for it, Judging LLM-as-a-Judge, measured a strong model judge agreeing with human experts more than eighty percent of the time, about the same rate at which the human experts agreed with each other. A model judge is a genuine signal. It's not nothing.

But it is also not a floor, because the judge carries the same biases as the generator. The same founding work measured a judge preferring whichever answer was shown first and a judge preferring the longer answer regardless of quality, and it saw a judge appear to favor its own outputs, an effect the authors flagged as suggestive rather than proven. Those biases did not disappear when the field got more rigorous. A 2026 study of self-preference bias in rubric-based evaluation found that even with an objective rubric, a model was as much as fifty percent more likely to wrongly mark its own output as satisfying a criterion it had failed than to make the same error on another model's output, and on a medical benchmark that bias moved scores by as much as ten points, enough to reorder a comparison between frontier models. Ensembling several judges helped; it did not eliminate the problem. Worse, the judge's reliability is not stable across the inputs you care about most. A 2026 study tellingly titled A Coin Flip for Safety found that under the adversarial, out-of-distribution inputs that matter for robustness, judge performance degraded toward chance, validated against more than six thousand human-labeled examples.
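Position bias, at least, has a cheap probe: ask the judge twice with the answers swapped and accept only verdicts that survive the swap. The sketch below assumes a pairwise judge that returns "first" or "second"; both judges are hypothetical stand-ins built to show one bias each.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def order_consistent_verdict(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Return 'a' or 'b' only if the judge picks the same answer in both orders."""
    forward = judge(answer_a, answer_b)
    backward = judge(answer_b, answer_a)
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "inconsistent"


def first_position_judge(first: str, second: str) -> str:
    """Hypothetical judge with pure position bias."""
    return "first"


def length_judge(first: str, second: str) -> str:
    """Hypothetical judge that prefers the longer answer regardless of quality."""
    return "first" if len(first) >= len(second) else "second"


short, long_ = "Yes.", "Yes, and here are several paragraphs of padding."
print(order_consistent_verdict(first_position_judge, short, long_))  # inconsistent
print(order_consistent_verdict(length_judge, short, long_))          # b
```

The second result is the caution: the length-biased judge passes the swap test with a perfectly consistent verdict. A probe catches the bias it was written for and no other, which is part of why a judge cannot serve as the floor.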

Put those together and the conclusion is not "do not use model judges." It is that a calibrated rubric judge is a ceiling-raiser, not a floor. The floor has to be something that does not share the generator's failure modes at all: a deterministic validator. A script that checks whether the cited number appears in the cited source doesn't have a self-preference bias. A gate that fails closed when a required field is missing does not get talked out of it by a confident paragraph. Not every property reduces to a script, though: whether an answer is fair, whether a plan is safe, whether it solves the user's real problem are semantic calls with no deterministic check to anchor them, and there the model judge plus a human spot-check is the best floor on offer, which is a weaker floor and worth treating as one. The model judge and the deterministic gate catch different classes of error, and the reliable design uses the deterministic layer as the thing that cannot be argued with and the model layer as the lift on top. That division is exactly what I found measuring where the advisor model is blind: a transcript-scoped model reviewer is a quality lift but is structurally blind to the file system and structurally prone to agreeing with you, so it cannot be the gate. The same lesson shows up the moment teams let agents auto-fix their own pull requests: the green check stops meaning "the code is right" and starts meaning "the agent satisfied the checks," and if the checks are only other model judgments, the green is hollow.
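The two deterministic examples in that paragraph, a number that must appear in its cited source and a gate that fails closed on a missing field, fit in a few lines. This is a sketch under assumptions: the record schema and field names are hypothetical, and a real validator would also need to handle number formatting such as thousands separators or spelled-out figures.

```python
import re

REQUIRED_FIELDS = ("claim", "cited_number", "source_text")  # hypothetical schema


def validate_citation(record: dict) -> list:
    """Return a list of violations; an empty list means the gate passes."""
    # Fail closed: a missing or empty required field is a violation, not a skip.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        return [f"missing required field: {f}" for f in missing]
    number = str(record["cited_number"])
    # Match the number as a whole token so "50" does not match inside "150" or "50.5".
    pattern = rf"(?<![\d.]){re.escape(number)}(?!\d|\.\d)"
    if not re.search(pattern, record["source_text"]):
        return [f"cited number {number} not found in source"]
    return []


source = "Human annotators analyzed 150 traces and agreed at a kappa of 0.88."
print(validate_citation({"claim": "c", "cited_number": "150", "source_text": source}))  # []
print(validate_citation({"claim": "c", "cited_number": "50", "source_text": source}))
print(validate_citation({"claim": "c", "cited_number": "150"}))
```

Nothing in this function can be persuaded. It is also narrow: it confirms that the number exists in the source, not that the claim built on it is fair, which is the class of question left to the judge and the human spot-check.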

The strongest on-site proof of the division is the AI persona profiler, where two model analysts with deliberately different stances, one standard and one skeptical, are reconciled by a third agent and then scored against a twelve-point calibrated rubric that produced a fidelity result of 59 out of 60. The adversarial model structure raises the ceiling; the quantitative rubric and the reconciliation step are what keep a single confident analysis from being the last word. Model judgment backed by a deterministic floor is the pattern. Model judgment alone is the failure.

Why does the verification loop have to keep running?

Here is the part that the one-time-evaluation framing misses entirely, and it is the part I care about most: even a perfect evaluation rots the instant the system changes, and the system changes constantly. A new model version ships. A prompt gets edited. A tool's output format shifts. Any one of those can move behavior, and the move is rarely advertised.

The magnitude is larger than intuition suggests. A 2026 study, Beyond the Mean, measured what a model-version upgrade does at the level of individual items rather than aggregate score. On one upgrade, among the items it could analyze, about a third of the answers improved while more than a quarter got worse; on another, nearly half improved and close to forty percent deteriorated. The aggregate accuracy barely moved in either case, which is the trap: the headline number looked stable while a large share of the underlying answers had flipped in one direction or the other. And the same study found that a single-shot evaluation missed forty-two percent of the items that had genuinely changed and falsely flagged a quarter of the items that had not. A one-time eval would have told you everything was fine. The canonical demonstration of this is older and blunter: a 2023 study of how ChatGPT's behavior changed over time found one task where accuracy fell from eighty-four percent to fifty-one percent over three months on what was nominally the same service, and the authors' direct recommendation was continuous monitoring.
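The trap is easy to reproduce with toy numbers. The sketch below compares per-item correctness before and after an upgrade; the four items and their results are invented for illustration, and a real comparison would score each item over repeated runs, since the study above found that single-shot comparisons both miss real changes and flag spurious ones.

```python
def churn(before: dict, after: dict) -> dict:
    """Compare per-item correctness across two versions on their shared items."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("no shared items to compare")
    n = len(shared)
    return {
        "accuracy_before": sum(before[i] for i in shared) / n,
        "accuracy_after": sum(after[i] for i in shared) / n,
        "improved": [i for i in shared if not before[i] and after[i]],
        "regressed": [i for i in shared if before[i] and not after[i]],
    }


# Invented results: True means the item was answered correctly.
v1 = {"q1": True, "q2": True, "q3": False, "q4": False}
v2 = {"q1": True, "q2": False, "q3": True, "q4": False}

report = churn(v1, v2)
print(report["accuracy_before"], report["accuracy_after"])  # 0.5 0.5
print(report["improved"], report["regressed"])              # ['q3'] ['q2']
```

The aggregate is identical across versions while half the items changed state. Any user whose workload looks like q2 experienced a regression the dashboard never showed.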

This is why the verification loop is a loop and not a launch gate, and it is why the checks are built to fan out in parallel. When reliability is the inverse of a catalog of failure modes, each mode's check is a separate read over the finished output, which means they can all run at once, without waiting on each other, even where the underlying failures are causally linked. That parallel re-verification is an application of subagent orchestration: independent lanes, each isolated in its own context, dispatched together and reconciled at a gate. I have written the deep version of that mechanism in the companion guide on subagent orchestration in production, and the four orchestration patterns that cover most production workloads include the evaluator-optimizer shape that this loop is a case of. On this codebase the authoring pipeline re-runs its parallel validation wave whenever the draft, the model, or the knowledge base changes, in full on a model or knowledge-base change and scoped to the affected checks on a smaller edit, because a wave that ran once at the start would be measuring a system that no longer exists.
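Stripped of the subagents, the fan-out is a map over independent reads followed by a reconciliation step. The sketch below uses threads and two trivial text checks as stand-ins; the check names and the draft are hypothetical, and a real wave would dispatch isolated agents or scripts, not in-process functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Check = Callable[[str], list]  # returns findings; an empty list means clean


def no_todo_markers(draft: str) -> list:
    return ["unresolved TODO marker"] if "TODO" in draft else []


def has_sources_section(draft: str) -> list:
    return [] if "Sources" in draft else ["missing Sources section"]


CHECKS = {
    "no_todo_markers": no_todo_markers,
    "has_sources_section": has_sources_section,
}


def run_wave(draft: str, checks: dict) -> dict:
    """Dispatch every check against the same finished draft and collect findings."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(check, draft) for name, check in checks.items()}
        # result() re-raises any exception from a check, so a crashed check
        # stops the wave instead of silently counting as a pass.
        return {name: future.result() for name, future in futures.items()}


def gate(results: dict) -> bool:
    """Reconcile: the wave passes only if every lane came back clean."""
    return all(not findings for findings in results.values())


results = run_wave("Draft body. TODO: verify figure.\nSources: study A", CHECKS)
print(results)
print("pass" if gate(results) else "fail")  # fail
```

Because each check only reads the finished output, none of them needs another's result, and the gate is the single place where their findings meet.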

Running the loop continuously also forces a question the loop has to answer: when behavior changes, what changed? That is what agentic pipeline provenance is for, the per-artifact record of which model, which skill version, and which knowledge-base state produced an output, so a regression can be traced to its cause rather than guessed at. And the cost of running this loop is not a rounding error in a serious team's time. One widely cited practitioner account, Hamel Husain's evaluation FAQ, reports that on the projects his team has worked on, the majority of development time, on the order of sixty to eighty percent, goes to error analysis and evaluation rather than to the agent itself. That is not waste. That is what reliability costs, and the teams that skip it are the ones the academic literature keeps catching: a 2026 framework on the science of AI agent reliability found that recent capability gains have produced only small improvements in reliability, and that an agent able to solve a task is not the same as one that solves it consistently across repeated runs. Capability is what the agent can do on a good day. Reliability is what it does every day, and only a loop that keeps running can tell the difference.
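A provenance record of the kind described can be as small as three fields per artifact, with a diff that names what moved between two runs. This is a sketch of the idea only: the field names follow the sentence above, while the model identifiers, the version strings, and the fingerprinting scheme are assumptions.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical per-artifact record of what produced an output."""
    model: str
    skill_version: str
    kb_state: str  # fingerprint of the knowledge base at generation time


def kb_fingerprint(documents: list) -> str:
    """Order-independent short hash of the knowledge-base contents."""
    digest = hashlib.sha256()
    for doc in sorted(documents):
        digest.update(doc.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent documents cannot merge
    return digest.hexdigest()[:12]


def what_changed(old: Provenance, new: Provenance) -> list:
    """Name the fields that differ between two provenance records."""
    old_fields, new_fields = asdict(old), asdict(new)
    return [key for key in old_fields if old_fields[key] != new_fields[key]]


kb = kb_fingerprint(["style guide", "source roster"])
last_good = Provenance(model="model-a", skill_version="1.4", kb_state=kb)
regressed = Provenance(model="model-b", skill_version="1.4", kb_state=kb)
print(what_changed(last_good, regressed))  # ['model']
```

When a check that passed yesterday fails today, the diff between the two records is the first list of suspects, and an empty diff is itself a finding: the change came from somewhere the record does not yet track.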

When is a full verification loop overkill?

The honest answer is that the loop has a real cost and not every agent needs all of it, and a guide that pretended otherwise would be selling something. So here is the strongest version of the case against it.

The sharpest objection is that public benchmarks do not measure your system. A 2026 analysis, How Well Does Agent Development Reflect Real-World Work?, mapped 43 published benchmarks containing more than seventy-two thousand tasks against the actual distribution of economic work and found a heavy programming-centric skew that does not match where most agents are deployed. A compliance agent or a triage agent evaluated against a software-engineering benchmark is being graded on the wrong exam, and a green score there is close to meaningless for its real job. There is a second, related objection: continuous evaluation can manufacture false confidence, an always-green dashboard built on contaminated benchmarks and biased judges that masks real regressions rather than catching them.

Both objections are correct, and both are arguments for the loop rather than against it, because both point at the same fix: index your own failure modes on your own traces instead of borrowing someone else's exam. The thing that makes the steel-man bite, the gap between benchmark performance and real reliability, is exactly the gap that a failure-mode-indexed loop on your production traces is built to close. Your own traces carry their own blind spot, of course: they underrepresent the rare, adversarial, and not-yet-seen failures, which is why a serious loop is fed deliberately hard cases and red-team inputs, not just yesterday's production sample. That most teams measure the wrong thing is documented directly. A 2025 review of the measurement imbalance in agentic AI evaluation found that eighty-three percent of the evaluation studies it surveyed relied on technical metrics, and only fifteen percent combined technical and human dimensions. Most agentic AI, in other words, is graded on whether it ran, not on whether it worked.

So the loop scales to the stakes rather than running at full strength everywhere.

| The agent | What the loop should be |
| --- | --- |
| Short, bounded, low-stakes, where a wrong answer is cheap to catch and cheap to fix | A light check on the output. No failure-mode index, no parallel wave. |
| Multi-step, but with stable inputs and a tolerant failure cost | A small index of the failure modes you have already seen, re-run on change. Skip the heavier deterministic floor. |
| High-stakes, where a silent regression compounds before anyone notices, or facing frequent model and prompt changes | The full loop: a failure-mode index, a deterministic floor under model judgment, and continuous parallel re-verification on every change. |

That sizing is the same judgment any reliability investment requires, and it is the governance question I have written about for engineering managers governing agentic development: mandate the checks that have to hold, recommend the ones that usually should, and leave the rest discretionary. The honest accounting is that the loop is not free, and the table is how you decide where its cost is repaid by the failure it prevents. The mistake is not running too little loop on a toy. The mistake is running no loop on something that matters and calling the launch-day green check reliability.

How does this page stay current?

This cornerstone is a peer of Running Claude Code as a Production Engineering Practice, the parent cornerstone on this site, and a companion to the subagent orchestration guide whose fan-out mechanism this loop depends on. Its anchor is a first-party operational record that lives next to this body and is updated when a new failure mode is observed and given a layer, or when the parallel-wave shape changes.

The argument here is deliberately built on external evidence rather than on internal counts, because this site's repository is private and a number you cannot verify is not a number you should trust. The non-determinism studies, the failure-mode taxonomies, the LLM-as-judge bias measurements, and the version-drift churn data are the load-bearing claims; the practice on this codebase is the lived shape they imply. The Sources roster below tracks the freshness of each external anchor under the three-month cap for AI and SaaS findings and the six-month cap for tool-capability findings that govern this site's authority pages, and a source held past its cap is kept only with a documented search trail showing what was looked at and why nothing fresher qualified.

The loop that verifies the agents is itself a system that evolves, which is the recursive point of the whole guide: the same dual-cap freshness rule that keeps these sources current is the page-level instance of re-running verification as the world changes. When I scope a consultation, workshop, or implementation engagement around agentic development, building this verification loop for a team's own failure modes, on their own stack, is part of what I ship.