Agent evaluation

Agent evaluation is the practice of measuring whether an AI agent succeeds at a task by scoring its behavior, its tool calls, and its final outcome against explicit criteria, run as a repeatable harness rather than judged on a one-time impression.

How it works

Agent evaluation starts by naming what success means for a task, then running the agent across a set of cases and scoring each run against those criteria. The scoring can read the final answer, but for an agent it usually has to read more: the sequence of tool calls, the intermediate decisions, and the state the agent changed, because two runs can reach the same answer by very different paths and only one of them is trustworthy. Graders come in three shapes I combine: a deterministic check for what a rule can decide, a model asked to judge against a rubric for what a rule cannot, and a human for the cases that need one. The cases and the graders together form a harness I can rerun, so a change to the agent is measured against the same bar instead of judged fresh each time. The result is not a single pass or fail but a profile across cases and failure modes, which is what shows where the agent is weak.
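The harness described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Run`, `Case`, `grade_answer`, `grade_state`, `evaluate`, and `toy_agent` are all hypothetical, and only deterministic graders are shown. A model-judged or human grader would plug in behind the same signature, returning a failure-mode label or nothing.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Run:
    """What one agent run left behind: its path, its answer, the state it changed."""
    tool_calls: list
    final_answer: str
    state: dict = field(default_factory=dict)


@dataclass
class Case:
    """One task plus the success criteria committed to before running it."""
    name: str
    task: str
    expected_answer: str
    expected_state: dict = field(default_factory=dict)


# A grader returns None on a pass, or a short failure-mode label.
Grader = Callable[[Case, Run], Optional[str]]


def grade_answer(case: Case, run: Run) -> Optional[str]:
    return None if run.final_answer == case.expected_answer else "wrong_answer"


def grade_state(case: Case, run: Run) -> Optional[str]:
    changed = all(run.state.get(k) == v for k, v in case.expected_state.items())
    return None if changed else "state_not_changed"


def evaluate(agent: Callable[[str], Run], cases: list, graders: list) -> dict:
    """Run every case through every grader and return a profile, not one verdict."""
    failure_modes = Counter()
    passed = 0
    for case in cases:
        run = agent(case.task)
        labels = [label for g in graders if (label := g(case, run)) is not None]
        failure_modes.update(labels)
        passed += not labels
    return {"passed": passed, "total": len(cases), "failure_modes": failure_modes}


def toy_agent(task: str) -> Run:
    """Hypothetical agent that claims a refund without actually issuing it."""
    if task == "refund order 17":
        return Run(["lookup_order"], "refunded", state={})
    return Run(["lookup_order"], "shipped", state={})


cases = [
    Case("status", "status of order 12", expected_answer="shipped"),
    Case("refund", "refund order 17", expected_answer="refunded",
         expected_state={"order_17": "refunded"}),
]

report = evaluate(toy_agent, cases, [grade_answer, grade_state])
print(report)
```

In this toy run the refund case returns the expected answer but never changes the order's state, so an answer-only grader would pass it while the state grader records a `state_not_changed` failure.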

Why it matters

A probabilistic system that is right most of the time and wrong unpredictably cannot be signed off on the strength of a demo, because a demo samples the cases that already work. Evaluation replaces the question of whether the output looked right with the question of how often it passes criteria I committed to before I looked, which is the difference between a vibe and a number I can defend. The honest limit is that an evaluation is only as good as its cases and its graders: a harness blind to a failure mode reports green while that failure ships, so a passing eval is evidence about what it tested and silence about everything else. Evaluating an agent is harder than evaluating a single answer, because the thing under test acts over many steps, so the cost of building and maintaining the harness is real and has to be matched to the cost of the agent being wrong. Evaluation is the measurement layer above a verification loop: the loop decides whether one output is trustworthy in the moment, while evaluation supplies the pre-release evidence for whether the agent is ready to ship, alongside the runtime controls that catch what evaluation cannot.

In practice

Before changing the model behind an agent, I run the agent on both the current and the candidate model against a fixed set of representative tasks and score each run on whether it called the right tools in a sensible order and reached the correct outcome, not only on whether the final text reads well. The new model improves the answers on some cases and quietly regresses the tool sequence on others. A glance at the output would miss that regression; a trajectory score, which scores the path the agent took, catches it. Because the cases and graders are identical across both runs, the decision to adopt the change rests on a comparison rather than an impression.
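A comparison like this can be reduced to a per-case diff on two scoring dimensions. The sketch below assumes each run has already been scored; the `Score` type, the `compare` function, the case names, and the score values are all hypothetical and purely illustrative, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    outcome_ok: bool      # did the run reach the correct outcome?
    trajectory_ok: bool   # did it call the right tools in a sensible order?


def compare(baseline: dict, candidate: dict) -> dict:
    """Diff two scored runs over the same cases, per scoring dimension."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same cases")
    report = {
        "outcome_improved": [], "outcome_regressed": [],
        "trajectory_improved": [], "trajectory_regressed": [],
    }
    for case in sorted(baseline):
        for dim in ("outcome", "trajectory"):
            before = getattr(baseline[case], f"{dim}_ok")
            after = getattr(candidate[case], f"{dim}_ok")
            if after and not before:
                report[f"{dim}_improved"].append(case)
            elif before and not after:
                report[f"{dim}_regressed"].append(case)
    return report


# Illustrative scores only: the candidate fixes one answer and breaks one path.
baseline = {
    "book_flight": Score(outcome_ok=False, trajectory_ok=True),
    "cancel_order": Score(outcome_ok=True, trajectory_ok=True),
    "reset_password": Score(outcome_ok=True, trajectory_ok=True),
}
candidate = {
    "book_flight": Score(outcome_ok=True, trajectory_ok=True),
    "cancel_order": Score(outcome_ok=True, trajectory_ok=False),
    "reset_password": Score(outcome_ok=True, trajectory_ok=True),
}

diff = compare(baseline, candidate)
print(diff)
```

Refusing to compare runs with different case sets is deliberate: the comparison only means something when both sides were held to the same bar.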

Practical considerations

The first design decision is what the grader can see, since an agent eval that scores only the final answer is blind to a wrong path that happened to land on a right result. Trajectory scoring, comparing the agent's sequence of steps and tool calls against a reference or judging it for appropriateness, catches a class of regression that outcome-only scoring cannot, and the two are usually layered. Model-based graders are themselves probabilistic, so a rubric they apply needs its own spot checks, and a deterministic grader is preferable wherever the criterion can be written as a rule. The cases are both the asset and the liability: too few and the eval is anecdote, too stale and it measures last quarter's failure modes, so the case set is maintained rather than written once. Eval cost scales with how much of the trajectory is scored and how many cases run, so the rigor of the harness is matched to the cost of shipping a worse agent, not maximized by default.
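One simple form of trajectory scoring is an in-order match against a reference: every reference tool call must appear in the run, in the same order, with extra steps tolerated. The sketch below layers that check on top of an outcome check. The function names and tool names are hypothetical, and in-order matching is only one of several possible matching rules (strict equality or a judged appropriateness score are alternatives).

```python
def in_order(reference: list, actual: list) -> bool:
    """True if every reference step appears in `actual` in the same order.

    Extra steps in `actual` are allowed. Each `in` test consumes the shared
    iterator, so a later reference step can only match later in the run.
    """
    steps = iter(actual)
    return all(step in steps for step in reference)


def grade(reference_tools: list, expected_outcome: str,
          tools: list, outcome: str) -> dict:
    """Layer an outcome check and a trajectory check for one run."""
    return {
        "outcome": outcome == expected_outcome,
        "trajectory": in_order(reference_tools, tools),
    }


reference = ["search_inventory", "check_price", "place_order"]

# Sensible path with one harmless extra step.
good = grade(reference, "ordered",
             ["search_inventory", "check_price", "log_event", "place_order"],
             "ordered")

# Right result by the wrong path: ordered before checking anything.
lucky = grade(reference, "ordered",
              ["place_order", "search_inventory"],
              "ordered")

print(good)
print(lucky)
```

The second run is the case outcome-only scoring cannot see: the outcome check passes while the trajectory check fails.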

Related standards and prior art

Anthropic: Demystifying evals for AI agents (2026-01-09). Defines agent evaluation as tasks, trials, graders (code-based, model-based, human), traces, and outcomes, run as an eval harness.
LangSmith: trajectory evaluations (continuously updated). Independent framework documenting trajectory scoring: matching an agent run against a reference trajectory or judging the tool-call sequence and its appropriateness.

Defined by Ready Solutions AI
