LLM-as-a-judge is an evaluation pattern in which one language model scores or critiques another model's or agent's output against an explicit rubric, used to assess quality dimensions that a deterministic rule cannot decide.
How it works
I give the judge model the output to assess, usually alongside the criteria it should apply and sometimes a reference answer or the original input, and prompt it to return a score, a label, or a critique against that rubric. The pattern takes two common shapes: pointwise scoring, where the judge rates a single output against the rubric, and pairwise comparison, where it decides which of two outputs is better. Pairwise verdicts tend to be steadier than absolute scores. Because the judge is itself a probabilistic model, the rubric is the load-bearing part: vague criteria produce a confident but arbitrary verdict, while criteria that name what good and bad look like give the judge something to apply consistently. The judge also brings known biases to the task, favoring the answer in a particular position or the longer of two responses, and in some settings preferring its own model's output, so the design has to account for them rather than assume a neutral grader. Because of that self-preference, the judge is usually a different model than the one that produced the work, and its agreement with human reviewers is checked on a sample before its scores are trusted at scale.
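The two shapes can be sketched as prompt builders over a shared rubric. This is a minimal illustration, not a reference implementation: the `Rubric` fields, the function names, and the PASS/FAIL and A/B reply formats are all assumptions of mine, and the judge itself is left out, since any model client that takes a prompt string and returns text would fit.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric shape: one criterion plus what good and bad look like."""
    criterion: str
    good: str
    bad: str


def _rubric_lines(rubric: Rubric) -> list:
    return [
        f"Criterion: {rubric.criterion}",
        f"Good looks like: {rubric.good}",
        f"Bad looks like: {rubric.bad}",
    ]


def pointwise_prompt(rubric: Rubric, output: str, source: str = "") -> str:
    """Build a prompt asking the judge to rate a single output against the rubric."""
    parts = _rubric_lines(rubric)
    if source:
        parts.append(f"Source:\n{source}")
    parts.append(f"Output:\n{output}")
    parts.append("Reply with PASS or FAIL on the first line, then one sentence of reasoning.")
    return "\n\n".join(parts)


def pairwise_prompt(rubric: Rubric, response_a: str, response_b: str) -> str:
    """Build a prompt asking the judge which of two outputs better meets the rubric."""
    parts = _rubric_lines(rubric)
    parts.append(f"Response A:\n{response_a}")
    parts.append(f"Response B:\n{response_b}")
    parts.append("Reply with A or B on the first line, then one sentence of reasoning.")
    return "\n\n".join(parts)


def parse_label(reply: str, allowed: Iterable[str]) -> Optional[str]:
    """Return the first-line label if it is an allowed value, else None."""
    lines = reply.strip().splitlines()
    if not lines:
        return None
    label = lines[0].strip().upper()
    return label if label in set(allowed) else None
```

Returning `None` for an unparseable reply, instead of guessing a label, keeps a malformed verdict from being counted as a pass.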
Why it matters
Some of the dimensions that decide whether an agent's output is good cannot be written as a rule: whether an answer is faithful to its source, whether a summary is helpful, whether a tone is right. A deterministic check is silent on all of them, which is the gap a model judge fills, and that reach is the reason the pattern spread. The trade-off is that the judge is probabilistic in the same way as the work it grades, so a judge that cannot see a class of error passes it as confidently as a correct answer, and a green score from a weak judge is a false signal rather than a safe one. This is why I treat a judge as a layer above a deterministic floor rather than a replacement for it: the rule catches what it can articulate cheaply and identically, and the judge is reserved for the nuance the rule cannot reach. The discipline that makes a judge trustworthy is calibration, checking its verdicts against human judgment on a sample before relying on them, because an uncalibrated judge measures its own biases as much as the output. The honest framing is that a judge moves a quality question from unmeasurable to approximately measurable, not from unmeasurable to solved.
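The layering described above, a deterministic floor with the judge on top, can be sketched as follows. The names `evaluate`, `not_empty`, and `max_words` are hypothetical, and the judge is assumed to be any callable that returns a pass flag with a reason; the point of the sketch is only the ordering, where the judge is never called once a rule has already failed.

```python
from typing import Callable, List, Optional, Tuple

# A rule returns a failure reason, or None when the output passes it.
Rule = Callable[[str], Optional[str]]
# Hypothetical judge interface: returns (passed, reason) for an output.
JudgeCheck = Callable[[str], Tuple[bool, str]]


def not_empty(output: str) -> Optional[str]:
    return "output is empty" if not output.strip() else None


def max_words(limit: int) -> Rule:
    def check(output: str) -> Optional[str]:
        count = len(output.split())
        return f"{count} words exceeds limit of {limit}" if count > limit else None
    return check


def evaluate(output: str, rules: List[Rule], judge: JudgeCheck) -> Tuple[bool, str]:
    """Run the cheap deterministic rules first; call the judge only if all pass."""
    for rule in rules:
        reason = rule(output)
        if reason is not None:
            return False, f"rule: {reason}"  # the judge call is skipped entirely
    passed, reason = judge(output)
    return passed, f"judge: {reason}"
```

Prefixing the reason with its origin makes it clear afterwards whether a verdict came from a rule, which is reproducible, or from the judge, which is not.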
In practice
A pipeline drafts a customer-facing summary, and no rule can decide whether the summary stays faithful to the source document it condenses. Instead of trusting the draft, I have a second model read the source and the summary together and judge faithfulness against a short rubric, returning a pass with a reason or a flag for the specific sentence that drifted. The judge is a different model than the one that wrote the summary, and before I rely on its verdicts I check a sample of them against my own reading to confirm it agrees with a human on the cases that matter. What ships is not the draft the first model produced; it is the draft that survived a check the first model did not author, run by a grader whose agreement with human judgment I confirmed first.
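The sample check in this example can be as simple as comparing the judge's pass/fail verdicts with my own on the same items. The sketch below is one way to summarize that comparison, under my own assumptions: the function name and the two reported figures are illustrative, and no acceptance threshold is implied. The false-pass rate is reported separately because a judge that passes what a human would fail is the failure that matters most here.

```python
from typing import Dict, Optional, Sequence, Tuple


def calibration_report(pairs: Sequence[Tuple[bool, bool]]) -> Dict[str, Optional[float]]:
    """Summarize judge-versus-human agreement.

    Each pair is (judge_passed, human_passed) for the same sampled item.
    """
    if not pairs:
        raise ValueError("need at least one labeled sample")
    agreements = sum(1 for judge, human in pairs if judge == human)
    # Judge verdicts on the items the human reviewer failed.
    judge_on_human_fails = [judge for judge, human in pairs if not human]
    false_passes = sum(1 for judge in judge_on_human_fails if judge)
    return {
        "n": float(len(pairs)),
        "agreement": agreements / len(pairs),
        # None when the sample contains no human-failed items to measure against.
        "false_pass_rate": (
            false_passes / len(judge_on_human_fails) if judge_on_human_fails else None
        ),
    }
```

A sample with no human-failed items says nothing about false passes, which is why that figure comes back as `None` instead of zero.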
Practical considerations
Pairwise comparison is usually steadier than absolute scoring, so when the question is which of two outputs is better I prefer it to asking for a number the judge has no calibrated scale for. The known biases each have a standard mitigation: swap the order of two candidates and keep only verdicts that survive the swap, instruct the judge to weigh substance over length, and avoid having a model grade its own output. The judge model has to be capable enough to hold the rubric and the material at once, so a judge chosen only because it is cheaper than the model under test often grades worse than it appears to. Every judged check is an extra model call, so judging every step of a long run adds latency and cost that have to be weighed against the cost of shipping a worse output. The rubric is worth versioning and spot-checking like any other asset, because a judge applying a stale or ambiguous rubric drifts quietly. Where a criterion can be expressed as a rule at all, a deterministic check is cheaper to trust than a judge, so the judge is best reserved for the dimensions that genuinely cannot be reduced to a rule.
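The order-swap mitigation can be sketched like this. The judge interface is an assumption: a callable that is shown two responses in order and answers "first" or "second". The function name is hypothetical as well. A verdict is kept only when the same candidate wins in both orders, and anything else, including a malformed reply, is treated as no verdict.

```python
from typing import Callable, Optional

# Hypothetical judge interface: given the response shown first and the one
# shown second, returns "first" or "second" for whichever it judges better.
PairJudge = Callable[[str, str], str]


def swap_consistent_verdict(judge: PairJudge, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" if the same candidate wins in both orders, else None."""
    forward = judge(a, b)  # a is shown first
    reverse = judge(b, a)  # b is shown first
    valid = {"first", "second"}
    if forward not in valid or reverse not in valid:
        return None
    winner_forward = "a" if forward == "first" else "b"
    winner_reverse = "b" if reverse == "first" else "a"
    return winner_forward if winner_forward == winner_reverse else None
```

This doubles the model calls for each comparison, which is the cost side of the trade-off described above.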
Related standards and prior art
- Zheng et al.: Judging LLM-as-a-Judge (NeurIPS 2023) · 2023-12-24 · the paper that named the LLM-as-a-judge pattern, established position and verbosity bias, examined self-enhancement bias, and framed judge-versus-human agreement
- Anthropic: build evaluations (LLM-based grading) · continuously updated · vendor eval docs treating LLM-based grading as one of three grading methods, with rubric-design guidance and the use-a-different-judge-model best practice
- Anthropic: demystifying evals for AI agents · 2026-01-09 · names model-based graders as flexible and scalable but non-deterministic, requiring calibration against human judgment
- Evaluating Scoring Bias in LLM-as-a-Judge (DASFAA 2026) · 2025-06-27 · independent peer-reviewed study identifying scoring-prompt biases (rubric order, score id, reference answer) beyond the 2023 originator
Defined by Ready Solutions AI