A canary baseline is a fixed set of prompts or workflows re-run on a schedule against pinned expectations, so that drift in model or agent behavior, whether from serving changes, safety interventions, prompt edits, or migrations, surfaces as a visible comparison delta instead of hiding inside individually success-shaped responses.

How it works

The pattern borrows release-engineering canarying, where a canary and a control are compared on a metric and a large enough delta triggers investigation, and runs it over time instead: the control is the pinned baseline, the expected outputs or scores of a fixed set at a known model version and configuration, and the canary is today's re-run. Agent evaluation practice supplies the feed: capability evals that reach high pass rates graduate into regression suites run continuously to catch drift. Measurement research backs the mechanism, using repeated runs of fixed inputs to quantify baseline behavioral drift even where nothing in the serving path admits to changing. A delta names the symptom, never the cause: a serving-stack change, a vendor safety screen, a quiet default shift, and your own prompt edit all produce the same signal, so the canary's job is to convert invisible change into a triage prompt. Set composition is aimed rather than uniform: critical workflows get covered first, plus any topic family the vendor's own documentation flags as subject to intervention. Pinning is what makes the comparison interpretable, the model id, the prompts, the sampling settings, and the scorer all held fixed, so that a moved number is a candidate behavior change rather than a harness change, confirmed by repeat runs before it is treated as drift.

Why it matters

Refusals, fallbacks, and errors are instrumentable, and dashboards built on them are blind by construction to the failure classes that arrive success-shaped: silent degradation, slow serving drift, and regressions on a supposedly pinned model. For those classes detection has to move from the response to the trend, and a scheduled baseline is the controlled caller-side mechanism for it, alongside sampled review and live quality monitors for the failures no fixed set anticipates. The baseline also protects evaluation itself, provided some cases stay held out of tuning and the scorer is versioned, because an eval suite quietly measuring a steered or drifted variant of its model degrades without any record that the comparison moved. Around migrations it earns its keep twice, providing the before-and-after comparison that turns a model swap from an impression into a measurement. The honest limits are scope and attribution: a canary only covers what the fixed set exercises, and it cannot prove what caused a delta, only that one exists and where.

In practice

A team pins a model version for a compliance assistant and re-runs a fixed prompt set every morning against stored expectations. After an upstream serving change, answers in one topic family shift while every error dashboard stays flat. The canary diff flags the family the same morning, and because an approved fallback route already exists, the team reroutes the affected workflows while triaging cause against vendor documentation and their own change log. Without the baseline, the drift had no signal to arrive on except a user complaint.

Practical considerations

Pin everything that defines the baseline, the model id, the prompts, the sampling settings, the tool and retrieval versions, and the scorer, so a delta points at behavior rather than at the harness. Standing one up is mostly naming local decisions before the first run: the fixed set's size and coverage rule, where expectations and run history live, what the scorer is, what delta triggers triage, who owns the cadence, and when a re-baseline is allowed. Schedule by exposure: daily on the routes where drift is expensive, and on demand on both sides of any migration or prompt change. Treat a delta as a candidate rather than a verdict, because unrelated causes share the symptom: a single red day usually means rerun and inspect, while a persistent step change or a credible trend is evidence. Refresh the set as workloads change, since an aging canary measures yesterday's risk, and keep held-back cases out of every tuning loop, with expectation changes gated behind an explicit re-baseline event, so the baseline can't be optimized against. Seed topics from the model's safety documentation, past incidents, support tickets, and high-exposure routes, since disclosed policy areas are plausible intervention points rather than a complete map.

Related standards and prior art

Defined by Ready Solutions AI