baseline harness

Know when your agents regress

TraceGrade continuously runs standardized cohorts against your AI agents. Same scenarios, every deploy. Drift surfaces before users notice.

$ tracegrade run --cohort baseline-v3 --model sonnet-4

PASS  task_completion      0.94  (baseline: 0.92)
PASS  tool_correctness     0.97  (baseline: 0.95)
DRIFT hallucination_rate   0.08  (baseline: 0.03) +167%
PASS  latency_p95          2.1s  (baseline: 2.4s)
PASS  cost_per_task        $0.12 (baseline: $0.14)

cohort: 48/50 scenarios passed | 1 regression detected
report: https://tracegrade.polsia.app/runs/r-4821

Cohort-based quality measurement

Define scenarios once. Run them on every model change, routing update, or infrastructure deploy. TraceGrade builds a quality history so you see exactly when and why things changed.

Scenario Cohorts

Group standardized test scenarios that represent your production workload. Same inputs, measured outputs, every time.
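
To make this concrete, here is a minimal sketch of a cohort as plain data: fixed inputs, stable scenario names, and the dimensions to score. The Python shape and field names are illustrative, not TraceGrade's documented scenario format.

from dataclasses import dataclass, field

# Illustrative only: a cohort is a named set of scenarios whose inputs
# never change between runs, so scores stay comparable over time.
@dataclass(frozen=True)
class Scenario:
    name: str      # stable identifier, e.g. "refund_multi_step"
    prompt: str    # the exact input replayed on every run
    dimensions: tuple[str, ...] = ("task_completion", "tool_correctness")

@dataclass
class Cohort:
    name: str
    scenarios: list[Scenario] = field(default_factory=list)

baseline_v3 = Cohort("baseline-v3", [
    Scenario("refund_multi_step", "Customer wants a refund on order #8123"),
    Scenario("kb_lookup", "What is the SLA for enterprise support?"),
])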

Drift Detection

Statistical comparison against your quality baseline. Know if a model swap or routing change degraded any dimension.
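
How does a +167% jump get separated from noise? One standard approach, sketched below and not necessarily TraceGrade's exact statistic, is a two-proportion z-test. It assumes a rate like hallucination_rate is measured over many individually judged outputs per run; the n=500 sample size is an assumption for illustration.

from math import sqrt

def drifted(base_rate: float, curr_rate: float, n: int, z_crit: float = 2.58) -> bool:
    # Two-proportion z-test: flag drift only when the gap between the
    # current run and the baseline exceeds sampling noise. z_crit of 2.58
    # (~99% confidence) keeps the false positive rate low.
    pooled = (base_rate + curr_rate) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return se > 0 and abs(curr_rate - base_rate) / se > z_crit

# The run above: hallucination_rate 0.03 -> 0.08 over n=500 judged outputs
print(drifted(0.03, 0.08, n=500))   # True -> reported as DRIFT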

Multi-Model Routing

Track quality per route. Opus vs Sonnet vs Haiku, each with its own baseline and regression thresholds.
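
A minimal sketch of what per-route baselines could look like, assuming a single "higher is better" dimension per route for brevity. All names and numbers are illustrative, not real thresholds.

# Illustrative per-route baselines: each route carries its own score
# and its own allowed drop before a change counts as a regression.
ROUTE_BASELINES = {
    "opus":   {"task_completion": (0.95, 0.02)},   # (baseline, allowed drop)
    "sonnet": {"task_completion": (0.92, 0.03)},
    "haiku":  {"task_completion": (0.87, 0.05)},
}

def regressions(route: str, measured: dict[str, float]) -> list[str]:
    # Return every dimension that fell below its route-specific floor.
    return [
        dim for dim, (baseline, drop) in ROUTE_BASELINES[route].items()
        if measured.get(dim, baseline) < baseline - drop
    ]

print(regressions("haiku", {"task_completion": 0.80}))   # ['task_completion']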

Continuous Runs

Run cohorts on every deploy, on a daily schedule, or whenever a model version bumps. Quality history builds automatically.
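
For example, a post-deploy hook can replay the command shown above and gate the rollout on the result. This sketch assumes, though the page does not state it, that tracegrade run exits nonzero when a regression is detected.

import subprocess
import sys

# Post-deploy gate: replay the baseline cohort against the freshly
# deployed model and block rollout on a failed run. Assumes a nonzero
# exit code on regression.
result = subprocess.run(
    ["tracegrade", "run", "--cohort", "baseline-v3", "--model", "sonnet-4"]
)
if result.returncode != 0:
    sys.exit("cohort regressed: halting rollout")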

50+ evaluation dimensions
<5min cohort run time
0.01% false positive rate

Quality is a moving target.
Baselines keep you honest.

Every model update, every routing change, every infrastructure tweak shifts your agent's behavior. TraceGrade measures the shift so you ship with confidence.