Benchmarking & Evaluation Design

You can build the most elegant agent architecture in the world, but without rigorous evaluation you are flying blind. Observability tells you what happened on a given run. Evaluation tells you whether your agent is getting better or worse over time — and whether a change you are about to ship will help or hurt. We will cover the mechanics of building custom benchmarks, running evaluations with statistical rigor, avoiding the trap where optimizing a metric destroys the thing you actually care about, and treating evaluation as a product in its own right.

Why Agent Evaluation Is Uniquely Hard

Evaluating a classifier is straightforward: you have labeled data, you compute precision and recall, you are done. Agents break this simplicity in several ways.

First, non-determinism. The same task run twice may produce different tool-call sequences, different intermediate reasoning, and a differently worded final answer — all of which could be correct. You cannot compare outputs with string equality. You need semantic judges, and those judges introduce their own noise.

Second, multi-dimensional correctness. A "correct" agent run involves far more than the final answer. Did it use the right tools? Did it stay within its cost budget? Did it avoid unsafe actions? Did it finish in a reasonable number of steps? A single pass/fail score collapses all of this into one bit when you really need a vector.

Third, distribution shift. Agents operate on real-world inputs that drift over time. A benchmark built from last quarter's queries may not predict performance on next quarter's queries. Your evaluation must evolve alongside the workload.

Fourth, evaluation cost. Running a full end-to-end agent evaluation means burning model tokens — potentially thousands of them per test case. You cannot run exhaustive benchmarks on every commit the way you run unit tests. Budget forces you to be strategic about what you evaluate, when, and how often.

Building Custom Benchmarks

Off-the-shelf benchmarks (HumanEval, MMLU, SWE-bench) measure general model capability, not your agent's capability on your tasks. Custom benchmarks are non-negotiable for production agents. Here is how to build one.

Anatomy of a Benchmark Case

Each benchmark case needs five fields:

from dataclasses import dataclass, field
from typing import Any


@dataclass
class BenchmarkCase:
    # The task input — what the user would say or what triggers the agent
    task: str

    # The ground truth or acceptance criteria
    expected: dict[str, Any]

    # Context the agent should have access to (documents, DB state, etc.)
    context: dict[str, Any] = field(default_factory=dict)

    # Metadata for slicing results
    tags: list[str] = field(default_factory=list)

    # Difficulty tier for stratified analysis
    difficulty: str = "medium"

The expected field deserves attention. For some tasks it is a concrete answer ("revenue was $4.2M"). For others it is a set of constraints: the agent must call a specific tool, must not call a dangerous one, must produce output containing certain facts. Design your expected field as a contract, not a literal string.
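To make the "contract" idea concrete, here is a minimal sketch of a deterministic check against a constraint-style expected dict. The key names (`must_call`, `must_not_call`, `must_contain`) and the `check_contract` helper are assumptions made for illustration, not a fixed schema; the tool names are hypothetical too.

```python
from typing import Any


def check_contract(
    tool_calls: list[str], output: str, expected: dict[str, Any]
) -> dict[str, bool]:
    """Check one agent run against a constraint-style `expected` dict.

    The keys read here are illustrative assumptions, not a standard.
    """
    called = set(tool_calls)
    required = set(expected.get("must_call", []))
    forbidden = set(expected.get("must_not_call", []))
    facts = expected.get("must_contain", [])
    text = output.lower()
    return {
        # Every required tool appears somewhere in the trajectory
        "required_tools_called": required <= called,
        # No forbidden tool was ever invoked
        "no_forbidden_tools": called.isdisjoint(forbidden),
        # Each required fact appears in the output (case-insensitive)
        "contains_facts": all(f.lower() in text for f in facts),
    }


expected = {
    "must_call": ["query_revenue"],
    "must_not_call": ["delete_records"],
    "must_contain": ["$4.2M"],
}
checks = check_contract(
    tool_calls=["query_revenue", "format_report"],
    output="Q3 revenue was $4.2M, up from Q2.",
    expected=expected,
)
passed = all(checks.values())
```

Returning one boolean per constraint, rather than a single pass/fail, keeps the failure reason visible when a case breaks.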

Dataset Construction

The hardest part of benchmarking is building a representative dataset. There are three reliable sources:

Production traces. Sample real user requests from your observability system. Filter for ones where you have high-confidence ground truth — either because a human reviewed the output, or because the task has a verifiable answer. This is your most valuable source because it reflects the actual distribution.

Expert-authored cases. Have domain experts write tasks that exercise specific capabilities: edge cases, multi-step reasoning, ambiguous queries. These complement production sampling by covering scenarios that are important but rare.

Synthetic generation. Use a strong model to generate task variations from seed examples. This scales volume but introduces a subtle bias: the synthetic tasks will cluster around what the generating model finds easy to articulate. Use synthetic data to pad coverage, not as your primary source.

A practical benchmark for an agent typically needs 100-500 cases to be statistically meaningful. More on the statistics below.

Stratification and Tagging

Raw pass rates are deceptive. An agent that scores 85% might be perfect on simple tasks and terrible on hard ones — or flawless at retrieval but broken at multi-step planning. You need to slice results by meaningful dimensions:

def analyze_results(
    results: list[dict], stratify_by: str = "difficulty"
) -> dict[str, float]:
    """Compute pass rates per stratum."""
    buckets: dict[str, list[bool]] = {}
    for r in results:
        key = r.get(stratify_by, "unknown")
        buckets.setdefault(key, []).append(r["passed"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

Tag cases with dimensions that matter to you: difficulty, capability type (retrieval, reasoning, tool-use, generation), domain (finance, support, engineering), input length, expected step count. When a benchmark run fails, slicing tells you where to look.
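Slicing by a single-valued field such as difficulty is straightforward, but tags are a list, and a list cannot serve as a dictionary key. One possible approach, sketched below, counts each case once under every tag it carries. The `pass_rates_by_tag` name and the "untagged" bucket are assumptions for illustration.

```python
from collections import defaultdict


def pass_rates_by_tag(results: list[dict]) -> dict[str, tuple[float, int]]:
    """Return {tag: (pass_rate, case_count)}; a case counts once per tag."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        # Cases with no tags land in a catch-all bucket instead of vanishing
        for tag in r.get("tags") or ["untagged"]:
            buckets[tag].append(bool(r["passed"]))
    return {t: (sum(v) / len(v), len(v)) for t, v in buckets.items()}


results = [
    {"passed": True, "tags": ["retrieval", "finance"]},
    {"passed": False, "tags": ["tool-use", "finance"]},
    {"passed": True, "tags": ["retrieval"]},
    {"passed": True, "tags": []},
]
rates = pass_rates_by_tag(results)
```

Reporting the case count next to each rate matters: a 0% pass rate on one case is a much weaker signal than 0% on fifty.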

Statistical Rigor in Evaluation

Agent evaluations are noisy. A model that passes 80% of cases on one run might pass 76% or 84% on the next. If you treat every fluctuation as meaningful, you will chase ghosts. If you ignore all fluctuation, you will miss real regressions. Statistical rigor is the discipline of knowing the difference.

Confidence Intervals

Never report a point estimate without a confidence interval. If your agent passes 82 of 100 cases, the 95% confidence interval for the true pass rate is roughly 73%-88% (using a Wilson interval). That range is wide enough that a "regression" from 82% to 78% is probably noise.

import math


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> tuple[float, float]:
    """95% confidence interval for a proportion (Wilson score)."""
    if trials == 0:
        return (0.0, 0.0)
    p_hat = successes / trials
    denominator = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denominator
    spread = z * math.sqrt(
        (p_hat * (1 - p_hat) + z * z / (4 * trials)) / trials
    ) / denominator
    return (max(0.0, center - spread), min(1.0, center + spread))

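If the confidence intervals of two runs overlap substantially, you do not have evidence of a real difference. Interval overlap is only a rough screen, though: intervals that barely overlap can still hide a significant difference, so test borderline cases directly. Do not deploy on the strength of a difference you cannot distinguish from noise.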
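When both runs use the same benchmark cases, a paired comparison is usually more sensitive than eyeballing two intervals. The sketch below uses an exact McNemar test, which is a technique brought in here for illustration rather than something the pipeline above prescribes. It looks only at the cases where the two runs disagree; `mcnemar_exact_p` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact_p(baseline: list[bool], candidate: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    if len(baseline) != len(candidate):
        raise ValueError("runs must cover the same cases in the same order")
    # Cases that passed before and fail now
    regressions = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    # Cases that failed before and pass now
    fixes = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    n = regressions + fixes
    if n == 0:
        return 1.0  # the runs agree on every case
    k = min(regressions, fixes)
    # Under the null, each disagreement is a fair coin flip
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


baseline = [True] * 8 + [True] * 80 + [False] * 12
candidate = [False] * 8 + [True] * 80 + [False] * 12
p_value = mcnemar_exact_p(baseline, candidate)
```

In this toy data the candidate loses eight cases and gains none, so the p-value is small even though the overall pass rates (88% vs 80%) would produce overlapping intervals.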

Sample Size and Power

How many benchmark cases do you need? It depends on the effect size you care about. To detect a 5-percentage-point regression (say, from 85% to 80%) with 80% power at the 5% significance level, you need roughly 400-500 cases, treating the 85% baseline as a known value. For a 10-point swing, about 100 cases suffice. Comparing two runs that are both noisy estimates takes roughly twice as many cases.
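These figures can be reproduced with a standard normal-approximation formula. The sketch below assumes a one-sample, two-sided test in which the baseline pass rate is treated as known; `cases_needed` is a hypothetical helper, and a real comparison between two noisy runs would need more cases than it reports.

```python
import math
from statistics import NormalDist


def cases_needed(
    p_baseline: float,
    p_regressed: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate cases needed to detect a drop from p_baseline to p_regressed.

    One-sample, two-sided normal approximation: the baseline rate is
    treated as known rather than estimated from another noisy run.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sd_baseline = math.sqrt(p_baseline * (1 - p_baseline))
    sd_regressed = math.sqrt(p_regressed * (1 - p_regressed))
    delta = abs(p_baseline - p_regressed)
    return math.ceil(
        ((z_alpha * sd_baseline + z_beta * sd_regressed) / delta) ** 2
    )


five_point = cases_needed(0.85, 0.80)  # 5-point regression
ten_point = cases_needed(0.85, 0.75)   # 10-point regression
```

Under these assumptions the 5-point case lands in the 400-500 range and the 10-point case a little over 100. Halving the detectable effect roughly quadruples the required cases.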

Handling Non-Determinism

Because agents are stochastic, a single run per case is insufficient. Run each case k times (typically k = 3-5) and use the majority vote or mean score as the case-level result. This smooths out random variation and gives you a more stable signal.

from statistics import mean, pvariance


async def evaluate_case_with_repeats(
    case: BenchmarkCase, agent_fn, judge_fn, k: int = 3
) -> dict:
    """Run one case k times and aggregate the judge scores."""
    scores = []
    for _ in range(k):
        # Each repeat is a fresh agent run, judged independently
        trajectory = await agent_fn(case.task, case.context)
        score = await judge_fn(trajectory, case.expected)
        scores.append(score)
    mean_score = mean(scores)
    return {
        "task": case.task,
        "mean_score": mean_score,
        "passed": mean_score >= 0.5,
        # Population variance of the k scores (0.0 when every run agrees)
        "variance": pvariance(scores),
        "tags": case.tags,
        "difficulty": case.difficulty,
    }

High variance on a single case is itself a signal — it means the agent's behavior is fragile on that input. Track per-case variance as a first-class metric.
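One simple way to act on this is to rank cases by their per-case variance after each run and review the top few. A minimal sketch, assuming each result dict carries the "task" and "variance" keys produced above; `flakiest_cases` and its thresholds are hypothetical.

```python
def flakiest_cases(
    results: list[dict], top_n: int = 5, min_variance: float = 0.0
) -> list[dict]:
    """Return up to top_n cases with variance above min_variance, worst first."""
    unstable = [r for r in results if r["variance"] > min_variance]
    return sorted(unstable, key=lambda r: r["variance"], reverse=True)[:top_n]


results = [
    {"task": "summarize contract", "variance": 0.00},
    {"task": "reconcile invoices", "variance": 0.22},
    {"task": "plan migration", "variance": 0.05},
]
worst = flakiest_cases(results, top_n=2)
```

Cases that keep reappearing in this list across runs are good candidates for a closer look at the agent's trajectory, or at the judge's consistency on that input.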

LLM-as-Judge: Automated Grading

For tasks without deterministic answers, you need a judge — a separate model that evaluates whether the agent's output meets the acceptance criteria. This is the dominant evaluation pattern for production agents.

Designing a Judge Prompt

A good judge prompt has three parts: the grading rubric, the agent's output, and the reference answer or constraints. Keep the rubric specific and structured to reduce judge noise.

JUDGE_PROMPT = """You are evaluating an AI agent's response.

## Task
{task}

## Agent Output
{agent_output}

## Reference Answer
{reference}

## Grading Rubric
Score each dimension from 0 to 1:
- correctness: Does the output contain the right facts?
- completeness: Does it address all parts of the task?
- groundedness: Is every claim supported by the provided context?
- safety: Does it avoid harmful, biased, or inappropriate content?

Return a JSON object with these four scores and a brief justification.
"""

Judge Calibration

LLM judges are not perfect. They have biases: they tend to prefer longer responses, struggle with numerical precision, and may disagree with human raters on subjective quality. You must calibrate your judge.

Calibration means running the judge on a set of cases where you already have human ratings and measuring agreement. Track two numbers: Cohen's kappa (inter-rater agreement) and mean absolute error (for numerical scores). A kappa above 0.7 is good; below 0.5 means your judge is unreliable and you need to revise the rubric.

When the judge disagrees with humans, examine the failure cases. Often the fix is a more specific rubric — "correctness" is vague; "the output must contain the exact dollar amount from the source document" is actionable.
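Both calibration numbers are short to compute by hand. The sketch below assumes pass/fail labels (or any categorical labels) for kappa and 0-to-1 scores for mean absolute error; the function names and sample data are illustrative.

```python
from collections import Counter
from statistics import mean


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, given each rater's label frequencies
    expected = sum(
        human_counts[label] * judge_counts[label]
        for label in set(human) | set(judge)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_absolute_error(human: list[float], judge: list[float]) -> float:
    """Average absolute gap between human and judge numeric scores."""
    return mean(abs(h - j) for h, j in zip(human, judge))


human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(human_labels, judge_labels)
mae = mean_absolute_error([0.8, 0.5], [0.6, 0.9])
```

Here the raters agree on 6 of 8 cases, but half of that agreement is expected by chance, so kappa comes out at 0.5. This is why raw agreement alone overstates judge quality.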

Cost Management for Judges

Judge calls are expensive — you are making an extra model call for every evaluation. Two strategies help:

Tiered judging. Use a cheap model for easy pass/fail decisions (did the agent call the right tool?) and a strong model only for nuanced quality judgments (is this summary faithful to the source?).

Cached judgments. If the same agent output appears across runs (because the task and context are identical and the agent produced the same trajectory), cache the judgment. Invalidate the cache when you change the judge prompt or model.
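The invalidation rule falls out naturally if the judge prompt and judge model are part of the cache key. The following is a minimal in-memory sketch; `judgment_cache_key`, `JudgmentCache`, and the model and prompt strings are hypothetical, and it is synchronous for brevity even though a real judge call would likely be async.

```python
import hashlib
import json
from typing import Any, Callable


def judgment_cache_key(
    agent_output: str,
    expected: dict[str, Any],
    judge_model: str,
    judge_prompt: str,
) -> str:
    """Hash everything that can change the judgment, including the judge itself."""
    payload = json.dumps(
        {
            "output": agent_output,
            "expected": expected,
            "judge_model": judge_model,
            "judge_prompt": judge_prompt,
        },
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgmentCache:
    """In-memory cache; a real system would likely persist this."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0

    def get_or_compute(self, key: str, compute: Callable[[], dict]) -> dict:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self._store[key] = compute()  # only pay for the judge on a miss
        return self._store[key]


key_v1 = judgment_cache_key(
    "Revenue was $4.2M", {"answer": "$4.2M"}, "judge-model-a", "rubric v1"
)
key_v2 = judgment_cache_key(
    "Revenue was $4.2M", {"answer": "$4.2M"}, "judge-model-a", "rubric v2"
)
```

Because the rubric text is hashed into the key, editing the prompt yields new keys and old judgments are simply never looked up again, with no explicit invalidation step.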

Goodhart's Law and Metric Gaming

"When a measure becomes a target, it ceases to be a good measure." This is Goodhart's law, and it is the single biggest threat to agent evaluation. Here is how it manifests.

You define a benchmark, optimize your agent against it, and watch scores climb. After a few iterations, scores are high — but real-world performance has not improved, or has even degraded. What happened? The agent (or your prompt engineering) learned to exploit patterns in the benchmark that do not generalize.

Common Failure Modes

Overfitting to phrasing. If all your benchmark tasks use similar sentence structures, the agent learns to pattern-match on phrasing rather than genuinely understanding the task. When real users ask the same question differently, performance drops.

Narrow tool coverage. If your benchmark exercises only three of ten available tools, you have no signal on the other seven. The agent may appear competent while being broken on untested paths.

Judge hacking. If you optimize against an LLM judge, the agent may learn to produce outputs that look good to the judge (verbose, well-structured, confident) without actually being correct. The judge's biases become exploitable attack surface.

Difficulty collapse. Over successive iterations, the easy cases dominate your pass rate, masking poor performance on hard ones. You celebrate 92% overall while difficult multi-step tasks sit at 40%.

Mitigations

Rotate benchmarks. Maintain a held-out evaluation set that you never optimize against directly. Run it periodically as a sanity check. If your optimized-set score rises while the held-out score stagnates, you are overfitting.

Stratified reporting. Always report scores per difficulty tier and capability category. Never look at aggregate numbers alone.

Adversarial additions. Continuously add new cases designed to break the current agent. When you fix a production failure, add it to the benchmark as a regression test. Your benchmark should grow over time.

Multiple judges. Use two or more judge models and flag cases where they disagree. Disagreements indicate either a genuinely ambiguous case or a judge bias that needs correction.

Human spot-checks. No automated evaluation fully replaces human review. Sample 5-10% of evaluated cases weekly for human audit. Track judge-human agreement over time.
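Of these mitigations, the multiple-judge check is the easiest to automate. A minimal sketch, assuming each judge produces one 0-to-1 score per case ID; `flag_disagreements`, the judge names, and the 0.3 gap threshold are illustrative choices, not recommendations.

```python
def flag_disagreements(
    scores_by_judge: dict[str, dict[str, float]], max_gap: float = 0.3
) -> list[str]:
    """Return case IDs where the judges' scores differ by more than max_gap.

    scores_by_judge maps judge name -> {case_id: score}.
    """
    judges = list(scores_by_judge.values())
    if not judges:
        return []
    # Only compare cases that every judge actually scored
    shared = set.intersection(*(set(j) for j in judges))
    flagged = []
    for case_id in sorted(shared):
        values = [j[case_id] for j in judges]
        if max(values) - min(values) > max_gap:
            flagged.append(case_id)
    return flagged


scores = {
    "judge_a": {"c1": 0.90, "c2": 0.80, "c3": 0.20},
    "judge_b": {"c1": 0.85, "c2": 0.30, "c3": 0.25},
}
needs_review = flag_disagreements(scores)
```

The flagged cases make a natural input to the weekly human audit, since they are the ones where an automated score is least trustworthy.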

Evaluation as a Product

Evaluation is a product — something you build, maintain, version, and iterate on, with its own lifecycle and stakeholders. Teams that treat evaluation as an afterthought end up with stale benchmarks that tell them nothing useful.

The Evaluation Pipeline

A mature evaluation system looks like this:

┌────────────┐      ┌─────────────┐     ┌────────────┐      ┌────────────┐
│  Benchmark │────▶ │  Agent Run  │────▶│   Judge    │────▶ │  Analysis  │
│  Dataset   │      │  (k repeats)│     │  (scoring) │      │  & Report  │
└────────────┘      └─────────────┘     └────────────┘      └────────────┘
       │                                                         │
       ▼                                                         ▼
┌────────────┐                                          ┌────────────────┐
│  Version   │                                          │  Dashboard &   │
│  Control   │                                          │  Alerts        │
└────────────┘                                          └────────────────┘

The benchmark dataset is versioned (in git or a dedicated artifact store). Every evaluation run records which dataset version, which agent version, and which judge version were used. Without this lineage, you cannot reproduce past results or understand what changed.
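A lineage record can be as small as one immutable row per evaluation run. The sketch below is one possible shape; the `EvalRunRecord` class, its field names, and the version strings are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRunRecord:
    dataset_version: str  # e.g. a git tag or artifact hash for the benchmark
    agent_version: str    # e.g. the commit of the agent under test
    judge_version: str    # judge model plus rubric revision
    passed: int
    total: int
    started_at: str       # ISO 8601 timestamp, UTC


record = EvalRunRecord(
    dataset_version="bench-v12",
    agent_version="agent-3f9c2e1",
    judge_version="judge-rubric-v4",
    passed=82,
    total=100,
    started_at=datetime.now(timezone.utc).isoformat(),
)

# One JSON line per run is easy to append to a log and to diff later
line = json.dumps(asdict(record), sort_keys=True)
restored = EvalRunRecord(**json.loads(line))
```

Storing raw counts rather than a precomputed pass rate lets you recompute confidence intervals later without re-running anything.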

When to Run Evaluations

Different evaluation tiers run at different cadences:

| Tier                  | What it tests                   | When to run                         | Typical cost          |
|-----------------------|---------------------------------|-------------------------------------|-----------------------|
| Component tests       | Tool logic, parsers, guardrails | Every commit (CI)                   | Free (no model calls) |
| Single-turn evals     | Tool selection, output quality  | Nightly or pre-merge                | $1-10 per suite       |
| Full trajectory evals | End-to-end task completion      | Before deploy, weekly               | $50-500 per suite     |
| Held-out benchmark    | Generalization check            | Monthly (never optimize against it) | $50-200 per run       |
| Human audit           | Judge calibration, edge cases   | Weekly sample                       | Human time            |

Versioning Benchmarks

Benchmarks drift. New capabilities require new test cases. Old cases become irrelevant as the product changes. Treat your benchmark dataset the same way you treat code:

  • Commit messages explain why cases were added or removed
  • Changelogs track dataset versions alongside agent versions
  • Deprecation of stale cases is explicit, not silent deletion
  • Provenance records whether a case came from production, expert authoring, or synthesis

When you upgrade your benchmark, re-run previous agent versions against the new dataset. This lets you distinguish "the agent got better" from "the benchmark got easier."

Regression Testing at Scale

Every production failure is a future benchmark case. When a user reports that the agent gave a wrong answer, did something unsafe, or got stuck in a loop, capture the minimal reproduction:

from dataclasses import dataclass
from typing import Any


@dataclass
class RegressionCase:
    source: str  # "production_incident_2024_0847"
    task: str
    context: dict[str, Any]
    expected: dict[str, Any]
    failure_mode: str  # "hallucinated_tool_args", "infinite_loop", etc.
    added_date: str
    resolved_date: str | None = None

Run regression cases on every deploy candidate. A case is "resolved" only when the agent passes it consistently (multiple runs, majority pass). Never remove a regression case — even if the underlying bug was fixed, the case continues to guard against recurrence.
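The "resolved" rule can be written down as a small predicate. This sketch assumes a strict majority over a minimum number of runs; `is_resolved` and the default of three runs are illustrative choices.

```python
def is_resolved(run_outcomes: list[bool], min_runs: int = 3) -> bool:
    """True only if enough runs were made and a strict majority passed."""
    if len(run_outcomes) < min_runs:
        return False  # a single lucky pass does not count as a fix
    return sum(run_outcomes) > len(run_outcomes) / 2


resolved = is_resolved([True, True, False])
too_few = is_resolved([True])
tied = is_resolved([True, False, True, False])
```

A tie is treated as unresolved here. For safety-critical failure modes you may prefer the stricter rule that every run must pass.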

Conclusion

Benchmarking and evaluation for agents is a core engineering discipline that shapes every decision you make about your agent. Build custom benchmarks from production traces, expert scenarios, and synthetic variations. Apply statistical rigor: use confidence intervals, run multiple repeats, and size your dataset to detect the regressions you care about. Stay vigilant against Goodhart's law by rotating benchmarks, stratifying results, and maintaining held-out sets. Calibrate your LLM judges against human ratings and track agreement over time. Treat your evaluation pipeline as a product with versioning, lineage, and dedicated maintenance. The teams that invest here compound their advantage — each iteration is grounded in evidence, not intuition.