Why are AI benchmark scores misleading?

Because a benchmark is a proxy, and the moment a proxy becomes a target it stops measuring what you care about. Models can be tuned to the test, training data can be contaminated with benchmark questions, and a score on generic problems rarely predicts performance on your specific, messy, real-world task. A better score is not a better system.

What is the most reliable way to test an AI?

Climb the ladder of evidence: an anecdote is weaker than a benchmark, a benchmark is weaker than a controlled test, and the strongest evidence is adversarial testing plus measured behavior in real production conditions. Pair every result with a control that rules out luck or artifacts, and report honestly what the evidence does and does not show.

Does a higher accuracy number mean a better AI?

Not by itself. Accuracy can rise for reasons that have nothing to do with real capability — an easier test set, a lucky run, a leaked answer key, or a metric that rewards the wrong thing. Without controls and adversarial cases, a higher number can hide a worse system. What matters is whether the measurement is honest and reproducible.

Field note · Reliable AI

How Do You Know What an AI Can Actually Do?

Q: How do you evaluate an AI system?

You measure it against the work it will actually do, not a generic leaderboard — using tasks drawn from your real use case, adversarial cases designed to break it, and controls that prove the result isn't an artifact. The goal isn't to produce a high number; it's to produce a number that is true about your system. A single benchmark score, taken alone, tells you almost nothing about whether the AI will hold up in production.

Published June 28, 2026 · Empire Publishing

Short answer: You test it against the work it will really do — not a leaderboard — with adversarial cases and controls that prove the result is real. The trap to avoid fits in one line: a better score is not a better system. The hard part isn't producing a number; it's producing a number that's actually true about your AI.

The number that lets you ship

Most AI evaluation is quietly designed to produce a number that lets you ship — a green checkmark, a benchmark you topped, a demo that worked on stage. That's the opposite of finding out what's true. Real evaluation is adversarial toward your own system: you're trying to discover where it breaks before a customer does, not assemble evidence that it's fine.

Why benchmarks mislead

A benchmark is a proxy for capability, and proxies rot the moment they become targets — that's Goodhart's law. Three things go wrong:

Teaching to the test. A model tuned to score well on a benchmark gets better at the benchmark, not necessarily at the job.
Contamination. If the test questions leaked into training data, the score measures memorization, not reasoning.
Wrong task. A score on generic problems rarely predicts performance on your specific, messy, real-world workflow.
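One rough way to look for contamination is to check whether long word sequences from your test questions also appear in text the model may have trained on. The sketch below is only an illustration under stated assumptions: `ngrams` and `contamination_rate` are hypothetical helper names, the 5-word window is an arbitrary choice, and it presumes you have some sample of the training text to search, which is often not the case. Exact overlap also misses paraphrased leaks, so a low rate here is weak reassurance, not a clean bill of health.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_items: list[str], corpus_docs: list[str], n: int = 5) -> float:
    """Fraction of test items that share at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items) if test_items else 0.0


# Toy data, invented for illustration only.
test_items = [
    "what is the capital city of france today",
    "name the largest planet in our solar system",
]
corpus_docs = ["trivia dump: what is the capital city of france today answer paris"]

rate = contamination_rate(test_items, corpus_docs, n=5)
print(f"{rate:.0%} of test items overlap with the corpus sample")
```

Items that get flagged are worth removing or rewriting before you trust a score built on them.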

The ladder of evidence

Not all evidence is equal. Rank it, and know which rung you're standing on:

Anecdote — "it worked when I tried it." The weakest rung; useful for hypotheses, useless for decisions.
Benchmark — a standard test. Better, but a proxy, and gameable.
Controlled test — your task, with a control that rules out luck and artifacts. Now you're measuring something.
Adversarial test — cases built to break it. This is where real weaknesses surface.
Production behavior — measured performance under real load and real inputs. The strongest evidence there is.
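The step from a controlled test to an adversarial one can be sketched as running the same cases twice: once as written, and once after a perturbation designed to break the system. Everything named here is an assumption for illustration: `toy_predict` stands in for whatever model call you actually use, `add_typo` is one simple perturbation among many possible ones, and the two cases are invented.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (input text, expected label)


def accuracy(predict: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Share of cases where the prediction equals the expected label."""
    if not cases:
        return 0.0
    return sum(predict(text) == label for text, label in cases) / len(cases)


def robustness_gap(
    predict: Callable[[str], str],
    cases: Sequence[Case],
    perturb: Callable[[str], str],
) -> tuple[float, float]:
    """Return (accuracy on clean inputs, accuracy on perturbed inputs)."""
    clean = accuracy(predict, cases)
    perturbed = accuracy(predict, [(perturb(text), label) for text, label in cases])
    return clean, perturbed


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for a real model: a brittle keyword matcher."""
    return "refund" if "refund" in text else "other"


def add_typo(text: str) -> str:
    """Hypothetical perturbation: a typo a real customer might make."""
    return text.replace("refund", "refnud")


cases = [("please refund my order", "refund"), ("where is my package", "other")]
clean, perturbed = robustness_gap(toy_predict, cases, add_typo)
print(f"clean={clean:.2f} perturbed={perturbed:.2f}")
```

A large drop between the two numbers is the kind of weakness that the clean score alone would have hidden.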

The discipline: controls and honest reporting

Two habits separate a trustworthy result from a flattering one. First, a control: a comparison that shows your number isn't an accident. Shuffle the labels and confirm the score falls to chance, run the baseline, and check whether a trivial method scores the same. Second, honest reporting: state plainly what the evidence shows and what it doesn't. The Empire mantra for this is "above chance, not an oracle." A team that reports its own ceilings and failures is one whose numbers you can actually trust, and that honesty is, counterintuitively, the most convincing result of all.
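Two of these controls are easy to sketch in code: a trivial baseline that always predicts the most common label, and a shuffled-label check that estimates how often chance alone would match your score. This is a minimal sketch, not a prescribed procedure. The function names, the 1,000 shuffles, and the toy predictions are all assumptions made for illustration.

```python
import random
from collections import Counter


def accuracy(preds: list[str], labels: list[str]) -> float:
    """Share of predictions that equal the corresponding label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def majority_baseline(labels: list[str]) -> float:
    """Accuracy of a 'dumb method' that always predicts the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def shuffled_label_scores(preds: list[str], labels: list[str],
                          trials: int = 1000, seed: int = 0) -> list[float]:
    """Accuracy of the same predictions against randomly shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(accuracy(preds, shuffled))
    return scores


def p_value(observed: float, null_scores: list[float]) -> float:
    """Estimated chance that shuffled labels score at least as well as observed."""
    hits = sum(score >= observed for score in null_scores)
    return (hits + 1) / (len(null_scores) + 1)


# Toy data, invented for illustration: 10 cases, one wrong prediction.
labels = ["a"] * 7 + ["b"] * 3
preds = ["b"] + ["a"] * 6 + ["b"] * 3

observed = accuracy(preds, labels)
baseline = majority_baseline(labels)
p = p_value(observed, shuffled_label_scores(preds, labels))
print(f"observed={observed:.2f} baseline={baseline:.2f} p={p:.3f}")
```

If the observed score barely beats the baseline, or the shuffled labels match it often, the number is not yet evidence of capability. Reporting the baseline next to the headline score is one concrete form of the honest reporting described above.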

Frequently asked

How do you evaluate an AI system?

Test it on the work it will really do, with adversarial cases and controls that prove the result is real — aiming for a true number, not a high one.

Why are benchmark scores misleading?

A benchmark is a proxy, and proxies get gamed, contaminated, or mismatched to your task. A better score is not a better system.

Does higher accuracy mean a better AI?

Not by itself — accuracy can rise from an easier test, a leak, or a lucky run. Without controls, a higher number can hide a worse system.

Go deeper

The field manual behind this note

This is the short version. The full discipline, covering the ladder of evidence, controls, adversarial evaluation, and honest reporting that turn a score into trust, is Evaluating AI Systems, the third book in the Empire Publishing Reliable AI trilogy (releasing soon). Already live and closely related: The Glass Box, on reading what a model you own actually knows from the inside.

