Field note · Reliable AI

How Do You Know What an AI Can Actually Do?

Published June 28, 2026 · Empire Publishing

Short answer: You test it against the work it will really do — not a leaderboard — with adversarial cases and controls that prove the result is real. The trap to avoid fits in one line: a better score is not a better system. The hard part isn't producing a number; it's producing a number that's actually true about your AI.

The number that lets you ship

Most AI evaluation is quietly designed to produce a number that lets you ship — a green checkmark, a benchmark you topped, a demo that worked on stage. That's the opposite of finding out what's true. Real evaluation is adversarial toward your own system: you're trying to discover where it breaks before a customer does, not assemble evidence that it's fine.

Why benchmarks mislead

A benchmark is a proxy for capability, and proxies rot the moment they become targets — that's Goodhart's law. Three things go wrong:

  • Teaching to the test. A model tuned to score well on a benchmark gets better at the benchmark, not necessarily at the job.
  • Contamination. If the test questions leaked into training data, the score measures memorization, not reasoning.
  • Wrong task. A score on generic problems rarely predicts performance on your specific, messy, real-world workflow.
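
The contamination check can be sketched in a few lines, assuming you have access to at least a sample of the training text. The function names, the whitespace tokenizer, and the 8-token window are illustrative assumptions, not a standard. The sketch only catches verbatim overlap; a paraphrased leak would pass unnoticed.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text.

    Texts shorter than n tokens yield an empty set and are never flagged.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: list[str], training_docs: list[str], n: int = 8):
    """Flag test items that share any verbatim n-gram with the training text.

    Returns (fraction of items flagged, list of flagged items).
    The window size n is an assumption: smaller values flag more items,
    including innocent common phrases.
    """
    if not test_items:
        raise ValueError("test_items must not be empty")

    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items), flagged
```

A non-zero rate does not prove the score is memorized, but it does mean the flagged items should be scored separately from the clean ones before anyone quotes the headline number.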

The ladder of evidence

Not all evidence is equal. Rank it, and know which rung you're standing on:

  1. Anecdote — "it worked when I tried it." The weakest rung; useful for hypotheses, useless for decisions.
  2. Benchmark — a standard test. Better, but a proxy, and gameable.
  3. Controlled test — your task, with a control that rules out luck and artifacts. Now you're measuring something.
  4. Adversarial test — cases built to break it. This is where real weaknesses surface.
  5. Production behavior — measured performance under real load and real inputs. The strongest evidence there is.
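
As a rough illustration of rung 4, the sketch below takes cases the system already gets right and checks whether the answer survives cheap rewrites that should not change its meaning. Everything here is hypothetical: `model` stands for any callable that maps a prompt to an answer, and the four perturbations are placeholders for the domain-specific attacks a real suite would use.

```python
from typing import Callable, Iterable


def perturbations(text: str) -> list[str]:
    """Cheap rewrites that should not change the correct answer (illustrative set)."""
    return [
        text.upper(),             # shouting
        text.lower(),             # all lowercase
        "  " + text + "  ",       # stray whitespace
        text.replace(",", " ,"),  # odd punctuation spacing
    ]


def robustness_report(model: Callable[[str], str],
                      cases: Iterable[tuple[str, str]]) -> dict:
    """Compare accuracy on clean inputs with accuracy that survives perturbation.

    A case counts as robust only if the model is right on the original
    prompt and on every perturbed version of it.
    """
    total = clean_correct = robust_correct = 0
    failures = []
    for prompt, expected in cases:
        total += 1
        if model(prompt) != expected:
            continue  # already wrong on the clean input
        clean_correct += 1
        broken = [p for p in perturbations(prompt) if model(p) != expected]
        if broken:
            failures.append((prompt, broken))
        else:
            robust_correct += 1
    if total == 0:
        raise ValueError("cases must not be empty")
    return {
        "clean_accuracy": clean_correct / total,
        "robust_accuracy": robust_correct / total,
        "failures": failures,
    }
```

The gap between `clean_accuracy` and `robust_accuracy` is the part of the score that a benchmark alone would not have shown, and the `failures` list is the starting point for the next round of harder cases.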

The discipline: controls and honest reporting

Two habits separate a trustworthy result from a flattering one. First, a control: a comparison that proves your number isn't an accident. Shuffle the labels and confirm the score collapses to chance, run the baseline, and check whether a dumb method scores the same. Second, honest reporting: state plainly what the evidence shows and what it doesn't. The Empire mantra for this is "above chance, not an oracle." A team that reports its own ceilings and failures is one whose numbers you can actually trust, and that honesty is, counterintuitively, the most convincing result of all.
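
Two of those controls fit in a short sketch, under stated assumptions: the task is classification with exact-match scoring, and "shuffle the labels" is read as a permutation test on the evaluation set (one of several reasonable readings). The function names and the default of 2000 trials are illustrative choices.

```python
import random
from collections import Counter


def accuracy(preds: list, labels: list) -> float:
    """Fraction of predictions that exactly match their labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def majority_baseline_accuracy(labels: list) -> float:
    """Score of the dumbest method: always predict the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def shuffled_label_p_value(preds: list, labels: list,
                           trials: int = 2000, seed: int = 0) -> float:
    """Permutation test: how often do shuffled labels score at least as well?

    Shuffling breaks any real link between predictions and labels, so the
    shuffled scores show what chance alone produces on this dataset.
    A small return value means the observed accuracy is unlikely to be luck.
    """
    rng = random.Random(seed)
    observed = accuracy(preds, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if accuracy(preds, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

If the system's accuracy is close to `majority_baseline_accuracy`, or the permutation test returns a large value, the honest report is that the number does not yet show the system has learned the task.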

Frequently asked

How do you evaluate an AI system?

Test it on the work it will really do, with adversarial cases and controls that prove the result is real — aiming for a true number, not a high one.

Why are benchmark scores misleading?

A benchmark is a proxy, and proxies get gamed, contaminated, or mismatched to your task. A better score is not a better system.

Does higher accuracy mean a better AI?

Not by itself — accuracy can rise from an easier test, a leak, or a lucky run. Without controls, a higher number can hide a worse system.

Go deeper

The field manual behind this note

This is the short version. The full discipline (the ladder of evidence, controls, adversarial evaluation, and honest reporting that turn a score into trust) is covered in Evaluating AI Systems, the third book in the Empire Publishing Reliable AI trilogy, which is releasing soon. Already live and closely related: The Glass Box, on reading what a model you own actually knows from the inside.
