How do you know what an AI can actually do?
Field note · Reliable AI
Published June 28, 2026 · Empire Publishing
Short answer: You test it against the work it will really do — not a leaderboard — with adversarial cases and controls that prove the result is real. The trap to avoid fits in one line: a better score is not a better system. The hard part isn't producing a number; it's producing a number that's actually true about your AI.
Most AI evaluation is quietly designed to produce a number that lets you ship — a green checkmark, a benchmark you topped, a demo that worked on stage. That's the opposite of finding out what's true. Real evaluation is adversarial toward your own system: you're trying to discover where it breaks before a customer does, not assemble evidence that it's fine.
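A benchmark is a proxy for capability, and proxies rot the moment they become targets; that's Goodhart's law. Three things go wrong: the benchmark gets gamed, its test data leaks into training (contamination), or it stops matching the task you actually care about.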
Not all evidence is equal. Rank it, and know which rung you're standing on:
Two habits separate a trustworthy result from a flattering one. First, a control: a comparison that proves your number isn't an accident — shuffle the labels, run the baseline, check whether a dumb method scores the same. Second, honest reporting: state plainly what the evidence shows and what it doesn't. The Empire mantra for this is "above chance, not an oracle." A team that reports its own ceilings and failures is one whose numbers you can actually trust — and that honesty is, counterintuitively, the most convincing result of all.
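The controls named above (shuffle the labels, run the baseline, check a dumb method) can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function names, the toy spam/ham data, and the choice of 1,000 shuffles are all assumptions made for the example.

```python
import random
from collections import Counter


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of the 'dumb method': always predict the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def shuffled_label_control(preds, labels, n_shuffles=1000, seed=0):
    """Compare the observed accuracy against accuracy on shuffled labels.

    Returns (observed, mean_shuffled, p_value), where p_value is the share
    of shuffles that score at least as well as the real labels
    (with the usual +1 smoothing so it is never exactly zero).
    """
    rng = random.Random(seed)
    observed = accuracy(preds, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        score = accuracy(preds, shuffled)
        total += score
        if score >= observed:
            at_least_as_good += 1
    p_value = (at_least_as_good + 1) / (n_shuffles + 1)
    return observed, total / n_shuffles, p_value


# Hypothetical toy data: an imbalanced test set and a model that
# always answers "spam".
labels = ["spam"] * 8 + ["ham"] * 2
preds = ["spam"] * 10

observed, mean_shuffled, p_value = shuffled_label_control(preds, labels)
baseline = majority_baseline_accuracy(labels)
print(observed, baseline, mean_shuffled, p_value)
```

In this toy case the model scores 0.8, which sounds respectable until the controls run: the majority baseline also scores 0.8, and shuffling the labels changes nothing. The number is real, but it says nothing about capability.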
Test it on the work it will really do, with adversarial cases and controls that prove the result is real — aiming for a true number, not a high one.
A benchmark is a proxy, and proxies get gamed, contaminated, or mismatched to your task. A better score is not a better system.
Not by itself — accuracy can rise from an easier test, a leak, or a lucky run. Without controls, a higher number can hide a worse system.
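One way to check the "lucky run" case is a paired bootstrap over per-example results. The sketch below assumes both systems were scored on the same test items and that you kept a 0/1 correctness record for each item; the function name, the 2,000 resamples, and the 95% interval are illustrative choices, not a standard the source prescribes.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate how noisy the accuracy gap between two systems is.

    correct_a / correct_b: lists of 0/1 flags, one per test item,
    in the same item order for both systems.
    Returns (observed_diff, lower, upper): accuracy of B minus accuracy
    of A, with an approximate 95% percentile interval.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = (sum(correct_b) - sum(correct_a)) / n
    return observed, lower, upper


# Hypothetical records: the two systems disagree on every item,
# but neither is ahead overall.
system_a = [1, 0] * 10
system_b = [0, 1] * 10
print(paired_bootstrap_diff(system_a, system_b))
```

If the interval comfortably includes zero, the higher score is not yet evidence of a better system. This only addresses run-to-run noise; it does nothing about an easier test or a leak, which need their own checks.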
Go deeper
This is the short version. The full discipline (the ladder of evidence, controls, adversarial evaluation, and honest reporting that turn a score into trust) is Evaluating AI Systems, the third book in the Empire Publishing Reliable AI trilogy (releasing soon). Already live and closely related: The Glass Box, on reading what a model you own actually knows from the inside.