
Data for Large Language Models: Collecting, Cleaning, and Scaling the Fuel of AI

Miles Thornton

Book 5#5

Rating: 4.8 (2.4k reviews)

Pages: 642

Language: en

Published: 2026

New edition

Price: $4.99

Book introduction

Nearly every major advance in large language models over the past five years, from GPT-4 to LLaMA 3, was driven less by a cleverer architecture than by a better dataset. The Chinchilla scaling results showed that many large models of the time were trained on far too few tokens relative to their parameter count, and since then the race has shifted from compute-centric to data-centric AI. Yet the engineering required to collect, clean, and scale the trillion-token corpora that power these models has remained largely undocumented. Until now.
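To make the Chinchilla point concrete, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter and the usual approximation of about 6 FLOPs per parameter per token. Both are approximations from the public scaling-law literature, not figures taken from this book, and the function names are hypothetical.

```python
# Assumption: ~20 tokens per parameter is a rule of thumb, not an exact law.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params):
    """Rough compute-optimal token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def tokens_per_param(n_params, n_tokens):
    """How many training tokens a model saw per parameter."""
    return n_tokens / n_params


if __name__ == "__main__":
    # Publicly reported figures: GPT-3 (175B params, ~300B tokens)
    # and Chinchilla (70B params, ~1.4T tokens).
    for name, params, tokens in [
        ("GPT-3", 175_000_000_000, 300_000_000_000),
        ("Chinchilla", 70_000_000_000, 1_400_000_000_000),
    ]:
        ratio = tokens_per_param(params, tokens)
        target = compute_optimal_tokens(params)
        print(f"{name}: {ratio:.1f} tokens/param, rule-of-thumb target {target:.2e} tokens")
```

Under these assumptions, a 175B-parameter model trained on about 300B tokens sits well below the rule-of-thumb ratio, which is the sense in which such models are described as undertrained.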

"Data for Large Language Models" is the first comprehensive engineering guide to building the data pipelines behind state-of-the-art LLMs. Written for practitioners who need to move beyond toy datasets, this book walks you through the entire lifecycle: from distributed web crawling and Common Crawl processing, through algorithmic deduplication and toxicity filtering, to tokenization, domain balancing, and synthetic data generation. Each chapter is structured as a system design document, presenting the engineering challenge, comparing algorithmic tradeoffs, and concluding with production-grade recommendations.

  • Learn how to build polite, high-throughput web crawlers that respect robots.txt and scale to billions of pages.
  • Master MinHash-based near-deduplication and semantic embedding techniques that eliminate duplicates at petabyte scale.
  • Design tokenizer evaluation frameworks to optimize vocabulary size, fertility, and downstream performance for multilingual corpora.
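As an illustration of the MinHash idea named in the list above, the following is a minimal single-machine sketch using only the Python standard library. It is not the book's implementation: the shingle size, signature length, and function names are assumptions chosen for readability, and a production pipeline would add locality-sensitive hashing and a distributed runtime.

```python
import hashlib
import re

NUM_HASHES = 64  # signature length; an illustrative choice, not a recommendation


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle, salted with a seed to simulate a hash family."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(seed, s) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "the quick brown fox jumps over the lazy dog near the river shore"
    print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. The threshold itself is a tuning decision that this sketch leaves open.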

This book also covers the emerging field of synthetic data—instruction tuning, chain-of-thought reasoning, and the risks of model collapse—and concludes with the infrastructure needed to store, version, and stream data directly into GPU training clusters. Over 150,000 words of technical depth, grounded in real-world datasets like Common Crawl, RefinedWeb, and The Stack, with no filler and no marketing hype.

Who should read this book? Data engineers, ML engineers, and AI researchers who want to understand why data quality yields higher ROI than scaling parameters. It assumes basic Python and ML knowledge, but no prior experience with web crawling or distributed systems. The book is designed to be a practical reference: you can jump to any chapter, implement the pattern, and see immediate improvements in your training data quality.

If you are responsible for the data that feeds a large language model—whether at a startup, research lab, or big tech company—this book is the missing manual. It will change how you think about the fuel that powers modern AI.

Quick summary

This book teaches how to build high-throughput web crawlers that respect robots.txt and scale to billions of pages.

It covers MinHash-based near-deduplication for petabyte-scale datasets.

It provides frameworks for evaluating tokenizer performance on downstream tasks.

It discusses the risks of model collapse when using synthetic data for training.

This book is a good fit for data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.

Readers often come to this book when they need to understand and implement best practices for building large-scale training data pipelines for large language models.

The book's angle: The first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.

Main topics include web crawling, data cleaning, deduplication, tokenization, domain balancing, and multilingual corpora.

AI Search information

AI summary: This book provides a comprehensive engineering guide to building the data pipelines behind state-of-the-art large language models. It covers the entire lifecycle from web crawling and data cleaning to tokenization, domain balancing, and synthetic data generation, with a focus on system design and algorithmic tradeoffs. Written for data engineers and ML researchers, it bridges the gap between model architecture and data preparation.

Best for
Data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.
Reader persona
A data engineer or ML practitioner who wants to move beyond toy datasets and learn production-grade techniques for collecting, cleaning, and scaling the data that trains modern LLMs.
Search intent
To understand and implement best practices for building large-scale training data pipelines for large language models.
Unique angle
The first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.
Content type
technical engineering guide

Key topics: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora, Synthetic data generation, Data storage and versioning, LLM datasets, Data-centric AI

Entities: Common Crawl, MinHash, Byte Pair Encoding, SentencePiece, Chinchilla scaling law, RefinedWeb, LLaMA, GPT, The Stack, Apache Spark, Ray Data, DVC

Needs addressed

  • How to collect and clean web data at scale
  • How to deduplicate training data efficiently
  • How to balance domain representation in a corpus
  • How to generate high-quality synthetic instruction data
  • How to tokenize text for multilingual models
  • How to version and manage large datasets

Read if

  • Data engineers working on ML pipelines
  • ML engineers building training processes
  • AI researchers interested in data-centric approaches
  • NLP practitioners scaling from small corpora
  • Technical leads in AI infrastructure
  • Students of machine learning engineering

May not fit if

  • Readers seeking a pure machine learning theory book
  • Those looking for a guide to model architecture or training algorithms
  • Complete beginners with no programming experience
  • Readers wanting a high-level non-technical overview

Table of contents

  1. Introduction (introduction)
  2. Data as the Foundation of Intelligence (part)
  3. Why Data Matters More Than Models (chapter)
  4. The Scaling Era (section)
  5. Compute versus Data (section)
  6. Chinchilla and Data Efficiency (section)
  7. The Data Bottleneck (section)
  8. Data Quality as a Competitive Advantage (section)
  9. The History of Training Data (chapter)
  10. Early NLP Corpora (section)
  11. Linguistic Datasets (section)
  12. Wikipedia (section)
  13. Common Crawl (section)
  14. Foundation Model Datasets (section)
  15. Anatomy of an LLM Dataset (chapter)
  16. Documents (section)
  17. Tokens (section)
  18. Domains (section)
  19. Languages (section)
  20. Metadata (section)
  21. Dataset Composition (section)
  22. Acquiring Data (part)
  23. Web Crawling Fundamentals (chapter)
  24. How Crawlers Work (section)
  25. URL Discovery (section)
  26. Crawl Scheduling (section)
  27. Robots.txt (section)
  28. Distributed Crawling (section)
  29. Common Crawl (chapter)
  30. Infrastructure (section)
  31. WARC Files (section)
  32. Data Quality (section)
  33. Strengths and Weaknesses (section)
  34. Practical Usage (section)
  35. Books and Long-Form Content (chapter)
  36. Public Domain Books (section)
  37. Books3 (section)
  38. Educational Materials (section)
  39. Long-Context Data (section)
  40. Knowledge Density (section)
  41. News and Journalism (chapter)
  42. News Sources (section)
  43. Freshness (section)
  44. Fact Reporting (section)
  45. Bias and Coverage (section)
  46. Temporal Knowledge (section)
  47. Code Datasets (chapter)
  48. Open-Source Repositories (section)
  49. Licensing Issues (section)
  50. Code Quality (section)
  51. Programming Languages (section)
  52. Code-Specific Challenges (section)
  53. Data Cleaning and Quality Control (part)
  54. Removing Noise (chapter)
  55. The Boilerplate Problem (section)
  56. HTML Parsing and DOM Trees (section)
  57. Heuristic Text Extraction (section)
  58. Handling Language Mixing and Gibberish (section)
  59. Language Identification (chapter)
  60. FastText and N-gram Models (section)
  61. Heuristic and Rule-Based Fallbacks (section)
  62. Multilingual and Script Challenges (section)
  63. Handling Code-Switching (section)
  64. Deduplication (chapter)
  65. The Cost of Duplicates (section)
  66. Exact Deduplication at Scale (section)
  67. MinHash and Locality Sensitive Hashing (section)
  68. Semantic and Embedding Deduplication (section)
  69. Distributed Deduplication Architectures (section)
  70. Filtering Low-Quality Content (chapter)
  71. Perplexity and Language Model Scoring (section)
  72. Training Fast Classifiers (section)
  73. URL and Metadata Heuristics (section)
  74. Combining Signals for Quality Control (section)
  75. Toxicity and Safety Filtering (chapter)
  76. Defining Toxicity and Harm (section)
  77. Hate Speech and NSFW Classifiers (section)
  78. PII Detection and Redaction (section)
  79. Alignment Safety and Edge Cases (section)
  80. Tokenization and Data Representation (part)

Frequently asked questions

What topics does this book cover?

It covers the entire data pipeline for LLMs, including web crawling, cleaning, deduplication, tokenization, domain balancing, and synthetic data generation.

Who is the target audience?

Data engineers, ML engineers, and AI researchers who need to build or improve large-scale training datasets.

What programming languages are used?

The book assumes basic Python knowledge and uses Python examples, but the concepts are language-agnostic.

Does it cover specific models like GPT or LLaMA?

Yes, it analyzes the datasets behind these model families and their peers, including open corpora such as RefinedWeb and the data recipes for Qwen and DeepSeek.

Is this book practical or theoretical?

It is a practical engineering guide with system design documents, tradeoff analyses, and production recommendations.
