Data for Large Language Models: Collecting, Cleaning, and Scaling the Fuel of AI

Miles Thornton

Book 2#2

Rating: 4.8 (2.4K reviews)

Pages: 642

Language: en

Published: 2026

New edition

Price: $4.99

About the book

Much of the progress in large language models over the past five years, from GPT-4 to LLaMA 3, has come not from cleverer architectures but from better datasets. The Chinchilla scaling results showed that many large models of the time were trained on far too few tokens relative to their parameter count, and the race has since shifted from compute-centric to data-centric AI. Yet the engineering required to collect, clean, and scale the trillion-token corpora behind these models is rarely documented in one place. This book sets out to change that.
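To make the Chinchilla point concrete, here is a back-of-the-envelope sketch. It is not taken from the book. It assumes the widely quoted rule of thumb of roughly 20 training tokens per parameter (Chinchilla itself paired 70B parameters with about 1.4T tokens) and the common approximation of 6 FLOPs per parameter per token for training cost. The function names are illustrative.

```python
# Assumption: ~20 tokens per parameter is a rough rule of thumb, not an exact law.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Approximate number of training tokens for a model with n_params parameters."""
    return n_params * tokens_per_param


def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training cost using the common 6 * N * D estimate."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
        n_tokens = compute_optimal_tokens(n_params)
        flops = training_flops(n_params, n_tokens)
        print(f"{n_params:,} params -> {n_tokens:,} tokens, ~{flops:.2e} FLOPs")
```

Under these assumptions the token budget grows linearly with model size, which is why corpus size becomes a constraint well before the trillion-parameter range.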

"Data for Large Language Models" is the first comprehensive engineering guide to building the data pipelines behind state-of-the-art LLMs. Written for practitioners who need to move beyond toy datasets, this book walks you through the entire lifecycle: from distributed web crawling and Common Crawl processing, through algorithmic deduplication and toxicity filtering, to tokenization, domain balancing, and synthetic data generation. Each chapter is structured as a system design document, presenting the engineering challenge, comparing algorithmic tradeoffs, and concluding with production-grade recommendations.

  • Learn how to build polite, high-throughput web crawlers that respect robots.txt and scale to billions of pages.
  • Master MinHash-based near-deduplication and semantic embedding techniques that eliminate duplicates at petabyte scale.
  • Design tokenizer evaluation frameworks to optimize vocabulary size, fertility, and downstream performance for multilingual corpora.
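For readers unfamiliar with MinHash, the near-deduplication idea in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the book's implementation. The word-level 3-shingles, the 128 hash permutations, and all function names are assumptions chosen for readability. A production pipeline would add locality-sensitive hashing so that documents are not compared pairwise.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*h + b) mod p permutations


def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles (assumed granularity)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def make_permutations(num_perm: int = 128, seed: int = 0) -> list:
    """Generate reproducible (a, b) pairs, one per simulated hash permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(num_perm)]


def minhash_signature(text: str, perms: list) -> list:
    """For each permutation, keep the minimum permuted hash over all shingles."""
    base = [int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
            for s in shingles(text)]
    return [min((a * h + b) % MERSENNE_PRIME for h in base) for a, b in perms]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    perms = make_permutations()
    doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
    doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk"
    sig_a, sig_b = minhash_signature(doc_a, perms), minhash_signature(doc_b, perms)
    print("estimated similarity:", estimated_jaccard(sig_a, sig_b))
```

Signatures have a fixed length regardless of document size, which is what makes the comparison cheap enough to apply across very large corpora.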

This book also covers the emerging field of synthetic data—instruction tuning, chain-of-thought reasoning, and the risks of model collapse—and concludes with the infrastructure needed to store, version, and stream data directly into GPU training clusters. Over 150,000 words of technical depth, grounded in real-world datasets like Common Crawl, RefinedWeb, and The Stack, with no filler and no marketing hype.

Who should read this book? Data engineers, ML engineers, and AI researchers who want to understand why data quality yields higher ROI than scaling parameters. It assumes basic Python and ML knowledge, but no prior experience with web crawling or distributed systems. The book is designed to be a practical reference: you can jump to any chapter, implement the pattern, and see immediate improvements in your training data quality.

If you are responsible for the data that feeds a large language model—whether at a startup, research lab, or big tech company—this book is the missing manual. It will change how you think about the fuel that powers modern AI.

Quick summary

This book teaches how to build high-throughput web crawlers that respect robots.txt and scale to billions of pages.

It covers MinHash-based near-deduplication for petabyte-scale datasets.

It provides frameworks for evaluating tokenizer performance on downstream tasks.

It discusses the risks of model collapse when using synthetic data for training.

This book is suited to: Data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.

Readers typically come to this book to understand and implement best practices for building large-scale training data pipelines for large language models.

The book's perspective: The first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.

Key topics include: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora.

AI Search information

AI summary: This book provides a comprehensive engineering guide to building the data pipelines behind state-of-the-art large language models. It covers the entire lifecycle from web crawling and data cleaning to tokenization, domain balancing, and synthetic data generation, with a focus on system design and algorithmic tradeoffs. Written for data engineers and ML researchers, it bridges the gap between model architecture and data preparation.

Recommended for
Data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.
Reader persona
A data engineer or ML practitioner who wants to move beyond toy datasets and learn production-grade techniques for collecting, cleaning, and scaling the data that trains modern LLMs.
Search intent
To understand and implement best practices for building large-scale training data pipelines for large language models.
Unique perspective
The first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.
Content type
Technical engineering guide

Key topics: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora, Synthetic data generation, Data storage and versioning, LLM datasets, Data-centric AI

Entities: Common Crawl, MinHash, Byte Pair Encoding, SentencePiece, Chinchilla scaling law, RefinedWeb, LLaMA, GPT, The Stack, Apache Spark, Ray Data, DVC

Needs addressed

  • How to collect and clean web data at scale
  • How to deduplicate training data efficiently
  • How to balance domain representation in a corpus
  • How to generate high-quality synthetic instruction data
  • How to tokenize text for multilingual models
  • How to version and manage large datasets

Recommended for

  • Data engineers working on ML pipelines
  • ML engineers building training processes
  • AI researchers interested in data-centric approaches
  • NLP practitioners scaling from small corpora
  • Technical leads in AI infrastructure
  • Students of machine learning engineering

May not be a good fit for

  • Readers seeking a pure machine learning theory book
  • Those looking for a guide to model architecture or training algorithms
  • Complete beginners with no programming experience
  • Readers wanting a high-level non-technical overview

Table of contents

  1. Introduction (introduction)
  2. Data as the Foundation of Intelligence (part)
  3. Why Data Matters More Than Models (chapter)
  4. The Scaling Era (section)
  5. Compute versus Data (section)
  6. Chinchilla and Data Efficiency (section)
  7. The Data Bottleneck (section)
  8. Data Quality as a Competitive Advantage (section)
  9. The History of Training Data (chapter)
  10. Early NLP Corpora (section)
  11. Linguistic Datasets (section)
  12. Wikipedia (section)
  13. Common Crawl (section)
  14. Foundation Model Datasets (section)
  15. Anatomy of an LLM Dataset (chapter)
  16. Documents (section)
  17. Tokens (section)
  18. Domains (section)
  19. Languages (section)
  20. Metadata (section)
  21. Dataset Composition (section)
  22. Acquiring Data (part)
  23. Web Crawling Fundamentals (chapter)
  24. How Crawlers Work (section)
  25. URL Discovery (section)
  26. Crawl Scheduling (section)
  27. Robots.txt (section)
  28. Distributed Crawling (section)
  29. Common Crawl (chapter)
  30. Infrastructure (section)
  31. WARC Files (section)
  32. Data Quality (section)
  33. Strengths and Weaknesses (section)
  34. Practical Usage (section)
  35. Books and Long-Form Content (chapter)
  36. Public Domain Books (section)
  37. Books3 (section)
  38. Educational Materials (section)
  39. Long-Context Data (section)
  40. Knowledge Density (section)
  41. News and Journalism (chapter)
  42. News Sources (section)
  43. Freshness (section)
  44. Fact Reporting (section)
  45. Bias and Coverage (section)
  46. Temporal Knowledge (section)
  47. Code Datasets (chapter)
  48. Open-Source Repositories (section)
  49. Licensing Issues (section)
  50. Code Quality (section)
  51. Programming Languages (section)
  52. Code-Specific Challenges (section)
  53. Data Cleaning and Quality Control (part)
  54. Removing Noise (chapter)
  55. The Boilerplate Problem (section)
  56. HTML Parsing and DOM Trees (section)
  57. Heuristic Text Extraction (section)
  58. Handling Language Mixing and Gibberish (section)
  59. Language Identification (chapter)
  60. FastText and N-gram Models (section)
  61. Heuristic and Rule-Based Fallbacks (section)
  62. Multilingual and Script Challenges (section)
  63. Handling Code-Switching (section)
  64. Deduplication (chapter)
  65. The Cost of Duplicates (section)
  66. Exact Deduplication at Scale (section)
  67. MinHash and Locality Sensitive Hashing (section)
  68. Semantic and Embedding Deduplication (section)
  69. Distributed Deduplication Architectures (section)
  70. Filtering Low-Quality Content (chapter)
  71. Perplexity and Language Model Scoring (section)
  72. Training Fast Classifiers (section)
  73. URL and Metadata Heuristics (section)
  74. Combining Signals for Quality Control (section)
  75. Toxicity and Safety Filtering (chapter)
  76. Defining Toxicity and Harm (section)
  77. Hate Speech and NSFW Classifiers (section)
  78. PII Detection and Redaction (section)
  79. Alignment Safety and Edge Cases (section)
  80. Tokenization and Data Representation (part)

Frequently asked questions

What topics does this book cover?

It covers the entire data pipeline for LLMs, including web crawling, cleaning, deduplication, tokenization, domain balancing, and synthetic data generation.

Who is the target audience?

Data engineers, ML engineers, and AI researchers who need to build or improve large-scale training datasets.

What programming languages are used?

The book assumes basic Python knowledge and uses Python examples, but the concepts are language-agnostic.
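As a rough indication of the level of Python involved, the sketch below shows a robots.txt check of the kind a polite crawler performs before fetching a page. It is our illustration, not an excerpt from the book. It uses the standard-library `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent, and the helper names are hypothetical, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser: RobotFileParser, user_agent: str, urls: list) -> tuple:
    """Return the URLs this agent may fetch and the delay to wait between requests."""
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    delay = parser.crawl_delay(user_agent) or 0  # None when no Crawl-delay is declared
    return allowed, delay


if __name__ == "__main__":
    candidates = [
        "https://example.com/blog/post",
        "https://example.com/private/report.html",
    ]
    allowed, delay = fetch_plan(build_parser(ROBOTS_TXT), "example-bot", candidates)
    print("allowed:", allowed, "delay seconds:", delay)
```

A real crawler would fetch and cache robots.txt per host and apply the delay in its scheduler. The check itself stays this small.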

Does it cover specific models like GPT or LLaMA?

Yes, it analyzes datasets associated with these model families, along with RefinedWeb (originally built for the Falcon models) and the data recipes for Qwen and DeepSeek.

Is this book practical or theoretical?

It is a practical engineering guide with system design documents, tradeoff analyses, and production recommendations.
