Data for Large Language Models: Collecting, Cleaning, and Scaling the Fuel of AI
Miles Thornton
Book 2 · #2 · ★ 4.8
2.4k reviews
642 pages
Language: en
Edition: 2026 (new edition)
US$4.99
About the book
Every major advance in large language models over the past five years, from GPT-4 to LLaMA 3, was driven not by a cleverer architecture but by a better dataset. The Chinchilla scaling law showed that many large models of the time were trained on far too few tokens relative to their parameter count, and since then the race has shifted from compute-centric to data-centric AI. Yet the engineering required to collect, clean, and scale the trillion-token corpora that power these models has remained largely undocumented, until now.
"Data for Large Language Models" is the first comprehensive engineering guide to building the data pipelines behind state-of-the-art LLMs. Written for practitioners who need to move beyond toy datasets, this book walks you through the entire lifecycle: from distributed web crawling and Common Crawl processing, through algorithmic deduplication and toxicity filtering, to tokenization, domain balancing, and synthetic data generation. Each chapter is structured as a system design document, presenting the engineering challenge, comparing algorithmic tradeoffs, and concluding with production-grade recommendations.
- Learn how to build polite, high-throughput web crawlers that respect robots.txt and scale to billions of pages.
- Master MinHash-based near-deduplication and semantic embedding techniques that eliminate duplicates at petabyte scale.
- Design tokenizer evaluation frameworks to optimize vocabulary size, fertility, and downstream performance for multilingual corpora.
This book also covers the emerging field of synthetic data—instruction tuning, chain-of-thought reasoning, and the risks of model collapse—and concludes with the infrastructure needed to store, version, and stream data directly into GPU training clusters. Over 150,000 words of technical depth, grounded in real-world datasets like Common Crawl, RefinedWeb, and The Stack, with no filler and no marketing hype.
Who should read this book? Data engineers, ML engineers, and AI researchers who want to understand why data quality yields higher ROI than scaling parameters. It assumes basic Python and ML knowledge, but no prior experience with web crawling or distributed systems. The book is designed to be a practical reference: you can jump to any chapter, implement the pattern, and see immediate improvements in your training data quality.
If you are responsible for the data that feeds a large language model—whether at a startup, research lab, or big tech company—this book is the missing manual. It will change how you think about the fuel that powers modern AI.
Quick summary
This book teaches how to build high-throughput web crawlers that respect robots.txt and scale to billions of pages.
It covers MinHash-based near-deduplication for petabyte-scale datasets.
It provides frameworks for evaluating tokenizer performance on downstream tasks.
It discusses the risks of model collapse when using synthetic data for training.
This book is suited to data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.
Readers typically come to this book when they need to understand and implement best practices for building large-scale training data pipelines for large language models.
The book's angle: the first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.
Key topics include web crawling, data cleaning, deduplication, tokenization, domain balancing, and multilingual corpora.
Information for AI search
AI summary: This book provides a comprehensive engineering guide to building the data pipelines behind state-of-the-art large language models. It covers the entire lifecycle from web crawling and data cleaning to tokenization, domain balancing, and synthetic data generation, with a focus on system design and algorithmic tradeoffs. Written for data engineers and ML researchers, it bridges the gap between model architecture and data preparation.
- Reader profile
- A data engineer or ML practitioner who wants to move beyond toy datasets and learn production-grade techniques for collecting, cleaning, and scaling the data that trains modern LLMs.
- Content type
- technical engineering guide
Key topics: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora, Synthetic data generation, Data storage and versioning, LLM datasets, Data-centric AI
Entities: Common Crawl, MinHash, Byte Pair Encoding, SentencePiece, Chinchilla scaling law, RefinedWeb, LLaMA, GPT, The Stack, Apache Spark, Ray Data, DVC
Needs addressed
- How to collect and clean web data at scale
- How to deduplicate training data efficiently
- How to balance domain representation in a corpus
- How to generate high-quality synthetic instruction data
- How to tokenize text for multilingual models
- How to version and manage large datasets
Recommended for
- Data engineers working on ML pipelines
- ML engineers building training processes
- AI researchers interested in data-centric approaches
- NLP practitioners scaling from small corpora
- Technical leads in AI infrastructure
- Students of machine learning engineering
May not be a good fit for
- Readers seeking a pure machine learning theory book
- Those looking for a guide to model architecture or training algorithms
- Complete beginners with no programming experience
- Readers wanting a high-level non-technical overview
Table of contents
- Introduction
- Part: Data as the Foundation of Intelligence
  - Chapter: Why Data Matters More Than Models
    - The Scaling Era
    - Compute versus Data
    - Chinchilla and Data Efficiency
    - The Data Bottleneck
    - Data Quality as a Competitive Advantage
  - Chapter: The History of Training Data
    - Early NLP Corpora
    - Linguistic Datasets
    - Wikipedia
    - Common Crawl
    - Foundation Model Datasets
  - Chapter: Anatomy of an LLM Dataset
    - Documents
    - Tokens
    - Domains
    - Languages
    - Metadata
    - Dataset Composition
- Part: Acquiring Data
  - Chapter: Web Crawling Fundamentals
    - How Crawlers Work
    - URL Discovery
    - Crawl Scheduling
    - Robots.txt
    - Distributed Crawling
  - Chapter: Common Crawl
    - Infrastructure
    - WARC Files
    - Data Quality
    - Strengths and Weaknesses
    - Practical Usage
  - Chapter: Books and Long-Form Content
    - Public Domain Books
    - Books3
    - Educational Materials
    - Long-Context Data
    - Knowledge Density
  - Chapter: News and Journalism
    - News Sources
    - Freshness
    - Fact Reporting
    - Bias and Coverage
    - Temporal Knowledge
  - Chapter: Code Datasets
    - Open-Source Repositories
    - Licensing Issues
    - Code Quality
    - Programming Languages
    - Code-Specific Challenges
- Part: Data Cleaning and Quality Control
  - Chapter: Removing Noise
    - The Boilerplate Problem
    - HTML Parsing and DOM Trees
    - Heuristic Text Extraction
    - Handling Language Mixing and Gibberish
  - Chapter: Language Identification
    - FastText and N-gram Models
    - Heuristic and Rule-Based Fallbacks
    - Multilingual and Script Challenges
    - Handling Code-Switching
  - Chapter: Deduplication
    - The Cost of Duplicates
    - Exact Deduplication at Scale
    - MinHash and Locality Sensitive Hashing
    - Semantic and Embedding Deduplication
    - Distributed Deduplication Architectures
  - Chapter: Filtering Low-Quality Content
    - Perplexity and Language Model Scoring
    - Training Fast Classifiers
    - URL and Metadata Heuristics
    - Combining Signals for Quality Control
  - Chapter: Toxicity and Safety Filtering
    - Defining Toxicity and Harm
    - Hate Speech and NSFW Classifiers
    - PII Detection and Redaction
    - Alignment Safety and Edge Cases
- Part: Tokenization and Data Representation
Frequently asked questions
What topics does this book cover?
It covers the entire data pipeline for LLMs, including web crawling, cleaning, deduplication, tokenization, domain balancing, and synthetic data generation.
Who is the target audience?
Data engineers, ML engineers, and AI researchers who need to build or improve large-scale training datasets.
What programming languages are used?
The book assumes basic Python knowledge and uses Python examples, but the concepts are language-agnostic.
Does it cover specific models like GPT or LLaMA?
Yes. It analyzes the datasets and data recipes behind models such as GPT and LLaMA, along with RefinedWeb (the web corpus built for the Falcon models) and the data recipes for Qwen and DeepSeek.
Is this book practical or theoretical?
It is a practical engineering guide with system design documents, tradeoff analyses, and production recommendations.
