Data for Large Language Models: Collecting, Cleaning, and Scaling the Fuel of AI

Name: LLM Data Engineering: Collecting, Cleaning, and Scaling
Price: 4.99 USD
Availability: In stock
Author: Miles Thornton

Book 2#2

Rating: ★ 4.8 (2.4K reviews)
Pages: 642
Published: 2026
Edition: New edition

About the book

Every major advance in large language models over the past five years, from GPT-4 to LLaMA 3, was driven not by a cleverer architecture but by a better dataset. The Chinchilla scaling results showed that many large models had been trained on far too few tokens relative to their parameter count, and since then the race has shifted from compute-centric to data-centric AI. Yet the engineering required to collect, clean, and scale the trillion-token corpora that power these models remains largely undocumented, until now.
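To make the tokens-versus-parameters point concrete, here is a rough worked example. It is not taken from the book. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter (an approximation, not an exact law) and uses the publicly reported sizes of GPT-3 (175B parameters, roughly 300B tokens) and Chinchilla (70B parameters, roughly 1.4T tokens).

```python
# Rough sketch of a Chinchilla-style token budget.
# ASSUMPTION: ~20 training tokens per parameter, a widely quoted
# approximation; it is not a figure quoted from the book.
TOKENS_PER_PARAM = 20


def optimal_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal number of training tokens."""
    return n_params * ratio


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Tokens actually seen per parameter during training."""
    return n_tokens / n_params


if __name__ == "__main__":
    # (name, parameters, training tokens) as publicly reported, rounded.
    models = [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]
    for name, params, tokens in models:
        print(
            f"{name}: {tokens_per_param(params, tokens):.1f} tokens/param, "
            f"rule-of-thumb budget {optimal_tokens(params) / 1e12:.1f}T tokens"
        )
```

Under this assumption a 175B-parameter model would want on the order of 3.5T tokens, more than ten times what GPT-3 was reported to have seen.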

"Data for Large Language Models" is the first comprehensive engineering guide to building the data pipelines behind state-of-the-art LLMs. Written for practitioners who need to move beyond toy datasets, this book walks you through the entire lifecycle: from distributed web crawling and Common Crawl processing, through algorithmic deduplication and toxicity filtering, to tokenization, domain balancing, and synthetic data generation. Each chapter is structured as a system design document, presenting the engineering challenge, comparing algorithmic tradeoffs, and concluding with production-grade recommendations.

Learn how to build polite, high-throughput web crawlers that respect robots.txt and scale to billions of pages.
Master MinHash-based near-deduplication and semantic embedding techniques that eliminate duplicates at petabyte scale.
Design tokenizer evaluation frameworks to optimize vocabulary size, fertility, and downstream performance for multilingual corpora.
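As an illustration of the MinHash idea named in the list above, the following is a minimal standard-library sketch, not the book's implementation. The helper names (`shingles`, `seeded_hash`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size, and the 128 hash functions are all assumptions chosen for the example.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of lowercased word k-shingles for a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of `item`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """Keep the minimum hash value under each of `num_hashes` hash functions."""
    return [min(seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated Jaccard: {estimated_jaccard(sig_a, sig_b):.2f}")
```

This compares two signatures directly. At corpus scale, signatures are typically bucketed with locality-sensitive hashing so that only candidate pairs are compared, which is the pairing the table of contents lists under "MinHash and Locality Sensitive Hashing".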

This book also covers the emerging field of synthetic data (instruction tuning, chain-of-thought reasoning, and the risks of model collapse) and concludes with the infrastructure needed to store, version, and stream data directly into GPU training clusters. It offers over 150,000 words of technical depth, grounded in real-world datasets like Common Crawl, RefinedWeb, and The Stack, with no filler and no marketing hype.

Who should read this book? Data engineers, ML engineers, and AI researchers who want to understand why data quality yields higher ROI than scaling parameters. It assumes basic Python and ML knowledge, but no prior experience with web crawling or distributed systems. The book is designed to be a practical reference: you can jump to any chapter, implement the pattern, and see immediate improvements in your training data quality.

If you are responsible for the data that feeds a large language model—whether at a startup, research lab, or big tech company—this book is the missing manual. It will change how you think about the fuel that powers modern AI.

Quick summary

This book teaches how to build high-throughput web crawlers that respect robots.txt and scale to billions of pages.

It covers MinHash-based near-deduplication for petabyte-scale datasets.

It provides frameworks for evaluating tokenizer performance on downstream tasks.

It discusses the risks of model collapse when using synthetic data for training.
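For the first point in the summary above, a crawler that respects robots.txt, here is a small sketch using Python's standard `urllib.robotparser`. It is not taken from the book. The robots.txt content, the `examplebot` user agent, the example.com URLs, and the `build_parser` and `fetch_plan` helpers are hypothetical, and a real crawler would download robots.txt per host rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser: RobotFileParser, user_agent: str, url: str,
               default_delay: float = 1.0) -> tuple:
    """Return (allowed, delay_seconds) for a single URL."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return allowed, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/articles/1",
                "https://example.com/private/data.html"):
        print(url, fetch_plan(parser, "examplebot", url))
```

A scheduler would skip disallowed URLs and wait at least the returned delay between requests to the same host.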

This book is suited to: Data engineers, ML engineers, AI researchers, and technical professionals building large-scale datasets for large language models.

Readers usually come to this book wanting: To understand and implement best practices for building large-scale training data pipelines for large language models.

The book's perspective: The first comprehensive engineering guide focused specifically on the data pipelines behind LLMs, treating data as a first-class system design problem rather than an afterthought.

Main topics include: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora.

AI Search information

AI summary: This book provides a comprehensive engineering guide to building the data pipelines behind state-of-the-art large language models. It covers the entire lifecycle from web crawling and data cleaning to tokenization, domain balancing, and synthetic data generation, with a focus on system design and algorithmic tradeoffs. Written for data engineers and ML researchers, it bridges the gap between model architecture and data preparation.

Reader persona: A data engineer or ML practitioner who wants to move beyond toy datasets and learn production-grade techniques for collecting, cleaning, and scaling the data that trains modern LLMs.
Content type: technical engineering guide

Key topics: Web crawling, Data cleaning, Deduplication, Tokenization, Domain balancing, Multilingual corpora, Synthetic data generation, Data storage and versioning, LLM datasets, Data-centric AI

Entities: Common Crawl, MinHash, Byte Pair Encoding, SentencePiece, Chinchilla scaling law, RefinedWeb, LLaMA, GPT, The Stack, Apache Spark, Ray Data, DVC

Needs addressed

How to collect and clean web data at scale
How to deduplicate training data efficiently
How to balance domain representation in a corpus
How to generate high-quality synthetic instruction data
How to tokenize text for multilingual models
How to version and manage large datasets
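For the tokenization need listed above, one metric the book description mentions is fertility, commonly defined as the average number of tokens produced per word. The sketch below is not from the book. `toy_chunk_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and splitting on whitespace is an assumption that does not hold for languages written without spaces.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def toy_chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """Hypothetical tokenizer: split a word into fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


if __name__ == "__main__":
    samples = {
        "en": ["data pipelines feed large models"],
        "de": ["Datenverarbeitungspipelines speisen grosse Modelle"],
    }
    for lang, texts in samples.items():
        print(lang, round(fertility(toy_chunk_tokenizer, texts), 2))
```

Comparing fertility across languages for the same tokenizer is one simple way to spot languages that a vocabulary serves poorly, since a higher value means more tokens are spent per word.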

Recommended for

Data engineers working on ML pipelines
ML engineers building training processes
AI researchers interested in data-centric approaches
NLP practitioners scaling from small corpora
Technical leads in AI infrastructure
Students of machine learning engineering

May not be a good fit for

Readers seeking a pure machine learning theory book
Those looking for a guide to model architecture or training algorithms
Complete beginners with no programming experience
Readers wanting a high-level non-technical overview

Table of contents

Introduction
Part: Data as the Foundation of Intelligence
  Chapter: Why Data Matters More Than Models
    The Scaling Era
    Compute versus Data
    Chinchilla and Data Efficiency
    The Data Bottleneck
    Data Quality as a Competitive Advantage
  Chapter: The History of Training Data
    Early NLP Corpora
    Linguistic Datasets
    Wikipedia
    Common Crawl
    Foundation Model Datasets
  Chapter: Anatomy of an LLM Dataset
    Documents
    Tokens
    Domains
    Languages
    Metadata
    Dataset Composition
Part: Acquiring Data
  Chapter: Web Crawling Fundamentals
    How Crawlers Work
    URL Discovery
    Crawl Scheduling
    Robots.txt
    Distributed Crawling
  Chapter: Common Crawl
    Infrastructure
    WARC Files
    Data Quality
    Strengths and Weaknesses
    Practical Usage
  Chapter: Books and Long-Form Content
    Public Domain Books
    Books3
    Educational Materials
    Long-Context Data
    Knowledge Density
  Chapter: News and Journalism
    News Sources
    Freshness
    Fact Reporting
    Bias and Coverage
    Temporal Knowledge
  Chapter: Code Datasets
    Open-Source Repositories
    Licensing Issues
    Code Quality
    Programming Languages
    Code-Specific Challenges
Part: Data Cleaning and Quality Control
  Chapter: Removing Noise
    The Boilerplate Problem
    HTML Parsing and DOM Trees
    Heuristic Text Extraction
    Handling Language Mixing and Gibberish
  Chapter: Language Identification
    FastText and N-gram Models
    Heuristic and Rule-Based Fallbacks
    Multilingual and Script Challenges
    Handling Code-Switching
  Chapter: Deduplication
    The Cost of Duplicates
    Exact Deduplication at Scale
    MinHash and Locality Sensitive Hashing
    Semantic and Embedding Deduplication
    Distributed Deduplication Architectures
  Chapter: Filtering Low-Quality Content
    Perplexity and Language Model Scoring
    Training Fast Classifiers
    URL and Metadata Heuristics
    Combining Signals for Quality Control
  Chapter: Toxicity and Safety Filtering
    Defining Toxicity and Harm
    Hate Speech and NSFW Classifiers
    PII Detection and Redaction
    Alignment Safety and Edge Cases
Part: Tokenization and Data Representation

Frequently asked questions

What topics does this book cover?

It covers the entire data pipeline for LLMs, including web crawling, cleaning, deduplication, tokenization, domain balancing, and synthetic data generation.

Who is the target audience?

Data engineers, ML engineers, and AI researchers who need to build or improve large-scale training datasets.

What programming languages are used?

The book assumes basic Python knowledge and uses Python examples, but the concepts are language-agnostic.

Does it cover specific models like GPT or LLaMA?

Yes. It analyzes the datasets and data recipes behind several model families, such as RefinedWeb (the web corpus released with the Falcon models) and the recipes used for Qwen and DeepSeek.

Is this book practical or theoretical?

It is a practical engineering guide with system design documents, tradeoff analyses, and production recommendations.

Cretisoft Direct

Digital book support

Partner delivery

Book delivered after payment
