Systems for Large Language Models: Inference, Agents, and Production AI Infrastructure
Miles Thornton
Book 4#4
Rating: ★ 4.8 (2.4k reviews)
Pages: 581
Language: en
Edition: 2026 (new edition)
Price: US$4.99
About the book
Your GPT-4 deployment is using less than 30% of your GPU capacity. Your latency spikes are unpredictable, and your team spends more time debugging infrastructure than shipping features. This is the reality of production AI that no one talks about until after the model is trained.
Systems for Large Language Models: Inference, Agents, and Production AI Infrastructure is the first engineering-focused guide that bridges the gap between model development and production operations. Written for software and AI engineers, this book systematically covers the entire lifecycle of LLM deployment—from the raw token generation loop to enterprise-scale multi-agent orchestration. It eschews mathematical theory in favor of architectural patterns, framework comparisons, and operational tradeoffs.
The book is organized into 32 chapters across 8 parts, each addressing a critical layer of the AI infrastructure stack:
- Understand why the KV cache dominates memory costs and how PagedAttention solves it.
- Compare quantization methods (GPTQ, AWQ, GGUF) and choose the right precision for your latency and budget.
- Master modern serving frameworks like vLLM, SGLang, and TGI with deployment-ready configurations.
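To make the first bullet concrete, here is a back-of-envelope sketch of KV cache sizing. It uses the standard accounting of one key vector and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the model dimensions are illustrative assumptions (roughly the shape of a 7B-class model with an FP16 cache); they are not taken from the book.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical 7B-class configuration (assumed values, FP16 cache).
CONFIG = dict(num_layers=32, num_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

per_token = kv_cache_bytes(**CONFIG, seq_len=1, batch_size=1)
one_request = kv_cache_bytes(**CONFIG, seq_len=4096, batch_size=1)
batch_of_16 = kv_cache_bytes(**CONFIG, seq_len=4096, batch_size=16)

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"one request: {one_request / GIB:.1f} GiB")
print(f"batch of 16: {batch_of_16 / GIB:.1f} GiB")
```

Under these assumptions, 16 concurrent requests at 4,096 tokens each need about 32 GiB of cache, which is more than the roughly 14 GB that 7 billion FP16 weights occupy. That is the sense in which the cache, rather than the weights, tends to limit batch size.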
Beyond inference, the book dives into retrieval-augmented generation (RAG) with advanced techniques like hybrid search and agentic retrieval. You will learn to design autonomous agents using ReAct, planner-executor, and multi-agent coordination patterns. The final parts cover evaluation, guardrails, observability, and cost optimization—the operational disciplines that separate hobby projects from reliable products.
This book is for backend engineers moving into AI, ML engineers who need to understand infrastructure, and technical architects responsible for platform decisions. If you design, build, or operate LLM-powered applications in production, this is your reference manual.
You will walk away with a systems-level mental model of AI infrastructure, the ability to evaluate frameworks with engineering rigor, and a clear path to building scalable, cost-effective AI products.
Quick summary
This book explains how the KV cache dominates GPU memory costs and how PagedAttention reduces memory fragmentation.
It compares quantization methods like GPTQ, AWQ, and GGUF to help readers choose the right precision for latency and budget.
Readers learn to design autonomous agents using ReAct, planner-executor, and multi-agent coordination patterns.
The book covers evaluation, guardrails, and observability to ensure reliable and safe AI deployments.
It provides deployment-ready configurations for vLLM and other serving frameworks.
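The fragmentation point in the summary can be illustrated with a toy accounting model. It compares reserving one contiguous, maximum-length region per request against handing out fixed-size blocks on demand, which is the idea behind PagedAttention. This is a simplified sketch, not vLLM's allocator; the request lengths, `MAX_LEN`, and `BLOCK_SIZE` are assumed values chosen for illustration.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request pre-allocates max_len."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size):
    """Token slots reserved when each request holds only the blocks it needs."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    return blocks * block_size


# Hypothetical workload: current lengths of three in-flight requests.
seq_lens = [100, 700, 2300]
MAX_LEN = 4096      # assumed maximum context length
BLOCK_SIZE = 16     # assumed tokens per block

used = sum(seq_lens)
contig = contiguous_slots(seq_lens, MAX_LEN)
paged = paged_slots(seq_lens, BLOCK_SIZE)

print(f"contiguous: {contig} slots reserved, {used / contig:.0%} used")
print(f"paged:      {paged} slots reserved, {used / paged:.0%} used")
```

With these numbers the contiguous scheme reserves 12,288 slots for 3,100 tokens, while the paged scheme reserves 3,120. The only waste in the paged case is the partly filled last block of each request.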
This book is suited to software engineers, AI/ML engineers, and technical architects building LLM-powered applications.
Readers typically turn to this book when they need a comprehensive engineering guide to designing, deploying, and scaling LLM inference servers, agent systems, and production AI infrastructure.
The book's approach: This book focuses on the systems and engineering decisions behind LLM deployment, bridging the gap between ML model development and production operations, with a practical, architecture-first approach.
Key topics include inference optimization, KV cache management, quantization, model serving (vLLM, TGI, SGLang), retrieval-augmented generation, and embeddings and vector databases.
Information for AI Search
AI summary: This book provides a systems-level guide to deploying large language models in production. It covers inference optimization, serving frameworks (vLLM, TGI, SGLang), retrieval-augmented generation, agent architectures, evaluation, and cost management. Targeted at software engineers and AI engineers, it emphasizes architectural patterns and operational tradeoffs over mathematical theory.
- Suitable for
- Software engineers, AI/ML engineers, and technical architects building LLM-powered applications.
- Reader profile
- A backend or ML engineer responsible for deploying and operating LLM systems in production, seeking practical architectural patterns and framework comparisons.
- Search need
- Professionals searching for a comprehensive engineering guide to design, deploy, and scale LLM inference servers, agent systems, and production AI infrastructure.
- Approach
- This book focuses on the systems and engineering decisions behind LLM deployment, bridging the gap between ML model development and production operations, with a practical, architecture-first approach.
- Content type
- developer guide
Key topics: Inference optimization, KV cache management, Quantization, Model serving (vLLM, TGI, SGLang), Retrieval-Augmented Generation, Embeddings and vector databases, Agent architectures, Multi-agent systems, Evaluation and guardrails, Production AI operations
Entities: LLM inference, KV cache, PagedAttention, Quantization (GPTQ, AWQ, GGUF), vLLM, RAG, Embeddings, ReAct agents, Observability, AI infrastructure
Needs addressed
- Reducing GPU memory usage and latency during LLM inference.
- Selecting the right inference server and deployment configuration.
- Integrating external knowledge into LLM outputs via RAG.
- Designing autonomous agents that reliably use tools and APIs.
- Monitoring and evaluating LLM systems to detect hallucinations and drift.
- Optimizing costs for large-scale AI infrastructure.
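For the memory and cost needs listed above, a quick way to reason about precision is to multiply the parameter count by the bits stored per weight. The sketch below does that arithmetic. The 7-billion-parameter count and the helper name `weight_gb` are assumptions for illustration, and the estimate ignores the scales, zero-points, and other metadata that real GPTQ, AWQ, or GGUF files carry, so actual files are somewhat larger.

```python
# Bits stored per weight for each numeric format.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}


def weight_gb(num_params, fmt):
    """Approximate weight storage in GB (10^9 bytes), excluding metadata."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_gb(NUM_PARAMS, fmt):.1f} GB")
```

Each halving of precision halves the weight footprint, so the choice largely comes down to how much quality loss a given workload can tolerate in exchange for fitting on a smaller or cheaper GPU.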
Recommended for
- Backend engineers transitioning to AI
- ML engineers needing infrastructure knowledge
- Technical architects designing AI platforms
- DevOps/SRE teams managing GPU clusters
- AI product developers
- Students in AI engineering programs
May not be a good fit for
- Data scientists focused only on model training without deployment interest
- Researchers seeking math-heavy theoretical ML content
- Beginners without basic programming or API knowledge
Table of contents
- Introduction
- The Inference Problem (part)
  - From Training to Production (chapter)
    - Why Training Is Only the Beginning
    - The Inference Challenge
    - Latency vs Quality
    - Cost vs Performance
    - Production Requirements
  - Anatomy of an Inference Engine (chapter)
    - Token Generation
    - Decoding Loops
    - Batching
    - Scheduling
    - Throughput Optimization
  - KV Cache (chapter)
    - Why KV Cache Exists
    - Memory Tradeoffs
    - Context Reuse
    - Long Conversations
  - Modern Decoding Techniques (chapter)
    - Greedy Decoding
    - Beam Search
    - Top-k Sampling
    - Top-p Sampling
    - Temperature Control
- Optimizing Language Models (part)
  - Quantization Fundamentals (chapter)
    - Precision and Memory
    - FP32
    - FP16
    - BF16
    - INT8
    - INT4
  - Modern Quantization Methods (chapter)
    - GPTQ
    - AWQ
    - GGUF
    - Dynamic Quantization
    - Tradeoffs
  - Efficient Attention Systems (chapter)
    - FlashAttention
    - Paged Attention
    - Sliding Window Attention
    - Long Context Optimization
  - Speculative Decoding (chapter)
    - Draft Models
    - Verification Models
    - Speed Improvements
    - Practical Usage
- Serving Language Models (part)
  - Model Serving Architectures (chapter)
    - API Services
    - Inference Servers
    - Multi-Tenant Systems
    - Production Patterns
  - vLLM (chapter)
    - Architecture
    - Continuous Batching
    - Memory Efficiency
    - Real-World Deployment
  - Alternative Serving Frameworks (chapter)
    - SGLang
    - TGI
    - Ollama
    - Emerging Systems
  - Scaling AI Services (chapter)
    - Load Balancing
    - Autoscaling
    - GPU Pools
    - Capacity Planning
- Retrieval-Augmented Generation (part)
  - Why Models Need Retrieval (chapter)
    - Knowledge Limitations
    - Hallucinations
    - Fresh Information
    - Enterprise Requirements
  - Embeddings (chapter)
    - Representation Learning
    - Similarity Search
    - Embedding Models
Frequently asked questions
What is the main focus of this book?
It covers the entire lifecycle of deploying LLMs in production, from inference optimization to agent systems and platform operations.
Who is this book for?
Software engineers, AI/ML engineers, and technical architects building or operating LLM-powered applications.
Does the book cover specific tools?
Yes, it compares vLLM, TGI, SGLang, and other serving frameworks, and discusses quantization methods like GPTQ, AWQ, and GGUF.
Is prior ML knowledge required?
Familiarity with basic transformer concepts and Python is assumed; mathematical depth is kept minimal.
What are the key topics?
Inference optimization, KV cache, quantization, serving frameworks, RAG, agents, evaluation, and production operations.
