Systems for Large Language Models: Inference, Agents, and Production AI Infrastructure

Name: Systems for Large Language Models: Inference, Agents, and AI Infrastr...
Price: 4.99 USD
Availability: InStock
Author: Miles Thornton

Book 4#4

★ 4.8

2.4k reviews

581

Pages

Language

2026

Edition

New edition

US$4.99

About the book

Your LLM deployment is using less than 30% of your GPU capacity. Your latency spikes are unpredictable, and your team spends more time debugging infrastructure than shipping features. This is the reality of production AI that no one talks about until after the model is trained.

Systems for Large Language Models: Inference, Agents, and Production AI Infrastructure is the first engineering-focused guide that bridges the gap between model development and production operations. Written for software and AI engineers, this book systematically covers the entire lifecycle of LLM deployment—from the raw token generation loop to enterprise-scale multi-agent orchestration. It eschews mathematical theory in favor of architectural patterns, framework comparisons, and operational tradeoffs.

The book is organized into 32 chapters across 8 parts, each addressing a critical layer of the AI infrastructure stack:

Understand why the KV cache dominates memory costs and how PagedAttention solves it.
Compare quantization methods (GPTQ, AWQ, GGUF) and choose the right precision for your latency and budget.
Master modern serving frameworks like vLLM, SGLang, and TGI with deployment-ready configurations.
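
To make the first point concrete, here is a minimal sketch of the arithmetic behind KV cache size for a standard multi-head attention model. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration and is not taken from the book.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # The leading 2 accounts for storing both a key and a value tensor
    # in every layer for every cached token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, batch_size=1, **shape)
total = kv_cache_bytes(seq_len=4096, batch_size=8, **shape)

print(f"{per_token / 2**10:.0f} KiB per cached token")
print(f"{total / 2**30:.0f} GiB for 8 sequences of 4096 tokens")
```

Under these assumptions the cache grows linearly with both context length and batch size, which is why it can outweigh the model weights at high concurrency.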

Beyond inference, the book dives into retrieval-augmented generation (RAG) with advanced techniques like hybrid search and agentic retrieval. You will learn to design autonomous agents using ReAct, planner-executor, and multi-agent coordination patterns. The final parts cover evaluation, guardrails, observability, and cost optimization—the operational disciplines that separate hobby projects from reliable products.
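
Hybrid search is commonly implemented by running a keyword query and a vector query separately and then merging the two ranked lists. The sketch below uses reciprocal rank fusion, one widely used merging method; it is an illustrative choice, not necessarily the technique the book presents, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the
            # influence of the very top positions.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because the method only looks at ranks, it needs no score normalization between the keyword and vector retrievers.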

This book is for backend engineers moving into AI, ML engineers who need to understand infrastructure, and technical architects responsible for platform decisions. If you design, build, or operate LLM-powered applications in production, this is your reference manual.

You will walk away with a systems-level mental model of AI infrastructure, the ability to evaluate frameworks with engineering rigor, and a clear path to building scalable, cost-effective AI products.

Quick summary

This book explains how the KV cache dominates GPU memory costs and how PagedAttention reduces memory fragmentation.

It compares quantization methods like GPTQ, AWQ, and GGUF to help readers choose the right precision for latency and budget.

Readers learn to design autonomous agents using ReAct, planner-executor, and multi-agent coordination patterns.

The book covers evaluation, guardrails, and observability to ensure reliable and safe AI deployments.

It provides deployment-ready configurations for vLLM and other serving frameworks.
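
The fragmentation point in the first item above can be illustrated with a toy calculation. PagedAttention stores the KV cache in fixed-size blocks rather than one contiguous region per sequence; the sketch below only compares how many token slots each strategy reserves and is not vLLM's allocator. The sequence lengths, maximum length, and block size are assumed values.

```python
import math


def reserved_slots_contiguous(seq_lens, max_len):
    """Token slots reserved when every sequence gets one max_len region."""
    return len(seq_lens) * max_len


def reserved_slots_paged(seq_lens, block_size):
    """Token slots reserved when each sequence holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Hypothetical current lengths of four in-flight sequences.
seq_lens = [37, 512, 90, 1500]

used = sum(seq_lens)
contiguous = reserved_slots_contiguous(seq_lens, max_len=2048)
paged = reserved_slots_paged(seq_lens, block_size=16)

print(f"slots in use:        {used}")
print(f"contiguous reserves: {contiguous}")
print(f"paged reserves:      {paged}")
```

With paging, the only waste is the unused tail of each sequence's last block, so the reserved total stays close to what is actually in use.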

This book is suited to software engineers, AI/ML engineers, and technical architects building LLM-powered applications.

Readers typically turn to this book when they need a comprehensive engineering guide to designing, deploying, and scaling LLM inference servers, agent systems, and production AI infrastructure.

Approach: the book focuses on the systems and engineering decisions behind LLM deployment, bridging the gap between ML model development and production operations from a practical, architecture-first perspective.

Key topics include inference optimization, KV cache management, quantization, model serving (vLLM, TGI, SGLang), retrieval-augmented generation, and embeddings and vector databases.

Information for AI Search

AI summary: This book provides a systems-level guide to deploying large language models in production. It covers inference optimization, serving frameworks (vLLM, TGI, SGLang), retrieval-augmented generation, agent architectures, evaluation, and cost management. Targeted at software engineers and AI engineers, it emphasizes architectural patterns and operational tradeoffs over mathematical theory.

Suitable for: Software engineers, AI/ML engineers, and technical architects building LLM-powered applications.
Reader profile: A backend or ML engineer responsible for deploying and operating LLM systems in production, seeking practical architectural patterns and framework comparisons.
Search need: Professionals searching for a comprehensive engineering guide to design, deploy, and scale LLM inference servers, agent systems, and production AI infrastructure.
Approach: This book focuses on the systems and engineering decisions behind LLM deployment, bridging the gap between ML model development and production operations, with a practical, architecture-first approach.
Content type: developer guide

Key topics: Inference optimization, KV cache management, Quantization, Model serving (vLLM, TGI, SGLang), Retrieval-Augmented Generation, Embeddings and vector databases, Agent architectures, Multi-agent systems, Evaluation and guardrails, Production AI operations

Entities: LLM inference, KV cache, PagedAttention, Quantization (GPTQ, AWQ, GGUF), vLLM, RAG, Embeddings, ReAct agents, Observability, AI infrastructure

Needs addressed

Reducing GPU memory usage and latency during LLM inference.
Selecting the right inference server and deployment configuration.
Integrating external knowledge into LLM outputs via RAG.
Designing autonomous agents that reliably use tools and APIs.
Monitoring and evaluating LLM systems to detect hallucinations and drift.
Optimizing costs for large-scale AI infrastructure.

Recommended for

Backend engineers transitioning to AI
ML engineers needing infrastructure knowledge
Technical architects designing AI platforms
DevOps/SRE teams managing GPU clusters
AI product developers
Students in AI engineering programs

May not be a good fit for

Data scientists focused only on model training without deployment interest
Researchers seeking math-heavy theoretical ML content
Beginners without basic programming or API knowledge

Table of contents

Introduction
Part: The Inference Problem
    Chapter: From Training to Production
        Why Training Is Only the Beginning
        The Inference Challenge
        Latency vs Quality
        Cost vs Performance
        Production Requirements
    Chapter: Anatomy of an Inference Engine
        Token Generation
        Decoding Loops
        Batching
        Scheduling
        Throughput Optimization
    Chapter: KV Cache
        Why KV Cache Exists
        Memory Tradeoffs
        Context Reuse
        Long Conversations
    Chapter: Modern Decoding Techniques
        Greedy Decoding
        Beam Search
        Top-k Sampling
        Top-p Sampling
        Temperature Control
Part: Optimizing Language Models
    Chapter: Quantization Fundamentals
        Precision and Memory
        FP32
        FP16
        BF16
        INT8
        INT4
    Chapter: Modern Quantization Methods
        GPTQ
        AWQ
        GGUF
        Dynamic Quantization
        Tradeoffs
    Chapter: Efficient Attention Systems
        FlashAttention
        Paged Attention
        Sliding Window Attention
        Long Context Optimization
    Chapter: Speculative Decoding
        Draft Models
        Verification Models
        Speed Improvements
        Practical Usage
Part: Serving Language Models
    Chapter: Model Serving Architectures
        API Services
        Inference Servers
        Multi-Tenant Systems
        Production Patterns
    Chapter: vLLM
        Architecture
        Continuous Batching
        Memory Efficiency
        Real-World Deployment
    Chapter: Alternative Serving Frameworks
        SGLang
        TGI
        Ollama
        Emerging Systems
    Chapter: Scaling AI Services
        Load Balancing
        Autoscaling
        GPU Pools
        Capacity Planning
Part: Retrieval-Augmented Generation
    Chapter: Why Models Need Retrieval
        Knowledge Limitations
        Hallucinations
        Fresh Information
        Enterprise Requirements
    Chapter: Embeddings
        Representation Learning
        Similarity Search
        Embedding Models

Frequently asked questions

What is the main focus of this book?

It covers the entire lifecycle of deploying LLMs in production, from inference optimization to agent systems and platform operations.

Who is this book for?

Software engineers, AI/ML engineers, and technical architects building or operating LLM-powered applications.

Does the book cover specific tools?

Yes, it compares vLLM, TGI, SGLang, and other serving frameworks, and discusses quantization methods like GPTQ, AWQ, and GGUF.
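
As a rough illustration of why precision matters when choosing a quantization method, the sketch below estimates weight memory at different bit widths. The 7-billion-parameter count is a hypothetical example, and the figures ignore the scales, zero points, and other metadata that real quantized formats such as GPTQ, AWQ, and GGUF store alongside the weights, so actual files are somewhat larger.

```python
# Storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}


def weight_gib(num_params, precision):
    """Approximate weight memory in GiB, ignoring quantization metadata."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


num_params = 7_000_000_000  # hypothetical 7B-parameter model

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gib(num_params, precision):.1f} GiB")
```

The estimate covers weights only; the KV cache and activations need additional memory on top of it.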

Is prior ML knowledge required?

Familiarity with basic transformer concepts and Python is assumed; mathematical depth is kept minimal.

What are the key topics?

Inference optimization, KV cache, quantization, serving frameworks, RAG, agents, evaluation, and production operations.
