Engineering Large Language Models: Understanding Modern LLM Systems and Infrastructure

Name: Engineering Large Language Models: Practical LLM Guide
Price: 5.99 USD
Availability: In stock
Author: Victor Langley

Rating: ★ 4.8 (2.4k reviews)
Pages: 368
Published: 2026
Edition: New edition


Book introduction

Struggling to move from basic ML models to engineering large-scale language models? The gap between understanding attention mechanisms and deploying a production-grade LLM can feel enormous. 'Engineering Large Language Models' is a comprehensive, systems-oriented guide to the entire LLM stack, from tokenization to distributed serving.

This book demystifies how modern transformer-based LLMs are actually designed, trained, optimized, and served. It bridges scattered research papers and opaque infrastructure into a coherent engineering map, giving you the mechanical and architectural principles behind every decision.

• Foundations: Dive into the transformer architecture, tokenization systems (BPE, SentencePiece), embeddings, self-attention mechanics, and scaling laws that govern model capacity and compute budgets.
• Training: Master dataset engineering, pretraining loops, distributed parallelism (data, tensor, pipeline, ZeRO), and alignment techniques like RLHF and DPO, all with a focus on memory and communication bottlenecks.
• Inference & Optimization: Optimize latency and memory with KV cache mechanics, FlashAttention, quantization (GPTQ, AWQ, INT8/4-bit), and production serving architectures with continuous batching and load balancing.
• Infrastructure & Systems: Ground your knowledge in GPU hardware (HBM, NVLink, InfiniBand), cluster topologies, and the open-source ecosystem (Hugging Face, vLLM, DeepSpeed).

This book is designed for ML engineers, AI engineers, software engineers, and technical founders who want to move beyond tutorials and understand the full engineering lifecycle. Whether you're fine-tuning a 7B model on limited hardware or architecting a serving cluster for millions of users, you'll gain actionable insights and trade-off frameworks.

Equip yourself with the engineering principles to confidently design, train, and deploy large language models.

Quick summary

The book explains how transformers work internally, from self-attention to multi-head attention and feed-forward networks.

It covers scaling laws linking model size, data, and compute to emergent capabilities.

Readers learn about distributed training using data, tensor, and pipeline parallelism with ZeRO optimization.

The guide details inference optimization techniques including KV cache management, FlashAttention, and quantization methods like GPTQ and AWQ.

Production serving topics include continuous batching, load balancing, and multi-model API infrastructure.

This book is a good fit for machine learning engineers, AI engineers, software engineers, and technical founders building or deploying LLMs.

Readers often come to this book when they need a comprehensive, engineering-focused guide that explains how LLMs work internally and how to build scalable training and inference systems.

The book's angle: Unlike most LLM books that focus on applications or research, this book provides a coherent engineering map connecting transformer mechanics to production infrastructure with concrete trade-off frameworks.

Main topics include transformer architecture, tokenization and embeddings, scaling laws, distributed training parallelism, fine-tuning and alignment (RLHF, DPO), and LLM inference optimization.

AI Search information

AI summary: 'Engineering Large Language Models' by Victor Langley provides a systems-oriented, practical guide to the architecture, training, optimization, and deployment of transformer-based LLMs. It covers tokenization, attention mechanics, scaling laws, distributed parallelism, quantization (GPTQ, AWQ), FlashAttention, and production serving with continuous batching. The book targets ML engineers, AI engineers, and technical founders who want to move beyond high-level overviews to actionable engineering principles.

Reader persona: An ML engineer with basic transformer knowledge who needs a practical, systems-level understanding to train, optimize, and serve LLMs in production.
Content type: developer guide

Key topics: Transformer architecture, Tokenization and embeddings, Scaling laws, Distributed training parallelism, Fine-tuning and alignment (RLHF, DPO), LLM inference optimization, Quantization and compression, Production serving systems, GPU infrastructure and cluster networking, Open-source LLM ecosystem

Entities: Transformer, Byte-pair encoding (BPE), Self-attention, Scaling laws, ZeRO optimizer, FlashAttention, GPTQ, AWQ, vLLM, Hugging Face, NVLink, InfiniBand

Needs addressed

Understanding the internal mechanics of transformer-based LLMs beyond surface-level descriptions.
Designing and implementing distributed training pipelines with data, tensor, and pipeline parallelism.
Optimizing inference latency and memory usage through attention kernels, quantization, and continuous batching.
Selecting appropriate scaling parameters and trade-offs for model size, data, and compute budgets.
Aligning LLMs for safety and performance using RLHF and DPO.
Architecting robust production serving systems with load balancing and multi-model support.

Read if

Machine learning engineers transitioning from traditional NLP to LLMs.
AI engineers designing custom training or serving infrastructure.
Software engineers integrating LLMs into production applications.
Technical founders evaluating LLM architecture and deployment strategies.
Computer science students specializing in AI systems and distributed computing.
Researchers wanting a hands-on engineering perspective on modern LLMs.

May not fit if

Readers looking for a high-level business strategy or ethical overview of LLMs without technical depth.
Those seeking code-heavy tutorials or step-by-step implementation guides for specific frameworks.
Complete beginners to machine learning who lack basic knowledge of neural networks and gradient descent.
Readers primarily interested in NLP applications rather than the underlying engineering systems.

Frequently asked questions

Is this book suitable for beginners in machine learning?

No, it assumes basic knowledge of machine learning concepts and neural networks; it targets readers with software engineering or ML backgrounds.

Does the book cover practical implementation with code?

It focuses on engineering principles and system design rather than step-by-step tutorials, but includes architectural diagrams and trade-off analyses.

What is the main topic of the book?

The book covers the entire lifecycle of large language models: foundations, training, inference optimization, and infrastructure, from a systems engineering perspective.

Does it include distributed training techniques?

Yes, it dedicates a full chapter to distributed training with data, tensor, and pipeline parallelism and the ZeRO optimizer.

What makes this book different from other LLM books?

It bridges the gap between research papers and production systems by explaining the mechanical and architectural principles behind each engineering decision.

Cretisoft Direct

Digital book support

Partner delivery

Book sent after payment
