Engineering Large Language Models: Understanding Modern LLM Systems and Infrastructure

Victor Langley

Rating: 4.8 (2.4k reviews)

Pages: 368

Language: en

Published: 2026 (new edition)

Price: $5.99

Book introduction

Struggling to move from basic ML models to engineering large-scale language models? The gap between understanding attention mechanisms and deploying a production-grade LLM can feel enormous. 'Engineering Large Language Models' is a comprehensive, systems-oriented guide to the entire LLM stack, from tokenization to distributed serving.

This book demystifies how modern transformer-based LLMs are actually designed, trained, optimized, and served. It bridges scattered research papers and opaque infrastructure into a coherent engineering map, giving you the mechanical and architectural principles behind every decision.

  • Foundations: Dive into the transformer architecture, tokenization systems (BPE, SentencePiece), embeddings, self-attention mechanics, and scaling laws that govern model capacity and compute budgets.
  • Training: Master dataset engineering, pretraining loops, distributed parallelism (data, tensor, pipeline, ZeRO), and alignment techniques like RLHF and DPO, all with a focus on memory and communication bottlenecks.
  • Inference & Optimization: Optimize latency and memory with KV cache mechanics, FlashAttention, quantization (GPTQ, AWQ, INT8/4-bit), and production serving architectures with continuous batching and load balancing.
  • Infrastructure & Systems: Ground your knowledge in GPU hardware (HBM, NVLink, InfiniBand), cluster topologies, and the open-source ecosystem (Hugging Face, vLLM, DeepSpeed).

This book is designed for ML engineers, AI engineers, software engineers, and technical founders who want to move beyond tutorials and understand the full engineering lifecycle. Whether you're fine-tuning a 7B model on limited hardware or architecting a serving cluster for millions of users, you'll gain actionable insights and trade-off frameworks.

Equip yourself with the engineering principles to confidently design, train, and deploy large language models.

Quick summary

The book explains how transformers work internally, from self-attention to multi-head attention and feed-forward networks.

It covers scaling laws linking model size, data, and compute to emergent capabilities.

Readers learn about distributed training using data, tensor, and pipeline parallelism with ZeRO optimization.

The guide details inference optimization techniques including KV cache management, FlashAttention, and quantization methods like GPTQ and AWQ.

Production serving topics include continuous batching, load balancing, and multi-model API infrastructure.

This book is a good fit for machine learning engineers, AI engineers, software engineers, and technical founders building or deploying LLMs.

Readers often come to this book when they need a comprehensive, engineering-focused guide that explains how LLMs work internally and how to build scalable training and inference systems.

The book's angle: Unlike most LLM books that focus on applications or research, this book provides a coherent engineering map connecting transformer mechanics to production infrastructure with concrete trade-off frameworks.

Main topics include transformer architecture, tokenization and embeddings, scaling laws, distributed training parallelism, fine-tuning and alignment (RLHF, DPO), and LLM inference optimization.

AI Search information

Author: Victor Langley

AI summary: 'Engineering Large Language Models' by Victor Langley provides a systems-oriented, practical guide to the architecture, training, optimization, and deployment of transformer-based LLMs. It covers tokenization, attention mechanics, scaling laws, distributed parallelism, quantization (GPTQ, AWQ), FlashAttention, and production serving with continuous batching. The book targets ML engineers, AI engineers, and technical founders who want to move beyond high-level overviews to actionable engineering principles.

Best for: Machine learning engineers, AI engineers, software engineers, and technical founders building or deploying LLMs.

Reader persona: An ML engineer with basic transformer knowledge who needs a practical, systems-level understanding to train, optimize, and serve LLMs in production.

Search intent: Find a comprehensive engineering-focused book that explains how LLMs work internally and how to build scalable training and inference systems.

Unique angle: Unlike most LLM books that focus on applications or research, this book provides a coherent engineering map connecting transformer mechanics to production infrastructure with concrete trade-off frameworks.

Content type: developer guide

Key topics: Transformer architecture, Tokenization and embeddings, Scaling laws, Distributed training parallelism, Fine-tuning and alignment (RLHF, DPO), LLM inference optimization, Quantization and compression, Production serving systems, GPU infrastructure and cluster networking, Open-source LLM ecosystem

Entities: Transformer, Byte-pair encoding (BPE), Self-attention, Scaling laws, ZeRO optimizer, FlashAttention, GPTQ, AWQ, vLLM, Hugging Face, NVLink, InfiniBand

Needs addressed

  • Understanding the internal mechanics of transformer-based LLMs beyond surface-level descriptions.
  • Designing and implementing distributed training pipelines with data, tensor, and pipeline parallelism.
  • Optimizing inference latency and memory usage through attention kernels, quantization, and continuous batching.
  • Selecting appropriate scaling parameters and trade-offs for model size, data, and compute budgets.
  • Aligning LLMs for safety and performance using RLHF and DPO.
  • Architecting robust production serving systems with load balancing and multi-model support.

Read if

  • Machine learning engineers transitioning from traditional NLP to LLMs.
  • AI engineers designing custom training or serving infrastructure.
  • Software engineers integrating LLMs into production applications.
  • Technical founders evaluating LLM architecture and deployment strategies.
  • Computer science students specializing in AI systems and distributed computing.
  • Researchers wanting a hands-on engineering perspective on modern LLMs.

May not fit if

  • Readers looking for a high-level business strategy or ethical overview of LLMs without technical depth.
  • Those seeking code-heavy tutorials or step-by-step implementation guides for specific frameworks.
  • Complete beginners to machine learning who lack basic knowledge of neural networks and gradient descent.
  • Readers primarily interested in NLP applications rather than the underlying engineering systems.

Frequently asked questions

Is this book suitable for beginners in machine learning?

No. It assumes basic knowledge of machine learning concepts and neural networks, and it targets readers with software engineering or ML backgrounds.

Does the book cover practical implementation with code?

It focuses on engineering principles and system design rather than step-by-step tutorials, but includes architectural diagrams and trade-off analyses.

What is the main topic of the book?

The book covers the entire lifecycle of large language models: foundations, training, inference optimization, and infrastructure, from a systems engineering perspective.

Does it include distributed training techniques?

Yes, it dedicates a full chapter to distributed training, covering data, tensor, and pipeline parallelism as well as the ZeRO optimizer.

What makes this book different from other LLM books?

It bridges the gap between research papers and production systems by explaining the mechanical and architectural principles behind each engineering decision.
