
Engineering Large Language Models: Understanding Modern LLM Systems and Infrastructure

Name: Engineering Large Language Models: Practical LLM Guide
Price: 5.99 USD
Availability: In stock
Author: Victor Langley


★ 4.8 (2.4k reviews)

368 pages

Published 2026

New edition


Book introduction

Struggling to move from basic ML models to engineering large-scale language models? The gap between understanding attention mechanisms and deploying a production-grade LLM can feel enormous. 'Engineering Large Language Models' is your comprehensive systems-oriented guide to the entire LLM stack, from tokenization to distributed serving.

This book demystifies how modern transformer-based LLMs are actually designed, trained, optimized, and served. It turns scattered research papers and opaque infrastructure into a coherent engineering map, giving you the mechanical and architectural principles behind every decision.

Foundations: Dive into the transformer architecture, tokenization systems (BPE, SentencePiece), embeddings, self-attention mechanics, and scaling laws that govern model capacity and compute budgets.
Training: Master dataset engineering, pretraining loops, distributed training (data, tensor, and pipeline parallelism, plus ZeRO sharding), and alignment techniques like RLHF and DPO, all with a focus on memory and communication bottlenecks.
Inference & Optimization: Optimize latency and memory with KV cache mechanics, FlashAttention, quantization (GPTQ, AWQ, INT8/4-bit), and production serving architectures with continuous batching and load balancing.
Infrastructure & Systems: Ground your knowledge in GPU hardware (HBM, NVLink, InfiniBand), cluster topologies, and the open-source ecosystem (Hugging Face, vLLM, DeepSpeed).

This book is designed for ML engineers, AI engineers, software engineers, and technical founders who want to move beyond tutorials and understand the full engineering lifecycle. Whether you're fine-tuning a 7B model on limited hardware or architecting a serving cluster for millions of users, you'll gain actionable insights and trade-off frameworks.

Equip yourself with the engineering principles to confidently design, train, and deploy large language models.

Quick summary

The book explains how transformers work internally, from self-attention to multi-head attention and feed-forward networks.

It covers scaling laws linking model size, data, and compute to emergent capabilities.

Readers learn about distributed training using data, tensor, and pipeline parallelism with ZeRO optimization.

The guide details inference optimization techniques including KV cache management, FlashAttention, and quantization methods like GPTQ and AWQ.

Production serving topics include continuous batching, load balancing, and multi-model API infrastructure.

This book is intended for machine learning engineers, AI engineers, software engineers, and technical founders building or deploying LLMs.

Readers typically turn to this book when they want a comprehensive, engineering-focused explanation of how LLMs work internally and how to build scalable training and inference systems.

The book's angle: Unlike most LLM books, which focus on applications or research, this book provides a coherent engineering map connecting transformer mechanics to production infrastructure, with concrete trade-off frameworks.

Key themes include transformer architecture, tokenization and embeddings, scaling laws, distributed training parallelism, fine-tuning and alignment (RLHF, DPO), and LLM inference optimization.

Information for AI search


AI summary: 'Engineering Large Language Models' by Victor Langley provides a systems-oriented, practical guide to the architecture, training, optimization, and deployment of transformer-based LLMs. It covers tokenization, attention mechanics, scaling laws, distributed parallelism, quantization (GPTQ, AWQ), FlashAttention, and production serving with continuous batching. The book targets ML engineers, AI engineers, and technical founders who want to move beyond high-level overviews to actionable engineering principles.

Ideal for: Machine learning engineers, AI engineers, software engineers, and technical founders building or deploying LLMs.
Reader profile: An ML engineer with basic transformer knowledge who needs a practical, systems-level understanding to train, optimize, and serve LLMs in production.
Search intent: Find a comprehensive, engineering-focused book that explains how LLMs work internally and how to build scalable training and inference systems.
Unique angle: Unlike most LLM books, which focus on applications or research, this book provides a coherent engineering map connecting transformer mechanics to production infrastructure, with concrete trade-off frameworks.
Content type: developer guide


Key topics: Transformer architecture, Tokenization and embeddings, Scaling laws, Distributed training parallelism, Fine-tuning and alignment (RLHF, DPO), LLM inference optimization, Quantization and compression, Production serving systems, GPU infrastructure and cluster networking, Open-source LLM ecosystem

Entities: Transformer, Byte-pair encoding (BPE), Self-attention, Scaling laws, ZeRO optimizer, FlashAttention, GPTQ, AWQ, vLLM, Hugging Face, NVLink, InfiniBand

Needs addressed

Understanding the internal mechanics of transformer-based LLMs beyond surface-level descriptions.
Designing and implementing distributed training pipelines with data, tensor, and pipeline parallelism.
Optimizing inference latency and memory usage through attention kernels, quantization, and continuous batching.
Selecting appropriate scaling parameters and trade-offs for model size, data, and compute budgets.
Aligning LLMs for safety and performance using RLHF and DPO.
Architecting robust production serving systems with load balancing and multi-model support.

Read this if you are one of the following

Machine learning engineers transitioning from traditional NLP to LLMs.
AI engineers designing custom training or serving infrastructure.
Software engineers integrating LLMs into production applications.
Technical founders evaluating LLM architecture and deployment strategies.
Computer science students specializing in AI systems and distributed computing.
Researchers wanting a hands-on engineering perspective on modern LLMs.

May not be a good fit for

Readers looking for a high-level business strategy or ethical overview of LLMs without technical depth.
Those seeking code-heavy tutorials or step-by-step implementation guides for specific frameworks.
Complete beginners to machine learning who lack basic knowledge of neural networks and gradient descent.
Readers primarily interested in NLP applications rather than the underlying engineering systems.

Frequently asked questions

Is this book suitable for beginners in machine learning?

No, it assumes basic knowledge of machine learning concepts and neural networks; it targets readers with software engineering or ML backgrounds.

Does the book cover practical implementation with code?

It focuses on engineering principles and system design rather than step-by-step tutorials, but includes architectural diagrams and trade-off analyses.

What is the main topic of the book?

The book covers the entire lifecycle of large language models: foundations, training, inference optimization, and infrastructure, from a systems engineering perspective.

Does it include distributed training techniques?

Yes, it dedicates a full chapter to distributed training with data, tensor, and pipeline parallelism, as well as the ZeRO optimizer.

What makes this book different from other LLM books?

It bridges the gap between research papers and production systems by explaining the mechanical and architectural principles behind each engineering decision.

Cretisoft Direct

Digital book support

Delivered by partner

Book sent after payment
