ML Systems

Building and deploying ML models at scale means constantly trading off compute cost, latency, and memory. This chapter covers the systems techniques that manage those trade-offs: distributed training, mixed precision, inference optimization, and profiling.

Distributed Training: DDP, model parallelism, FSDP, DeepSpeed
Mixed Precision: FP32/FP16/BF16/FP8, loss scaling, torch.amp
Inference Optimization: KV-cache, quantization, speculative decoding, vLLM
Profiling: torch.profiler, nsys, ncu, roofline model
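
As a small illustration of the compute-versus-memory tension described above, the roofline model listed under Profiling bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs performed per byte moved). The sketch below is a minimal version under stated assumptions: the function names and the hardware numbers are hypothetical, chosen only to keep the arithmetic easy to follow, and do not describe any particular device.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound on throughput (FLOP/s).

    peak_flops:            peak compute rate in FLOP/s
    mem_bandwidth:         memory bandwidth in bytes/s
    arithmetic_intensity:  FLOPs performed per byte moved (FLOP/byte)
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator (assumed numbers, not a real device spec).
PEAK_FLOPS = 100e12   # 100 TFLOP/s
MEM_BW = 1e12         # 1 TB/s

ridge = ridge_point(PEAK_FLOPS, MEM_BW)

# A low-intensity kernel (2 FLOP/byte) is limited by memory bandwidth.
low = attainable_flops(PEAK_FLOPS, MEM_BW, 2.0)

# A high-intensity kernel (500 FLOP/byte) is limited by peak compute.
high = attainable_flops(PEAK_FLOPS, MEM_BW, 500.0)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"low-intensity bound:  {low:.2e} FLOP/s")
print(f"high-intensity bound: {high:.2e} FLOP/s")
```

Kernels whose intensity falls below the ridge point are memory-bound, so reducing bytes moved (for example with lower-precision formats or quantization) helps more than adding compute. Kernels above it are compute-bound.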