ML Systems
Building and deploying ML models at scale means trading off compute cost, latency, and memory against one another. This chapter covers the systems techniques for managing those trade-offs: distributed training, mixed precision, inference optimization, and profiling.
- Distributed Training: DDP, model parallelism, FSDP, DeepSpeed
- Mixed Precision: FP32/FP16/BF16/FP8, loss scaling, torch.amp
- Inference Optimization: KV-cache, quantization, speculative decoding, vLLM
- Profiling: torch.profiler, nsys, ncu, roofline model
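As a small taste of the profiling material, the roofline model from the last item can be sketched in a few lines: attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The hardware numbers below (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical round figures chosen for illustration, not the specifications of any real accelerator. The matmul intensity estimate assumes each operand and the output cross the memory bus exactly once.

```python
# Minimal roofline sketch. All hardware figures are hypothetical.

PEAK_FLOPS = 100e12      # assumed compute roof: 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s


def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Idealized intensity of an (m x k) @ (k x n) matmul.

    Assumes A, B, and the output each move through memory exactly once,
    with 2-byte elements (FP16/BF16) by default.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
    shapes = {
        "single-row matmul (1 x 4096 x 4096)": (1, 4096, 4096),
        "square matmul (4096 x 4096 x 4096)": (4096, 4096, 4096),
    }
    for name, (m, n, k) in shapes.items():
        ai = matmul_intensity(m, n, k)
        bound = "memory-bound" if ai < ridge else "compute-bound"
        perf = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, ai)
        print(f"{name}: {ai:.1f} FLOP/byte, {bound}, <= {perf / 1e12:.1f} TFLOP/s")
```

Under these assumptions, the single-row case falls far below the ridge point and the large square case sits well above it. This is why small-batch inference tends to be limited by memory bandwidth, while large training matmuls tend to be limited by compute.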