Training modern ML models requires distributing computation across multiple devices. This chapter develops the mathematical framework for understanding parallelism, its fundamental limits, and the communication patterns used in distributed training.
Parallel architectures are classified by how instructions and data are organized:
| Model | Full Name | Description | ML Example |
|-------|-----------|-------------|------------|
| SIMD | Single Instruction, Multiple Data | One instruction operates on multiple data elements | AVX-512 vector instructions on CPU |
| SIMT | Single Instruction, Multiple Threads | Threads in a warp execute the same instruction (with masking for divergence) | NVIDIA GPU warps (32 threads) |
| MIMD | Multiple Instructions, Multiple Data | Independent processors execute different instructions | Multi-core CPU, multi-node cluster |
| SPMD | Single Program, Multiple Data | Same program runs on all processors, each on different data | `torchrun` distributed training |
In SIMT execution, when threads in a warp take different branches (divergence), both branches execute serially with the inactive threads masked. This means `if-else` in GPU kernels can halve throughput. Minimizing warp divergence is a key GPU optimization principle.
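The masked-execution behavior can be illustrated with a toy model in plain Python (the function `warp_branch` is an illustrative name, not a real CUDA API): both branch bodies run over every lane, and the per-lane predicate mask only selects which result each lane keeps.

```python
def warp_branch(lanes, predicate, then_fn, else_fn):
    """Toy SIMT branch: the whole warp executes BOTH branch bodies in
    sequence; a per-lane mask decides which result each lane keeps."""
    mask = [predicate(x) for x in lanes]
    then_results = [then_fn(x) for x in lanes]  # all lanes run 'then'
    else_results = [else_fn(x) for x in lanes]  # all lanes run 'else'
    return [t if m else e for m, t, e in zip(mask, then_results, else_results)]

# A divergent warp pays for both branches even though each lane uses one.
out = warp_branch([1, -2, 3, -4], lambda x: x > 0, lambda x: x * 2, lambda x: -x)
# out == [2, 2, 6, 4]
```

This is why the cost of a divergent `if-else` is the sum of both branch costs, not the maximum.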
If a fraction $p$ of a computation is perfectly parallelizable and the remaining fraction $(1-p)$ is inherently serial, the maximum speedup with $n$ processors is:
$$S(n) = \frac{1}{(1-p) + \frac{p}{n}} \;\xrightarrow{\;n \to \infty\;}\; \frac{1}{1-p}$$
Proof. Total time with one processor is $T_1 = T_{\text{serial}} + T_{\text{parallel}}$. With $n$ processors, the parallel part speeds up by a factor of $n$: $T_n = T_{\text{serial}} + T_{\text{parallel}}/n = (1-p)T_1 + pT_1/n$. The speedup is $S(n) = T_1/T_n = 1/((1-p) + p/n)$. $\square$
**Distributed training overhead.** Suppose 5% of your training pipeline is serial (data loading on rank 0, gradient synchronization latency, logging, checkpointing). Then:
- With 8 GPUs: $S(8) = 1/(0.05 + 0.95/8) = 1/0.169 \approx 5.9\times$ (74% efficiency)
- With 64 GPUs: $S(64) = 1/(0.05 + 0.95/64) = 1/0.065 \approx 15.4\times$ (24% efficiency)
- Theoretical maximum: $S(\infty) = 1/0.05 = 20\times$
This shows why scaling beyond a certain point yields diminishing returns. The 5% serial fraction caps the speedup at 20× no matter how many GPUs are added.
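The numbers above follow directly from the formula; a minimal sketch (the function name `amdahl_speedup` is ours):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# 5% serial fraction: efficiency collapses long before the 20x ceiling.
for n in (8, 64):
    s = amdahl_speedup(0.95, n)
    print(f"{n:>3} GPUs: {s:4.1f}x speedup, {s / n:.0%} efficiency")
```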
**Pipeline parallelism bubble.** In pipeline parallelism with $p$ stages and $m$ micro-batches, the pipeline "bubble" (time where some stages are idle) takes a fraction $\frac{p-1}{m + p - 1}$ of total time. For $p = 8$ stages and $m = 32$ micro-batches, the bubble is $7/39 \approx 18\%$, limiting pipeline speedup to roughly $8 \times 0.82 = 6.6\times$.
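The bubble arithmetic, as a sketch (function name `pipeline_bubble` is ours):

```python
def pipeline_bubble(p: int, m: int) -> float:
    """Idle-time fraction of a p-stage pipeline fed m micro-batches."""
    return (p - 1) / (m + p - 1)

bubble = pipeline_bubble(8, 32)       # 7/39, about 18%
effective_speedup = 8 * (1 - bubble)  # about 6.6x out of the ideal 8x
```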
Amdahl's law assumes a fixed problem size. **Gustafson's law** assumes fixed execution time and asks: how much more work can $n$ processors accomplish?
$$S(n) = n - (1-p)(n-1) = 1 + p(n-1)$$
This gives a scaled speedup that grows linearly with $n$ when $p$ is close to 1.
Gustafson's law is the more relevant model for ML. With more GPUs, we do not just train the same model faster -- we train *larger models* or on *more data*. A model trained on 1024 GPUs is not the same workload as on 1 GPU scaled up; it is a fundamentally larger problem. This is why large-scale training achieves near-linear scaling: the serial fraction shrinks as the problem grows.
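The contrast between the two laws is stark at scale; a sketch comparing both at the same 95% parallel fraction (function names are ours):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup for a fixed-size problem."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson's law: scaled speedup when the problem grows with n."""
    return 1 + p * (n - 1)

# 1024 processors, 95% parallel: fixed-size vs scaled workload.
print(amdahl_speedup(0.95, 1024))     # ~19.6x, pinned near the 20x ceiling
print(gustafson_speedup(0.95, 1024))  # ~972.9x, near-linear
```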
Large training workloads are distributed across devices by partitioning either the data or the model itself. Three fundamental parallelism strategies exist: data parallelism, tensor parallelism, and pipeline parallelism.
Each of $N$ devices holds a complete copy of the model parameters $W$. The dataset is sharded so device $i$ processes mini-batch $X_i$. Forward and backward passes run independently. Gradients are synchronized via **allreduce** before the optimizer step:
$$\nabla W = \frac{1}{N} \sum_{i=1}^{N} \nabla_W \, \mathcal{L}\big(f(X_i; W),\, Y_i\big)$$
The effective batch size is $N \times B_{\text{local}}$, where $B_{\text{local}}$ is the per-device batch size.
Data parallelism does not reduce per-device memory for model parameters and optimizer states -- each device holds a full copy. For a model with $P$ parameters in FP32 with Adam optimizer, per-device memory is $\approx 16P$ bytes ($4P$ parameters + $4P$ gradients + $4P$ first moments + $4P$ second moments). **ZeRO** [@rajbhandari2020zero] shards these across devices, reducing per-device memory to $\approx 16P/N$ bytes.
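The memory arithmetic can be made explicit. A simplified sketch that counts only FP32 parameters, gradients, and Adam moments (ignoring activations and any mixed-precision copies), with the `zero` flag corresponding to ZeRO stage 3, which shards all of this state; the function name is ours:

```python
def adam_bytes_per_device(num_params: float, num_devices: int = 1,
                          zero: bool = False) -> float:
    """~16 bytes/param in FP32 with Adam: 4 (params) + 4 (grads)
    + 4 + 4 (first and second moments). ZeRO shards all four buffers."""
    total = 16 * num_params
    return total / num_devices if zero else total

p = 7e9                                            # a 7B-parameter model
plain = adam_bytes_per_device(p)                   # 112 GB on every device
sharded = adam_bytes_per_device(p, 8, zero=True)   # 14 GB per device
```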
A single weight matrix $W \in \mathbb{R}^{m \times n}$ is partitioned across $k$ devices. The two most common partitioning schemes are:
- **Column-parallel:** $W = [W_1 \mid W_2 \mid \cdots \mid W_k]$ where $W_i \in \mathbb{R}^{m \times n/k}$. Each device computes $Y_i = XW_i$, producing a shard of the output. An allgather reconstructs the full output.
- **Row-parallel:** $W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_k \end{bmatrix}$ where $W_i \in \mathbb{R}^{m/k \times n}$. Input $X$ is sharded along its columns accordingly, and each device computes $Y_i = X_i W_i$. A reduce-scatter (or allreduce) combines the partial sums.
Megatron-LM [@shoeybi2019megatron] pairs a column-parallel first linear layer with a row-parallel second layer in the transformer FFN: because the intermediate nonlinearity is elementwise, the column-parallel output shards feed the row-parallel layer directly without being gathered, so only one allreduce per FFN block is needed instead of two collective pairs. This pairing roughly halves the communication per block.
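Both partitioning schemes can be checked numerically on tiny matrices. A pure-Python sketch where horizontal concatenation stands in for allgather and elementwise addition for allreduce (helper names are ours):

```python
def matmul(A, B):
    """Naive dense matmul for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def hconcat(*mats):
    """Stand-in for allgather: concatenate output shards column-wise."""
    return [sum(rows, []) for rows in zip(*mats)]

def madd(A, B):
    """Stand-in for allreduce: sum partial products elementwise."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                           # 2x4 activations
W = [[1, 0], [0, 1], [2, 0], [0, 2]]         # 4x2 weights, k = 2 "devices"

# Column-parallel: device i holds one column of W.
W1, W2 = [[r[0]] for r in W], [[r[1]] for r in W]
Y_col = hconcat(matmul(X, W1), matmul(X, W2))

# Row-parallel: device i holds half the rows of W and matching columns of X.
Xa, Xb = [r[:2] for r in X], [r[2:] for r in X]
Wa, Wb = W[:2], W[2:]
Y_row = madd(matmul(Xa, Wa), matmul(Xb, Wb))

assert Y_col == Y_row == matmul(X, W)        # both recover the full product
```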
The model's $L$ layers are partitioned into $p$ contiguous stages across $p$ devices. Device $i$ owns layers $\ell_{i-1}+1$ through $\ell_i$ and processes activations sequentially:
$$h^{(\ell)} = f_\ell\!\left(h^{(\ell-1)}\right), \qquad \ell = 1, \dots, L$$
To minimize the pipeline bubble, the mini-batch is split into $m$ micro-batches that are pipelined through the stages. GPipe (Huang et al., 2019) accumulates gradients across micro-batches; PipeDream (Narayanan et al., 2019) uses asynchronous weight updates.
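The bubble fraction can be read directly off a simulated schedule. A sketch assuming unit time per stage per micro-batch and modeling only the forward sweep (the function name `gpipe_schedule` is ours):

```python
def gpipe_schedule(p: int, m: int):
    """Forward schedule for p stages and m micro-batches: entry [t][s] is
    the micro-batch stage s runs at step t, or None for an idle (bubble)
    slot. Micro-batch b enters stage s at step b + s."""
    steps = m + p - 1
    return [[t - s if 0 <= t - s < m else None for s in range(p)]
            for t in range(steps)]

sched = gpipe_schedule(3, 4)
idle = sum(row.count(None) for row in sched)  # 6 bubble slots
total = len(sched) * 3                        # 18 stage-steps overall
# idle / total == (p - 1) / (m + p - 1) == 2/6
```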
Note: bandwidth costs are per device; $m$ is the total message size in bytes and $N$ is the number of devices.
The **ring allreduce** algorithm arranges $N$ devices in a logical ring. It proceeds in two phases:
- **Reduce-scatter phase:** In $N-1$ steps, each device sends one chunk ($m/N$ bytes) to its neighbor and accumulates the received chunk. After this phase, each device holds the fully reduced version of one chunk.
- **Allgather phase:** In $N-1$ steps, each device propagates its fully reduced chunk around the ring.

Total bandwidth per device: $2 \cdot \frac{N-1}{N} \cdot m$ bytes. This is bandwidth-optimal -- it matches the information-theoretic lower bound.
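The two phases can be simulated directly. A toy sketch with scalar chunks and lockstep steps (`ring_allreduce` here is our simulation, not a library call); each device sends $2(N-1)$ chunks of $m/N$ bytes, matching the bandwidth bound:

```python
def ring_allreduce(data):
    """Simulate ring allreduce. data[i][c] = chunk c held by device i
    (numbers). Every device ends up holding the per-chunk sums."""
    n = len(data)
    buf = [list(d) for d in data]
    # Phase 1, reduce-scatter: in each of N-1 steps, device i sends one
    # chunk to device i+1, which accumulates it.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]   # (sender, chunk)
        vals = [buf[i][c] for i, c in sends]              # snapshot sends
        for (i, c), v in zip(sends, vals):
            buf[(i + 1) % n][c] += v
    # Phase 2, allgather: fully reduced chunks travel the ring, overwriting.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        vals = [buf[i][c] for i, c in sends]
        for (i, c), v in zip(sends, vals):
            buf[(i + 1) % n][c] = v
    return buf

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3 devices, 3 chunks each
result = ring_allreduce(data)
assert all(row == [12, 15, 18] for row in result)
```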
**Allreduce cost for gradient sync.** A 70B parameter model with FP16 gradients has $m = 70 \times 10^9 \times 2 = 140$ GB of gradient data. With 8 GPUs on NVLink (900 GB/s bidirectional per link):
- Ring allreduce bandwidth per device: $2 \times \frac{7}{8} \times 140 = 245$ GB
- Time: $245 / 450 \approx 0.54$ seconds (using half-duplex bandwidth per direction)
- This is why gradient accumulation (computing multiple forward/backward passes before synchronizing) is used to amortize communication cost.
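The estimate above is just the bandwidth term; a sketch that ignores per-step latency (function name is ours):

```python
def ring_allreduce_seconds(msg_bytes: float, n: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only time estimate for ring allreduce on n devices."""
    per_device_traffic = 2 * (n - 1) / n * msg_bytes
    return per_device_traffic / link_bytes_per_s

grads = 70e9 * 2                              # 70B params in FP16 -> 140 GB
t = ring_allreduce_seconds(grads, 8, 450e9)   # ~0.54 s at 450 GB/s/direction
```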
Modern distributed training frameworks overlap communication with computation by splitting the backward pass into chunks. Because the backward pass proceeds from the last layer to the first, while gradients for layer $\ell$ are being computed, the allreduce for layer $\ell+1$ (whose gradients are already complete) proceeds concurrently:
Perfect overlap is achieved when $T_{\text{compute}} \geq T_{\text{comm}}$, i.e., when the computation for each chunk takes longer than the communication for the previous chunk.
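The effect of overlap can be modeled per chunk. A simplified sketch assuming a single allreduce channel, with lists ordered last layer first to match backward order (function name is ours):

```python
def backward_step_time(t_bwd, t_comm, overlap=True):
    """Total backward + gradient-sync time. With overlap, each chunk's
    allreduce runs concurrently with the compute of later chunks."""
    if not overlap:
        return sum(t_bwd) + sum(t_comm)
    done_compute = 0.0
    done_comm = 0.0                  # when the comm channel frees up
    for b, c in zip(t_bwd, t_comm):
        done_compute += b            # this chunk's gradients are ready
        done_comm = max(done_comm, done_compute) + c
    return max(done_compute, done_comm)

# When compute per chunk >= comm per chunk, only the last allreduce is
# exposed: 3.5 time units instead of the sequential 4.5.
t = backward_step_time([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```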
**Scaling efficiency** is defined as $\eta = S(N) / N$ where $S(N)$ is the actual speedup with $N$ devices. State-of-the-art distributed training achieves $\eta > 0.9$ for hundreds of GPUs by:
- Using fast interconnects (NVLink within a node, InfiniBand across nodes)
- Overlapping allreduce with backward computation
- Using large batch sizes to increase the compute-to-communication ratio
- Combining data, tensor, and pipeline parallelism (3D parallelism) to minimize total communication
For example, training LLaMA-65B on 2048 A100 GPUs achieved roughly 50% MFU (model FLOPs utilization), meaning half the theoretical peak compute was utilized -- the rest was lost to communication, pipeline bubbles, and memory-bound operations (Touvron et al., 2023).