
Parallel and Distributed Computing

Training modern ML models requires distributing computation across multiple devices. This chapter develops the mathematical framework for understanding parallelism, its fundamental limits, and the communication patterns used in distributed training.

Parallel Computing Models

Parallel architectures are classified by how instructions and data are organized:
| Model | Full Name | Description | ML Example |
|---|---|---|---|
| SIMD | Single Instruction, Multiple Data | One instruction operates on multiple data elements | AVX-512 vector instructions on CPU |
| SIMT | Single Instruction, Multiple Threads | Threads in a warp execute the same instruction (with masking for divergence) | NVIDIA GPU warps (32 threads) |
| MIMD | Multiple Instructions, Multiple Data | Independent processors execute different instructions | Multi-core CPU, multi-node cluster |
| SPMD | Single Program, Multiple Data | Same program runs on all processors, each on different data | torchrun distributed training |
In SIMT execution, when threads in a warp take different branches (divergence), both branches execute serially with the inactive threads masked. This means `if-else` in GPU kernels can halve throughput. Minimizing warp divergence is a key GPU optimization principle.
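The masking behavior can be illustrated with a toy model (plain Python, not real GPU semantics): a warp must execute every branch path taken by *any* of its threads, so a warp that diverges pays for both paths. The `warp_execute` function and its cost model are hypothetical, for illustration only.

```python
# Toy model of SIMT predicated execution: when threads in a warp diverge,
# the hardware runs BOTH branch paths, masking off inactive threads.
# Cost is counted in "branch paths executed" (illustrative, not real timing).

def warp_execute(values):
    """Path count for one warp executing `if v > 0: A else: B` per thread."""
    mask_true = [v > 0 for v in values]
    paths = 0
    if any(mask_true):        # some threads take the `if` path
        paths += 1
    if not all(mask_true):    # some threads take the `else` path
        paths += 1
    return paths

uniform = warp_execute([1] * 32)       # no divergence: one path executed
diverged = warp_execute([1, -1] * 16)  # divergence: both paths executed
print(uniform, diverged)               # 1 2
```

With a two-way branch, divergence doubles the executed paths, which is the source of the "halve throughput" figure above.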

Amdahl's Law: The Limits of Parallelism

If a fraction $p$ of a computation is perfectly parallelizable and the remaining fraction $(1-p)$ is inherently serial, the maximum speedup with $n$ processors is:

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}} \xrightarrow{n \to \infty} \frac{1}{1 - p}$$

Proof. Total time with one processor is $T_1 = T_{\text{serial}} + T_{\text{parallel}}$. With $n$ processors, the parallel part speeds up by a factor of $n$: $T_n = T_{\text{serial}} + T_{\text{parallel}}/n = (1-p)T_1 + pT_1/n$. The speedup is $S(n) = T_1/T_n = 1/((1-p) + p/n)$. $\square$

**Distributed training overhead.** Suppose 5% of your training pipeline is serial (data loading on rank 0, gradient synchronization latency, logging, checkpointing). Then:

- With 8 GPUs: $S(8) = 1/(0.05 + 0.95/8) = 1/0.169 \approx 5.9\times$ (74% efficiency)
- With 64 GPUs: $S(64) = 1/(0.05 + 0.95/64) = 1/0.065 \approx 15.4\times$ (24% efficiency)
- Theoretical maximum: $S(\infty) = 1/0.05 = 20\times$

This shows why scaling beyond a certain point yields diminishing returns. The 5% serial fraction caps the speedup at $20\times$ no matter how many GPUs are added.
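The worked numbers above can be reproduced with a few lines; this is a minimal sketch, with the helper name `amdahl_speedup` chosen here for illustration.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup with n processors when a fraction p is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for n in (8, 64):
    s = amdahl_speedup(0.95, n)
    print(f"{n} GPUs: speedup {s:.1f}x, efficiency {s / n:.0%}")
# 8 GPUs: speedup 5.9x, efficiency 74%
# 64 GPUs: speedup 15.4x, efficiency 24%
print(f"limit: {1 / (1 - 0.95):.0f}x")  # limit: 20x
```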

**Pipeline parallelism bubble.** In pipeline parallelism with $p$ stages and $m$ micro-batches, the pipeline "bubble" (time where some stages are idle) takes a fraction $\frac{p-1}{m + p - 1}$ of total time. For $p = 8$ stages and $m = 32$ micro-batches, the bubble is $7/39 \approx 18\%$, limiting pipeline speedup to roughly $8 \times 0.82 = 6.6\times$.
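The bubble arithmetic is easy to get wrong by hand, so a small check helps; `bubble_fraction` and `pipeline_speedup` are illustrative helper names, not a library API.

```python
def bubble_fraction(p, m):
    """Fraction of pipeline time lost to the bubble: (p-1)/(m+p-1)."""
    return (p - 1) / (m + p - 1)

def pipeline_speedup(p, m):
    """Ideal p-stage speedup discounted by the bubble fraction."""
    return p * (1 - bubble_fraction(p, m))

print(f"bubble: {bubble_fraction(8, 32):.1%}")     # bubble: 17.9%
print(f"speedup: {pipeline_speedup(8, 32):.1f}x")  # speedup: 6.6x
```

Increasing $m$ shrinks the bubble, which is why more micro-batches (at the cost of smaller per-micro-batch kernels) improve pipeline efficiency.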

Gustafson's Law: Scaling the Problem

Amdahl's law assumes a fixed problem size. **Gustafson's law** assumes fixed execution time and asks: how much more work can $n$ processors accomplish?

$$S(n) = n - (1 - p)(n - 1) = 1 + p(n - 1)$$

This gives a scaled speedup that grows linearly with $n$ when $p$ is close to 1.

Gustafson's law is the more relevant model for ML. With more GPUs, we do not just train the same model faster -- we train *larger models* or on *more data*. A model trained on 1024 GPUs is not the same workload as on 1 GPU scaled up; it is a fundamentally larger problem. This is why large-scale training achieves near-linear scaling: the serial fraction shrinks as the problem grows.
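The contrast between the two laws is stark at the same $p$ and $n$; the sketch below recomputes both for $p = 0.95$, $n = 64$ (the function name is illustrative).

```python
def gustafson_speedup(p, n):
    """Gustafson scaled speedup: S(n) = 1 + p(n - 1)."""
    return 1 + p * (n - 1)

n, p = 64, 0.95
fixed_size = 1 / ((1 - p) + p / n)  # Amdahl: same problem, more GPUs
scaled = gustafson_speedup(p, n)    # Gustafson: bigger problem, same time
print(f"Amdahl: {fixed_size:.1f}x, Gustafson: {scaled:.2f}x")
# Amdahl: 15.4x, Gustafson: 60.85x
```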

Distributed Linear Algebra

Large matrix operations are distributed across devices by partitioning the matrices. Three fundamental strategies exist: data parallelism, tensor parallelism, and pipeline parallelism.

Data Parallelism

Each of $N$ devices holds a complete copy of the model parameters $W$. The dataset is sharded so device $i$ processes mini-batch $X_i$. Forward and backward passes run independently. Gradients are synchronized via **allreduce** before the optimizer step:

$$\nabla W = \frac{1}{N} \sum_{i=1}^{N} \nabla_W \mathcal{L}(f(X_i; W), Y_i)$$

The effective batch size scales as $N \times B_{\text{local}}$, where $B_{\text{local}}$ is the per-device batch size.
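The synchronization step can be sketched without any framework: each "device" computes a gradient on its own shard, and an allreduce-style mean leaves every replica with the identical gradient. This is a pure-Python toy (a 1-D least-squares gradient standing in for backprop); the helper names are illustrative.

```python
# Minimal data-parallel sketch: per-device gradients, then allreduce-mean.

def local_gradient(w, shard):
    """Gradient of mean squared error 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    """Stand-in for ring allreduce: average the per-device gradients."""
    g = sum(grads) / len(grads)
    return [g] * len(grads)  # every device ends up with the same value

w = 0.0
shards = [[(1.0, 2.0), (2.0, 4.0)],  # device 0's mini-batch
          [(3.0, 6.0), (4.0, 8.0)]]  # device 1's mini-batch
synced = allreduce_mean([local_gradient(w, s) for s in shards])
print(synced[0] == synced[1])  # True: replicas stay consistent
```

Because every replica applies the same averaged gradient, the parameter copies never drift, which is what makes the full-replication scheme correct.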

Data parallelism does not reduce per-device memory for model parameters and optimizer states -- each device holds a full copy. For a model with $P$ parameters in FP32 with the Adam optimizer, per-device memory is $\approx 16P$ bytes ($4P$ parameters + $4P$ gradients + $4P$ first moments + $4P$ second moments). **ZeRO** [@rajbhandari2020zero] shards these states across devices, reducing per-device memory to $\approx 16P/N$ bytes.

Tensor Parallelism

A single weight matrix $W \in \mathbb{R}^{m \times n}$ is partitioned across $k$ devices. The two most common partitioning schemes are:

**Column-parallel:** $W = [W_1 \mid W_2 \mid \cdots \mid W_k]$ where $W_i \in \mathbb{R}^{m \times n/k}$. Each device computes $Y_i = X W_i$, producing a shard of the output. An allgather reconstructs the full output.

**Row-parallel:** $W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_k \end{bmatrix}$ where $W_i \in \mathbb{R}^{m/k \times n}$. Input $X$ is sharded accordingly, and each device computes the partial product $Y_i = X_i W_i$. A reduce-scatter (or allreduce) combines the partial sums.
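Both schemes can be verified on a tiny matmul in plain Python; the sketch below (with illustrative variable names) checks that concatenating column-parallel shards and summing row-parallel partial products each recover the full product $XW$.

```python
# Verify column- and row-parallel partitioning on a tiny matmul (k = 2).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1, 2], [3, 4]]                 # 2x2 input
W = [[5, 6, 7, 8], [9, 10, 11, 12]]  # 2x4 weight, split across 2 devices

# Column-parallel: each device holds half of W's columns; allgather = concat.
W1 = [row[:2] for row in W]
W2 = [row[2:] for row in W]
col_out = [a + b for a, b in zip(matmul(X, W1), matmul(X, W2))]

# Row-parallel: each device holds half of W's rows plus the matching slice
# of X's columns; allreduce = elementwise sum of the partial products.
partial1 = matmul([[r[0]] for r in X], [W[0]])
partial2 = matmul([[r[1]] for r in X], [W[1]])
row_out = [[a + b for a, b in zip(p, q)] for p, q in zip(partial1, partial2)]

assert col_out == row_out == matmul(X, W)  # both recover the full product
```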

Megatron-LM [@shoeybi2019megatron] pairs a column-parallel first linear layer in the transformer FFN with a row-parallel second layer, so only one allreduce per FFN block is needed (instead of two allgather + reduce-scatter pairs). This pairing is the key insight that halves communication.

Pipeline Parallelism

The model's $L$ layers are partitioned into $p$ contiguous stages across $p$ devices. Device $i$ owns layers $\ell_{i-1}+1$ through $\ell_i$ and processes activations sequentially:

$$h^{(\ell)} = f_\ell(h^{(\ell-1)}), \quad \ell = 1, \ldots, L$$

To minimize the pipeline bubble, the mini-batch is split into $m$ micro-batches that are pipelined through the stages. GPipe (Huang et al., 2019) accumulates gradients across micro-batches; PipeDream (Narayanan et al., 2019) uses asynchronous weight updates.

Communication Primitives

| Primitive | Operation | Bandwidth Cost | Latency Cost | Used In |
|---|---|---|---|---|
| Broadcast | Copy from one device to all | $m$ | $O(\log N)$ | Distributing initial weights |
| Reduce | Aggregate all to one | $m$ | $O(\log N)$ | Collecting global loss |
| Allreduce | Reduce + broadcast | $\frac{2(N-1)}{N} m$ | $O(\log N)$ | Gradient sync (DDP) |
| Allgather | Gather shards to all | $\frac{N-1}{N} m$ | $O(\log N)$ | Tensor parallelism output |
| Reduce-scatter | Reduce + scatter result | $\frac{N-1}{N} m$ | $O(\log N)$ | ZeRO, tensor parallelism |
| All-to-all | Each device sends shard to each | $\frac{N-1}{N} m$ | $O(N)$ | Expert parallelism (MoE) |

Note: bandwidth costs are per device. $m$ is the total message size in bytes, $N$ is the number of devices.

The **ring allreduce** algorithm arranges $N$ devices in a logical ring. It proceeds in two phases:
  1. Reduce-scatter phase: In $N-1$ steps, each device sends one chunk ($m/N$ bytes) to its neighbor and accumulates the received chunk. After this phase, each device holds the fully reduced version of one chunk.
  2. Allgather phase: In $N-1$ steps, each device propagates its fully reduced chunk around the ring.

Total bandwidth per device: $2 \cdot \frac{N-1}{N} \cdot m$ bytes. This is bandwidth-optimal -- it matches the information-theoretic lower bound.
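The two phases can be simulated directly in Python. This is a sequential toy (real implementations exchange chunks concurrently over NCCL or MPI), but it executes exactly the per-step chunk routing described above: at each of the $N-1$ steps, every device passes one chunk to its right-hand neighbor.

```python
# Pure-Python simulation of ring allreduce: N devices, buffer split into N
# chunks, two phases of N-1 steps, one chunk (m/N bytes) sent per step.

def ring_allreduce(buffers):
    """buffers[d] is device d's list of N chunk values; returns synced copies."""
    N = len(buffers)
    chunks = [list(b) for b in buffers]
    # Phase 1: reduce-scatter. After N-1 steps, device d holds the fully
    # reduced chunk (d + 1) % N.
    for step in range(N - 1):
        for d in range(N):
            c = (d - step - 1) % N  # chunk arriving at device d this step
            chunks[d][c] += chunks[(d - 1) % N][c]
    # Phase 2: allgather. Each device forwards completed chunks around the
    # ring until every device holds every fully reduced chunk.
    for step in range(N - 1):
        for d in range(N):
            c = (d - step) % N
            chunks[d][c] = chunks[(d - 1) % N][c]
    return chunks

devices = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # 3 devices, 3 chunks
print(ring_allreduce(devices))
# [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Counting sends: each device transmits one chunk in each of the $2(N-1)$ steps, i.e. $2(N-1) \cdot m/N$ bytes, matching the formula above.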

**Allreduce cost for gradient sync.** A 70B parameter model with FP16 gradients has $m = 70 \times 10^9 \times 2 = 140$ GB of gradient data. With 8 GPUs on NVLink (900 GB/s bidirectional per link):

- Ring allreduce bandwidth per device: $2 \times \frac{7}{8} \times 140 = 245$ GB
- Time: $245 / 450 \approx 0.54$ seconds (using half-duplex bandwidth per direction)
- This is why gradient accumulation (computing multiple forward/backward passes before synchronizing) is used to amortize communication cost.
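A back-of-envelope estimator makes this calculation reusable for other model sizes and interconnects; the function name and the 450 GB/s half-duplex figure are the illustrative assumptions from the example above, not measured numbers.

```python
# Estimate ring-allreduce time: per-device traffic / per-direction bandwidth.

def ring_allreduce_seconds(params, bytes_per_param, n_devices, link_gbps):
    """Seconds to allreduce gradients with the ring algorithm."""
    m = params * bytes_per_param                   # message size in bytes
    traffic = 2 * (n_devices - 1) / n_devices * m  # bytes sent per device
    return traffic / (link_gbps * 1e9)

t = ring_allreduce_seconds(70e9, 2, 8, 450)  # 70B params, FP16, 8 GPUs
print(f"{t:.2f} s")  # 0.54 s
```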

Communication--Computation Overlap

Modern distributed training frameworks overlap communication with computation by splitting the backward pass into chunks. As gradients for layer $\ell$ are computed, allreduce for layer $\ell+1$ (already completed) proceeds concurrently:

$$T_{\text{total}} = T_{\text{compute}} + T_{\text{comm}} - T_{\text{overlap}} \geq \max(T_{\text{compute}}, T_{\text{comm}})$$

since $T_{\text{overlap}} \leq \min(T_{\text{compute}}, T_{\text{comm}})$, with equality at perfect overlap.

Perfect overlap is achieved when $T_{\text{compute}} \geq T_{\text{comm}}$, i.e., when the computation for each chunk takes longer than the communication for the previous chunk.
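The chunked schedule can be modeled with a short loop: each layer's backward compute runs while the previous layer's allreduce is in flight, and only the final allreduce cannot be hidden. The function and the millisecond values are illustrative assumptions, not a real profiler trace.

```python
# Toy time model for communication-computation overlap in the backward pass.

def step_time(compute_per_layer, comm_per_layer):
    """Backward-pass time when each layer's compute overlaps the pending
    allreduce of the previously finished layer's gradient."""
    total = 0.0
    pending_comm = 0.0
    for comp, comm in zip(compute_per_layer, comm_per_layer):
        total += max(comp, pending_comm)  # compute hides the pending allreduce
        pending_comm = comm               # this layer's allreduce starts now
    return total + pending_comm           # drain the last allreduce

compute = [4.0, 4.0, 4.0, 4.0]  # ms per layer (assumed)
comm = [3.0, 3.0, 3.0, 3.0]     # ms per layer (assumed)
print(step_time(compute, comm))  # 19.0, vs 28.0 with no overlap at all
```

With $T_{\text{compute}} \geq T_{\text{comm}}$ per chunk, only the final allreduce adds to the critical path, matching the perfect-overlap condition above.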

**Scaling efficiency** is defined as $\eta = S(N) / N$ where $S(N)$ is the actual speedup with $N$ devices. State-of-the-art distributed training achieves $\eta > 0.9$ for hundreds of GPUs by:
  1. Using fast interconnects (NVLink within a node, InfiniBand across nodes)
  2. Overlapping allreduce with backward computation
  3. Using large batch sizes to increase the compute-to-communication ratio
  4. Combining data, tensor, and pipeline parallelism (3D parallelism) to minimize total communication

For example, training LLaMA-65B on 2048 A100 GPUs achieved ~50% MFU (Model FLOP Utilization), meaning half the theoretical peak compute was utilized -- the rest was lost to communication, pipeline bubbles, and memory-bound operations (Touvron et al., 2023).

Notation Summary

| Symbol | Meaning |
|---|---|
| $p$ | Parallel fraction (Amdahl) or number of pipeline stages |
| $N$ | Number of processors/devices |
| $S(N)$ | Speedup with $N$ processors |
| $\eta$ | Scaling efficiency: $S(N)/N$ |
| $m$ | Message size (bytes) |
| $B_{\text{local}}$ | Per-device batch size |
| $F$ | Device compute throughput (FLOPS) |
| $B$ | Interconnect bandwidth (bytes/s) |
| MFU | Model FLOP Utilization |
| SIMD, SIMT, SPMD | Parallel execution models |
| DDP | Distributed Data Parallel |
| FSDP / ZeRO | Fully Sharded Data Parallel / Zero Redundancy Optimizer |

References