
# Numerical Computing Hardware

Every ML algorithm ultimately runs on physical hardware that represents real numbers with finite precision and moves data through a memory hierarchy. Understanding these constraints is not optional -- they determine which algorithms are fast, which are numerically stable, and why certain design choices (batch size, precision, operator fusion) matter.

## Von Neumann Architecture

Modern CPUs follow the **Von Neumann model**: a single shared memory stores both instructions and data, connected to a processing unit (ALU), a control unit, and I/O devices via buses. The processor fetches instructions from memory, decodes them, and executes them sequentially (with pipelining and out-of-order execution to hide latency).

The key performance bottleneck is the **memory wall** -- processor speed has improved roughly $1000\times$ more than memory latency over the past four decades. As a result, modern computation is almost always limited by data movement, not arithmetic.

## Memory Hierarchy

To mitigate the memory wall, hardware uses a hierarchy of progressively larger but slower memories:

| Level | Latency | Typical Size | Bandwidth | Cost/GB |
|---|---|---|---|---|
| Registers | ~0.3 ns | ~KB | ~TB/s | -- |
| L1 cache | ~1 ns | ~64 KB/core | ~1 TB/s | -- |
| L2 cache | ~4 ns | ~256 KB--1 MB/core | ~500 GB/s | -- |
| L3 cache | ~12 ns | ~30--50 MB (shared) | ~200 GB/s | -- |
| CPU DRAM | ~100 ns | ~64--512 GB | ~50--100 GB/s | ~$3/GB |
| GPU HBM3 | ~300 ns | ~80 GB (H100) | ~3.35 TB/s | ~$15/GB |
| NVMe SSD | ~10 μs | ~TB | ~7 GB/s | ~$0.1/GB |
| Network (NVLink) | ~1 μs | -- | ~900 GB/s | -- |
| Network (InfiniBand) | ~1 μs | -- | ~400 GB/s | -- |

Algorithm design for ML must respect the memory hierarchy. The key metric is **arithmetic intensity** -- the ratio of FLOPs performed to bytes moved. Operations with high arithmetic intensity (like large matrix multiplications) are *compute-bound* and use the GPU efficiently; operations with low arithmetic intensity (element-wise operations, reductions, small matrix multiplies) are *memory-bound* and limited by bandwidth. Formally, the arithmetic intensity $I$ of an operation is:

$$I = \frac{\text{FLOPs}}{\text{Bytes accessed from memory}}$$

An operation is compute-bound if $I > I^*$ (the hardware's FLOP/byte ratio) and memory-bound if $I < I^*$. For an NVIDIA H100, $I^* \approx 80\ \text{TFLOPS} / 3.35\ \text{TB/s} \approx 24$ FLOPs/byte for FP32.

**Roofline analysis of matrix multiplication.** For $C = AB$ where $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times n}$:

- FLOPs: $2n^3$ (each of $n^2$ output entries requires $n$ multiply-adds)
- Bytes (if reading each matrix once): $3 \times 4n^2 = 12n^2$ bytes (FP32)
- Arithmetic intensity: $I = 2n^3 / 12n^2 = n/6$

For $n = 4096$: $I \approx 683$, which far exceeds $I^* \approx 24$, so the operation is compute-bound. For $n = 32$: $I \approx 5.3 < I^*$, so the operation is memory-bound. This is why batching small operations together improves GPU utilization.
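
This roofline classification is easy to reproduce. A minimal sketch in Python, using the $I^* \approx 24$ FLOPs/byte H100 FP32 figure quoted above:

```python
def matmul_intensity(n: int, bytes_per_el: int = 4) -> float:
    """Arithmetic intensity of an n x n matmul, assuming each matrix
    is moved to/from memory exactly once (FP32: 4 bytes/element)."""
    flops = 2 * n**3                       # n^2 outputs, n multiply-adds each
    bytes_moved = 3 * bytes_per_el * n**2  # read A and B, write C
    return flops / bytes_moved

I_STAR = 24.0  # approximate H100 FP32 FLOP/byte ratio

for n in (32, 4096):
    i = matmul_intensity(n)
    bound = "compute-bound" if i > I_STAR else "memory-bound"
    print(f"n={n:5d}: I = {i:6.1f} FLOPs/byte ({bound})")
```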

## Floating-Point Representation

Real numbers are represented in **IEEE 754** format as:

$$x = (-1)^s \times (1 + f) \times 2^{e - \text{bias}}$$

where $s \in \{0,1\}$ is the sign bit, $f = 0.b_1 b_2 \ldots b_p$ is the fractional part of the significand (with an implicit leading 1 for normalized numbers), $e$ is the biased exponent, and $\text{bias} = 2^{w-1} - 1$ for an exponent field of $w$ bits.

Special values: $e = 0$ with $f = 0$ represents $\pm 0$; $e = 0$ with $f \neq 0$ represents subnormal numbers (no implicit leading 1); $e = 2^w - 1$ with $f = 0$ represents $\pm\infty$; $e = 2^w - 1$ with $f \neq 0$ represents NaN (not a number).
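These fields can be inspected directly by unpacking the raw bits of an FP32 value. A minimal sketch using Python's standard `struct` module (the helper name `fp32_fields` is ours):

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction bits) of an FP32 value."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(fp32_fields(1.0))           # (0, 127, 0): e equals the bias, f = 0
print(fp32_fields(float("inf")))  # (0, 255, 0): exponent all ones, f = 0
print(fp32_fields(-0.0))          # (1, 0, 0): negative zero
```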

| Format | Total Bits | Exponent Bits | Mantissa Bits | Range | Precision | ML Use |
|---|---|---|---|---|---|---|
| FP64 | 64 | 11 | 52 | $\sim 10^{\pm 308}$ | ~15 digits | Numerical verification |
| FP32 | 32 | 8 | 23 | $\sim 10^{\pm 38}$ | ~7 digits | Default training, master weights |
| TF32 | 19 | 8 | 10 | $\sim 10^{\pm 38}$ | ~3 digits | Tensor Core matmuls (A100+) |
| FP16 | 16 | 5 | 10 | $\sim 6.5 \times 10^{4}$ | ~3 digits | Mixed-precision training |
| BF16 | 16 | 8 | 7 | $\sim 10^{\pm 38}$ | ~2 digits | Preferred for LLM training |
| FP8 (E4M3) | 8 | 4 | 3 | $\pm 448$ | ~1 digit | Inference, forward pass |
| FP8 (E5M2) | 8 | 5 | 2 | $\sim 10^{\pm 4}$ | <1 digit | Gradients in FP8 training |
| INT8 | 8 | -- | -- | $-128$ to $127$ | Exact | Post-training quantization |
| INT4 | 4 | -- | -- | $-8$ to $7$ | Exact | Aggressive quantization (GPTQ, AWQ) |

BF16 keeps the same 8-bit exponent as FP32, so it can represent the same range of magnitudes. This is critical for training, where gradient magnitudes can vary by many orders of magnitude. FP16's 5-bit exponent limits its range to $\sim 6.5 \times 10^4$, causing overflow in gradients that BF16 handles effortlessly. The tradeoff is precision: BF16 has only 7 mantissa bits (vs. 10 for FP16), so individual values are less accurate, but this rarely matters for stochastic gradient descent.
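The range difference is easy to demonstrate. The sketch below overflows a value in FP16 via NumPy and simulates BF16 by truncating an FP32 value's low 16 bits (real conversions typically round to nearest, so truncation is only an approximation):

```python
import struct

import numpy as np

def to_bf16(x: float) -> float:
    """Approximate BF16: keep FP32's sign, 8 exponent bits, top 7 mantissa bits."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    return struct.unpack(">f", ((bits >> 16) << 16).to_bytes(4, "big"))[0]

g = 70000.0           # a gradient magnitude above FP16's max (~65504)
print(np.float16(g))  # overflows to inf: the gradient is lost
print(to_bf16(g))     # finite in BF16, just less precise (~2-3 digits)
```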

## Machine Epsilon and Numerical Stability

**Machine epsilon** $\epsilon_{\text{mach}}$ is the smallest positive floating-point number $\epsilon$ such that $\text{fl}(1 + \epsilon) \neq 1$, where $\text{fl}(\cdot)$ denotes rounding to the nearest representable float:

$$\epsilon_{\text{mach}} = 2^{-p}$$

where $p$ is the number of significand bits (including the implicit leading 1 bit). This means every floating-point operation introduces a relative error bounded by $\epsilon_{\text{mach}}/2$.

| Format | $p$ (significand bits) | $\epsilon_{\text{mach}}$ |
|---|---|---|
| FP64 | 53 | $\approx 1.11 \times 10^{-16}$ |
| FP32 | 24 | $\approx 5.96 \times 10^{-8}$ |
| FP16 | 11 | $\approx 4.88 \times 10^{-4}$ |
| BF16 | 8 | $\approx 3.91 \times 10^{-3}$ |
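
A standard way to find $\epsilon_{\text{mach}}$ empirically is to halve a candidate until adding it to 1.0 no longer changes the result; for FP64 (Python floats, round-to-nearest-even) the loop lands on $2^{-53} \approx 1.11 \times 10^{-16}$, matching the table:

```python
def machine_eps() -> float:
    """Halve eps until 1.0 + eps rounds back to exactly 1.0."""
    eps = 1.0
    while 1.0 + eps != 1.0:
        eps /= 2.0
    return eps

print(machine_eps())  # 2**-53 for FP64
```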

## Common Numerical Pitfalls in ML

An algorithm is **numerically stable** if small perturbations in its inputs (such as floating-point rounding) produce only small perturbations in its outputs. It is **backward stable** if the computed result is the exact result for a slightly perturbed input.

| Pitfall | Example | Fix |
|---|---|---|
| Catastrophic cancellation | $a - b$ when $a \approx b$ loses all significant digits | Reformulate: compute $\log(1+x)$ via `log1p(x)`, not `log(1+x)` |
| Overflow in softmax | $e^{1000}$ overflows in FP32 | Shift: $\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$ |
| Underflow in log-probs | $\log(e^{-1000})$ underflows to $\log(0) = -\infty$ | Use log-space: $\text{logsumexp}(x) = \max(x) + \log \sum_j e^{x_j - \max(x)}$ |
| Variance accumulation | Naive variance formula $\frac{1}{n}\sum x_i^2 - \bar{x}^2$ is unstable | Use Welford's online algorithm |
| Gradient explosion | Gradients overflow in deep networks | Gradient clipping, careful initialization, normalization layers |
| Loss of orthogonality | Gram--Schmidt loses orthogonality in finite precision | Use modified Gram--Schmidt or Householder reflections |

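
Welford's update can be sketched directly. The dataset below (values near $10^9$ with a tiny spread) is our own illustrative choice, picked so the naive $\frac{1}{n}\sum x_i^2 - \bar{x}^2$ formula cancels catastrophically even in FP64:

```python
def welford_variance(xs) -> float:
    """One-pass variance that never subtracts two huge, nearly equal numbers."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the *updated* mean
    return m2 / len(xs)           # population variance

data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]  # true variance: 22.5

naive = sum(x * x for x in data) / len(data) - (sum(data) / len(data)) ** 2
print(naive)                   # wildly wrong -- can even come out negative
print(welford_variance(data))  # ~22.5
```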
**Why the softmax shift works.** For logits $z = [1000, 1001, 1002]$:

- Naive: $e^{1000}, e^{1001}, e^{1002}$ all overflow to $+\infty$, giving NaN.
- Shifted: subtract $\max(z) = 1002$ to get $e^{-2}, e^{-1}, e^{0}$, which are all representable.
- The shift does not change the output because $\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{z_i}/e^c}{\sum_j e^{z_j}/e^c} = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
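The shifted softmax and logsumexp can be sketched in a few lines of NumPy:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # largest exponent becomes e^0 = 1
    return e / e.sum()

def logsumexp(z):
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

p = softmax([1000.0, 1001.0, 1002.0])  # naive e^1000 would overflow
print(p, p.sum())                      # finite probabilities summing to 1
print(logsumexp([1000.0, 1001.0, 1002.0]))
```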

## GPU Architecture for ML

A modern NVIDIA GPU (e.g., H100) is organized as:
  • Streaming Multiprocessors (SMs): 132 SMs, each containing CUDA cores, Tensor Cores, shared memory, and an L1 cache
  • Tensor Cores: Specialized units that compute small matrix multiplies (e.g., $4 \times 4$ FP16 matmul) in a single clock cycle
  • Global memory (HBM3): 80 GB at 3.35 TB/s bandwidth, shared across all SMs
  • Shared memory / L1: ~228 KB per SM, programmer-managed scratchpad
  • Warp: 32 threads that execute in lockstep (SIMT model)

| Property | CPU (AMD EPYC 9654) | GPU (NVIDIA H100 SXM) |
|---|---|---|
| Cores | 96 | 16,896 CUDA + 528 Tensor Cores |
| Clock speed | ~3.7 GHz | ~1.8 GHz |
| Peak FP32 TFLOPS | ~5 | ~67 |
| Peak FP16/BF16 TFLOPS | ~10 | ~990 (Tensor Core) |
| Peak INT8 TOPS | ~20 | ~1,979 (Tensor Core) |
| Memory | 768 GB DDR5 | 80 GB HBM3 |
| Memory bandwidth | ~460 GB/s | ~3,350 GB/s |
| TDP | 360 W | 700 W |

**Tensor Cores** are the key to GPU performance for ML. They compute $D = A \times B + C$ for small matrices (e.g., $16 \times 16 \times 16$) in a single operation, fusing multiply and add. This is why operations that can be expressed as matrix multiplications (linear layers, attention, convolutions via im2col) are fast, while operations that cannot (layer norm, dropout, activation functions) are memory-bound.

Matrix multiplication $C = AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$ requires $2mkn$ FLOPs but only reads/writes $mk + kn + mn$ elements. The arithmetic intensity $I = \frac{2mkn}{4(mk + kn + mn)}$ (for FP32) grows with matrix dimensions, making it ideal for GPU parallelism. This is one reason why larger batch sizes and wider layers train faster per element -- they produce larger, more compute-bound matrix multiplications.
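The batch-size effect can be made concrete: for a linear layer $Y = XW$ with $X \in \mathbb{R}^{b \times d}$, arithmetic intensity grows with batch size $b$. A sketch (the width $d = 4096$ is chosen for illustration):

```python
def linear_intensity(b: int, d_in: int, d_out: int, bytes_per_el: int = 4) -> float:
    """FLOPs/byte for a (b x d_in) @ (d_in x d_out) matmul in FP32."""
    flops = 2 * b * d_in * d_out
    bytes_moved = bytes_per_el * (b * d_in + d_in * d_out + b * d_out)
    return flops / bytes_moved

for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: I = {linear_intensity(b, 4096, 4096):7.1f} FLOPs/byte")
```

At batch size 1 the weight matrix dominates the bytes moved, so intensity is below 1 FLOP/byte; large batches amortize the weight traffic over many rows.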

## The Compute--Memory Tradeoff in ML

Many ML design decisions are fundamentally about trading compute for memory or vice versa:

| Technique | Saves | Costs |
|---|---|---|
| Gradient checkpointing | Activation memory (by $\sim\sqrt{L}$) | Recomputes forward pass (~33% overhead) |
| Mixed precision | Memory and compute | Slight accuracy risk (mitigated by loss scaling) |
| Operator fusion | Memory bandwidth (eliminates intermediate writes) | Kernel development effort |
| FlashAttention | Memory ($O(T^2) \to O(T)$) | Slightly more FLOPs (recomputation in backward) |
| Quantization | Memory and bandwidth | Accuracy degradation |

Understanding this tradeoff is central to ML systems engineering.
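
As a back-of-the-envelope for gradient checkpointing with a $\sqrt{L}$ checkpointing strategy (the layer count $L = 48$ and the fwd:bwd FLOP ratio of 1:2 are illustrative assumptions, not from the text):

```python
import math

L = 48                       # layer count (assumed, for illustration)
seg = round(math.sqrt(L))    # checkpoint every ~sqrt(L) layers

store_all = L                # plain training: one activation tensor per layer
store_ckpt = L // seg + seg  # checkpoints + one recomputed segment in flight

extra_flops = 1 / (1 + 2)    # one extra forward on top of fwd (1) + bwd (2)

print(f"activation memory: {store_all} -> {store_ckpt} layer-activations")
print(f"compute overhead: ~{extra_flops:.0%}")
```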

## Notation Summary

| Symbol | Meaning |
|---|---|
| $s, f, e$ | Sign, fractional significand, exponent in IEEE 754 |
| $p$ | Number of significand bits |
| $\epsilon_{\text{mach}}$ | Machine epsilon |
| $I$ | Arithmetic intensity (FLOPs/byte) |
| $I^*$ | Hardware's compute-to-bandwidth ratio |
| FLOPS | Floating-point operations per second |
| TFLOPS | $10^{12}$ FLOPS |
| HBM | High Bandwidth Memory (GPU) |
| SM | Streaming Multiprocessor |
| SIMT | Single Instruction, Multiple Threads |