Every ML algorithm ultimately runs on physical hardware that represents real numbers with finite precision and moves data through a memory hierarchy. Understanding these constraints is not optional -- they determine which algorithms are fast, which are numerically stable, and why certain design choices (batch size, precision, operator fusion) matter.
Modern CPUs follow the **Von Neumann model**: a single shared memory stores both instructions and data, connected to a processing unit (ALU), a control unit, and I/O devices via buses. The processor fetches instructions from memory, decodes them, and executes them sequentially (with pipelining and out-of-order execution to hide latency).
The key performance bottleneck is the **memory wall**: over the past four decades, processor speed has improved roughly 1000× more than memory latency has. As a result, modern computation is almost always limited by data movement, not arithmetic.
To mitigate the memory wall, hardware uses a hierarchy of progressively larger but slower memories:
| Level | Latency | Typical Size | Bandwidth | Cost/GB |
|---|---|---|---|---|
| Registers | ~0.3 ns | ~KB | ~TB/s | -- |
| L1 cache | ~1 ns | ~64 KB/core | ~1 TB/s | -- |
| L2 cache | ~4 ns | ~256 KB--1 MB/core | ~500 GB/s | -- |
| L3 cache | ~12 ns | ~30--50 MB (shared) | ~200 GB/s | -- |
| CPU DRAM | ~100 ns | ~64--512 GB | ~50--100 GB/s | ~$3/GB |
| GPU HBM3 | ~300 ns | ~80 GB (H100) | ~3.35 TB/s | ~$15/GB |
| NVMe SSD | ~10 μs | ~TB | ~7 GB/s | ~$0.10/GB |
| Network (NVLink) | ~1 μs | -- | ~900 GB/s | -- |
| Network (InfiniBand) | ~1 μs | -- | ~400 GB/s | -- |
Algorithm design for ML must respect the memory hierarchy. The key metric is **arithmetic intensity** -- the ratio of FLOPs performed to bytes moved. Operations with high arithmetic intensity (like large matrix multiplications) are *compute-bound* and use the GPU efficiently. Operations with low arithmetic intensity (like element-wise operations, reductions, or small matrix multiplies) are *memory-bound* and limited by bandwidth.
The **arithmetic intensity** $I$ of an operation is:

$$I = \frac{\text{FLOPs}}{\text{bytes accessed from memory}}$$

An operation is compute-bound if $I > I^*$ (the hardware's FLOP/byte ratio) and memory-bound if $I < I^*$. For an NVIDIA H100, $I^* \approx 67\ \text{TFLOPS} / 3.35\ \text{TB/s} \approx 20$ FLOPs/byte for FP32.

For an $n \times n$ matrix multiply in FP32, $I = \frac{2n^3}{4 \cdot 3n^2} = \frac{n}{6}$ FLOPs/byte. For $n = 4096$: $I \approx 683$, which far exceeds $I^* \approx 20$, so the operation is compute-bound. For $n = 32$: $I \approx 5.3 < I^*$, so the operation is memory-bound. This is why batching small operations together improves GPU utilization.
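These classifications can be checked in a few lines (a minimal sketch; the FLOP/byte threshold is taken as peak FP32 throughput over HBM3 bandwidth from the H100 specs quoted in this chapter):

```python
def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """Arithmetic intensity of an n x n matmul: 2n^3 FLOPs over 3n^2 elements moved."""
    return (2 * n**3) / (3 * n**2 * bytes_per_element)

# H100 threshold: peak FP32 FLOP/s divided by HBM3 bandwidth in bytes/s
i_star = 67e12 / 3.35e12   # ~20 FLOPs/byte

for n in (32, 4096):
    i = matmul_intensity(n)
    kind = "compute-bound" if i > i_star else "memory-bound"
    print(f"n={n}: I = {i:.1f} FLOPs/byte -> {kind}")
```

For square matrices the intensity simplifies to $n/6$, so it grows linearly with matrix size.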
Real numbers are represented in **IEEE 754** format as:

$$x = (-1)^s \times (1 + f) \times 2^{e - \text{bias}}$$

where $s \in \{0, 1\}$ is the sign bit, $f = 0.b_1 b_2 \ldots b_{p-1}$ is the fractional part of the significand (with an implicit leading 1 for normalized numbers), $e$ is the biased exponent, and $\text{bias} = 2^{w-1} - 1$ for an exponent field of $w$ bits.
Special values: $e = 0$ with $f = 0$ represents $\pm 0$ (and $e = 0$ with $f \neq 0$ gives subnormal numbers); $e = 2^w - 1$ with $f = 0$ represents $\pm\infty$; $e = 2^w - 1$ with $f \neq 0$ represents NaN (not a number).
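A quick way to see these fields is to unpack a value's raw bits with Python's standard `struct` module (an illustrative sketch for the 32-bit format, $w = 8$, bias 127):

```python
import struct

def fp32_fields(x: float):
    """Return (sign, biased exponent, fraction bits) of an FP32 value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # w = 8 exponent bits, bias = 127
    fraction = bits & 0x7FFFFF       # 23 fraction bits
    return sign, exponent, fraction

# 1.0 = (-1)^0 * (1 + 0) * 2^(127 - 127)
print(fp32_fields(1.0))              # (0, 127, 0)
print(fp32_fields(-2.5))             # (1, 128, 0x200000): 1.25 * 2^1, negated
print(fp32_fields(float("inf")))     # (0, 255, 0): exponent all ones, zero fraction
print(fp32_fields(float("nan")))     # exponent all ones, nonzero fraction
```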
| Format | Total Bits | Exponent Bits | Mantissa Bits | Range | Precision | ML Use |
|---|---|---|---|---|---|---|
| FP64 | 64 | 11 | 52 | $\sim 10^{\pm 308}$ | ~15 digits | Numerical verification |
| FP32 | 32 | 8 | 23 | $\sim 10^{\pm 38}$ | ~7 digits | Default training, master weights |
| TF32 | 19 | 8 | 10 | $\sim 10^{\pm 38}$ | ~3 digits | Tensor Core matmuls (A100+) |
| FP16 | 16 | 5 | 10 | $\pm 6.5 \times 10^4$ | ~3 digits | Mixed-precision training |
| BF16 | 16 | 8 | 7 | $\sim 10^{\pm 38}$ | ~2 digits | Preferred for LLM training |
| FP8 (E4M3) | 8 | 4 | 3 | $\pm 448$ | ~1 digit | Inference, forward pass |
| FP8 (E5M2) | 8 | 5 | 2 | $\sim 10^{\pm 4}$ | < 1 digit | Gradients in FP8 training |
| INT8 | 8 | -- | -- | $-128$ to $127$ | Exact | Post-training quantization |
| INT4 | 4 | -- | -- | $-8$ to $7$ | Exact | Aggressive quantization (GPTQ, AWQ) |
BF16 keeps the same 8-bit exponent as FP32, so it can represent the same range of magnitudes. This is critical for training, where gradient magnitudes can vary by many orders of magnitude. FP16's 5-bit exponent limits its range to $\sim 6.5 \times 10^4$, causing overflow in gradients that BF16 handles effortlessly. The tradeoff is precision: BF16 has only 7 mantissa bits (vs. 10 for FP16), so individual values are less accurate, but this rarely matters for stochastic gradient descent.
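The range difference is easy to demonstrate with NumPy. Stock NumPy has `float16` but no bfloat16 dtype, so the sketch below emulates BF16 by truncating an FP32 value to its top 16 bits (an illustration only; real hardware conversions round to nearest rather than truncate):

```python
import numpy as np

grad = np.float32(1e5)       # a gradient magnitude above FP16's max (~6.5e4)
print(np.float16(grad))      # inf: FP16 overflows

def to_bf16(x) -> np.float32:
    """Emulate BF16 by zeroing the low 16 bits of an FP32 value (truncation)."""
    bits = np.float32(x).view(np.uint32)
    return np.uint32(bits & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bf16(grad))         # 99840.0: in range, but only ~3 significant digits survive
```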
**Machine epsilon** $\epsilon_{\text{mach}}$ is the gap between 1 and the next representable floating-point number:

$$\epsilon_{\text{mach}} = 2^{-(p-1)}$$

where $p$ is the number of significand bits (including the implicit leading 1). For FP32, $p = 24$, so $\epsilon_{\text{mach}} = 2^{-23} \approx 1.19 \times 10^{-7}$. With round-to-nearest, $\text{fl}(x) = x(1 + \delta)$ with $|\delta| \le \epsilon_{\text{mach}}/2$, so every floating-point operation introduces a relative error bounded by $\epsilon_{\text{mach}}/2$ (the unit roundoff).
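Machine epsilon can also be found empirically by halving a candidate until adding half of it to 1 no longer changes the result (a small NumPy sketch):

```python
import numpy as np

def machine_eps(dtype):
    """Smallest power-of-two eps with fl(1 + eps) != 1 in the given dtype."""
    one = dtype(1)
    eps = dtype(1)
    while one + eps / dtype(2) != one:
        eps = eps / dtype(2)
    return eps

print(machine_eps(np.float32))  # 2**-23, matches np.finfo(np.float32).eps
print(machine_eps(np.float16))  # 2**-10
```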
An algorithm is **numerically stable** if small perturbations in its inputs (such as floating-point rounding) produce only small perturbations in its outputs. It is **backward stable** if the computed result is the exact result for a slightly perturbed input.
| Pitfall | Example | Fix |
|---|---|---|
| Catastrophic cancellation | $a - b$ when $a \approx b$ loses all significant digits | Reformulate; e.g., compute $\log(1 + x)$ with `log1p(x)`, not `log(1 + x)` |
| Overflow in softmax | $e^{1000}$ overflows in FP32 | Shift: $\text{softmax}(x)_i = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$ |
| Underflow in log-probs | $\log(e^{-1000})$ underflows to $\log(0) = -\infty$ | Log-space: $\text{logsumexp}(x) = \max(x) + \log \sum_j e^{x_j - \max(x)}$ |
| Loss of orthogonality | Gram--Schmidt loses orthogonality in finite precision | Use modified Gram--Schmidt or Householder reflections |
**Why the softmax shift works.** For logits $z = [1000, 1001, 1002]$:
- Naive: $e^{1000}, e^{1001}, e^{1002}$ all overflow to $+\infty$, giving NaN.
- Shifted: subtract $\max(z) = 1002$ to get $e^{-2}, e^{-1}, e^{0}$, which are all representable.
- The shift does not change the output because $\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{z_i}/e^c}{\sum_j e^{z_j}/e^c} = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
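Both stabilized forms take only a few lines of NumPy (a minimal sketch of the shifted softmax and logsumexp described above):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logsumexp(z: np.ndarray) -> float:
    """log(sum(exp(z))) computed in log-space to avoid overflow/underflow."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(softmax(z))     # finite probabilities summing to 1, no NaN
print(logsumexp(z))   # ~1002.41, where the naive log(sum(exp(z))) gives inf
```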
The building blocks of a modern datacenter GPU (NVIDIA H100):

- **Streaming Multiprocessors (SMs)**: 132 SMs, each containing CUDA cores, Tensor Cores, shared memory, and an L1 cache
- **Tensor Cores**: specialized units that compute small matrix multiplies (e.g., a $4 \times 4$ FP16 matmul) in a single clock cycle
- **Global memory (HBM3)**: 80 GB at 3.35 TB/s bandwidth, shared across all SMs
- **Shared memory / L1**: ~228 KB per SM, a programmer-managed scratchpad
- **Warp**: 32 threads that execute in lockstep (the SIMT model)
| Property | CPU (AMD EPYC 9654) | GPU (NVIDIA H100 SXM) |
|---|---|---|
| Cores | 96 | 16,896 CUDA + 528 Tensor Cores |
| Clock speed | ~3.7 GHz | ~1.8 GHz |
| Peak FP32 TFLOPS | ~5 | ~67 |
| Peak FP16/BF16 TFLOPS | ~10 | ~990 (Tensor Core) |
| Peak INT8 TOPS | ~20 | ~1,979 (Tensor Core) |
| Memory | 768 GB DDR5 | 80 GB HBM3 |
| Memory bandwidth | ~460 GB/s | ~3,350 GB/s |
| TDP | 360 W | 700 W |
**Tensor Cores** are the key to GPU performance for ML. They compute $D = A \times B + C$ for small matrices (e.g., $16 \times 16 \times 16$) in a single operation, fusing multiply and add. This is why operations that can be expressed as matrix multiplications (linear layers, attention, convolutions via im2col) are fast, while operations that cannot (layer norm, dropout, activation functions) are memory-bound.
Matrix multiplication $C = AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$ requires $2mkn$ FLOPs but only reads/writes $mk + kn + mn$ elements. The arithmetic intensity $I = \frac{2mkn}{4(mk + kn + mn)}$ (for FP32) grows with matrix dimensions, making it ideal for GPU parallelism. This is one reason why larger batch sizes and wider layers train faster per element -- they produce larger, more compute-bound matrix multiplications.
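The batch-size effect can be made concrete by evaluating this formula at a few shapes (a sketch using a hypothetical 4096-to-4096 linear layer, where $m$ is the batch size):

```python
def intensity(m: int, k: int, n: int, bytes_per_elem: int = 4) -> float:
    """I = 2mkn FLOPs over the FP32 bytes for A (m*k), B (k*n), and C (m*n)."""
    return (2 * m * k * n) / (bytes_per_elem * (m * k + k * n + m * n))

for batch in (1, 32, 1024):
    print(f"batch={batch:5d}: I = {intensity(batch, 4096, 4096):7.1f} FLOPs/byte")
```

At batch size 1 the intensity is about 0.5 FLOPs/byte (severely memory-bound: the weight matrix dominates the traffic), while at batch size 1024 it exceeds 300, well past the compute-bound threshold.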