CPU and GPU Architecture
CPUs and GPUs solve fundamentally different problems. A CPU is optimized to execute a single thread of instructions as fast as possible -- minimizing the latency of each operation. A GPU is optimized to execute millions of threads simultaneously -- maximizing the throughput of parallel operations. Understanding this distinction is essential for writing efficient ML code, because it determines which operations are fast, which are slow, and why.
CPU: Optimized for Latency
A CPU core is a marvel of engineering complexity, designed to make one thread run as fast as possible:
- Out-of-order execution -- reorders instructions to keep all execution units busy, even when some instructions are waiting for data
- Branch prediction -- guesses which way branches go to avoid pipeline stalls (>95% accuracy on typical code)
- Speculative execution -- executes instructions past predicted branches before the branch is resolved
- Large caches -- L1/L2/L3 hierarchy absorbs most memory accesses, reducing effective latency
- Few powerful cores -- 8-128 cores, each running 1-2 hardware threads (hyperthreading)
CPU Core (x 64-128 cores per chip):
+------------------------------------------+
| Branch Predictor | Instruction Fetch |
| Decode (4-6 wide) | Rename/Allocate |
|------------------------------------------|
| Execution Units: |
| [ALU] [ALU] [ALU] [AGU] [FPU] [SIMD] |
| Out-of-Order Scheduler (up to 256 ops) |
|------------------------------------------|
| L1 Data Cache: 48-64 KB (~1 ns) |
| L1 Inst Cache: 32-64 KB |
| L2 Cache: 256 KB-2 MB (~4 ns) |
+------------------------------------------+
|
L3 Cache: 32-64 MB shared (~12 ns)
|
DRAM: 256-2048 GB (~100 ns, ~50 GB/s per channel)
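The payoff of this hierarchy can be summarized with the classic average memory access time (AMAT) model. The latencies below are the approximate figures from the diagram; the hit rates are illustrative assumptions, not measured values:

```python
# Average memory access time (AMAT) for the cache hierarchy above.
# Latencies (ns) come from the diagram; hit rates are assumed for
# illustration -- real workloads vary widely.

def amat(l1_hit=0.95, l2_hit=0.80, l3_hit=0.75,
         l1_ns=1.0, l2_ns=4.0, l3_ns=12.0, dram_ns=100.0):
    """Expected latency of one memory access, walking down the hierarchy."""
    miss1 = 1.0 - l1_hit            # fraction that misses L1
    miss2 = miss1 * (1.0 - l2_hit)  # fraction that also misses L2
    miss3 = miss2 * (1.0 - l3_hit)  # fraction that goes all the way to DRAM
    return (l1_hit * l1_ns
            + miss1 * l2_hit * l2_ns
            + miss2 * l3_hit * l3_ns
            + miss3 * dram_ns)

print(f"effective latency: {amat():.2f} ns")  # ~1.45 ns, far below 100 ns DRAM
```

With these assumed hit rates, the caches shrink the effective latency of a ~100 ns DRAM access to under 2 ns, which is exactly why CPUs spend so much die area on them.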
Notice how much of the CPU die is control logic (prediction, scheduling, reordering) and cache rather than arithmetic. On a GPU, these proportions are inverted: most of the die is computation (CUDA cores, Tensor Cores), with minimal caching and no branch prediction or out-of-order logic. This is why a GPU die can fit thousands of simple cores where a CPU die fits dozens of complex ones.
GPU: Optimized for Throughput
A GPU trades single-thread performance for massive parallelism. The key design principle is latency hiding through parallelism: instead of making each thread fast (CPU approach), the GPU runs so many threads that it always has work to do while some threads are waiting for memory.
- Simple cores -- no branch prediction, no out-of-order execution, minimal caches
- Massive parallelism -- thousands of threads executing simultaneously, with tens of thousands more ready to be scheduled
- Wide SIMT execution -- 32 threads (a warp) execute the same instruction in lockstep
- High bandwidth memory -- HBM provides 2-3+ TB/s vs ~50 GB/s for CPU DRAM (a 40-60x advantage)
- Hardware thread scheduling -- the GPU switches between warps every cycle at zero cost (no context switch overhead)
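The latency-hiding principle above can be put in numbers with a Little's-law sketch (an illustrative model, not a hardware simulator): if a memory access takes `latency` cycles and each warp has `compute_cycles` of independent work per access, the scheduler needs roughly their ratio in resident warps to keep the execution units busy.

```python
# Back-of-envelope latency hiding: how many warps must be resident so
# that while some warps wait ~latency cycles on memory, others always
# have compute to issue? (Illustrative model, not a simulator.)

def warps_to_hide_latency(latency_cycles, compute_cycles_per_access):
    # Ceiling division: latency / compute work available per warp.
    return -(-latency_cycles // compute_cycles_per_access)

# ~300-cycle HBM latency, 10 cycles of independent math per load:
print(warps_to_hide_latency(300, 10))  # 30 warps
```

This is why GPUs keep tens of thousands of threads in flight: a kernel doing little math per byte needs many warps resident just to cover memory latency.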
GPU Streaming Multiprocessor (SM):
+------------------------------------------------------+
| Warp Scheduler 0 | Warp Scheduler 1 |
| Dispatch Unit | Dispatch Unit |
|------------------------------------------------------|
| 32 FP32 Cores | 32 FP32 Cores | 16 FP64 Cores |
| 16 INT32 Cores| 16 INT32 Cores |
|------------------------------------------------------|
| 4th-gen Tensor Cores (FP8/FP16/BF16/TF32/INT8) |
| each: 256 FP16 FMAs/cycle or 512 FP8 FMAs/cycle |
|------------------------------------------------------|
| Shared Memory / L1 Cache: 228 KB (configurable) |
| Register File: 256 KB (65536 x 32-bit registers) |
| L0 Instruction Cache |
+------------------------------------------------------+
x 132 SMs on H100 SXM
= 16896 FP32 cores total
~1979 TFLOPS FP8, ~990 TFLOPS FP16 (dense; ~2x with structured sparsity)
This means GPU performance depends critically on occupancy -- the ratio of active warps to maximum warps per SM. Low occupancy (too few threads, too many registers per thread) means the GPU cannot hide memory latency, and performance drops.
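Occupancy can be estimated with a few lines of arithmetic. This sketch assumes the H100 per-SM limits quoted above (65,536 32-bit registers, 228 KB shared memory) plus a 64-warp residency cap, and ignores allocation-granularity details that the real occupancy calculator accounts for:

```python
# Simplified occupancy calculator for an H100-class SM.
# Assumed limits: 65,536 registers/SM, 228 KB shared memory/SM,
# 64 resident warps/SM. Real hardware rounds register allocations
# to a granularity this sketch ignores.

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              regs_per_sm=65536, smem_per_sm=228 * 1024, max_warps=64):
    warps_per_block = threads_per_block // 32
    # Resident blocks are capped by the tightest resource limit:
    by_regs = regs_per_sm // (threads_per_block * regs_per_thread)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else float("inf")
    by_warps = max_warps // warps_per_block
    blocks = min(by_regs, by_smem, by_warps)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 32 registers/thread, 16 KB shared memory: full occupancy
print(occupancy(256, 32, 16 * 1024))   # 1.0
# Register-heavy kernel (128 regs/thread): occupancy collapses
print(occupancy(256, 128, 0))          # 0.25
```

The second call shows the failure mode described above: quadrupling registers per thread cuts resident warps to a quarter, leaving too few warps to hide memory latency.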
CPU vs GPU: A Quantitative Comparison
| Property | CPU (AMD EPYC 9654) | GPU (NVIDIA H100 SXM) | Ratio |
|---|---|---|---|
| Cores | 96 | 16,896 FP32 | 176x |
| Clock speed | 2.4 GHz (base) | 1.83 GHz | 0.76x |
| FP32 throughput | ~4.6 TFLOPS | ~67 TFLOPS | 15x |
| FP16 throughput | ~9.2 TFLOPS (AVX-512) | ~990 TFLOPS (Tensor Cores) | 108x |
| Memory bandwidth | ~460 GB/s (12ch DDR5) | ~3350 GB/s (HBM3) | 7x |
| Memory capacity | 768 GB (max) | 80 GB | 0.1x |
| Memory latency | ~80 ns | ~300 ns (but hidden) | 3.75x |
| Power | 360W | 700W | 2x |
| FLOPS/watt (FP16) | ~26 GFLOPS/W | ~1414 GFLOPS/W | 54x |
Thread Hierarchy (CUDA)
CUDA organizes computation in a hierarchy that maps to the GPU hardware:
Grid (entire kernel launch)
|
+-- Block (0,0) Block (1,0) Block (2,0)
| | | |
| +-- Warp 0 +-- Warp 0 ...
| | Thread 0-31 | Thread 0-31
| +-- Warp 1 +-- Warp 1
| | Thread 32-63 | Thread 32-63
| +-- Warp 2 +-- Warp 2
| | Thread 64-95 | Thread 64-95
| ... ...
|
+-- Block (0,1) Block (1,1) ...
...
Hardware mapping:
Block --> assigned to one SM (cannot span SMs)
Warp --> scheduled by warp scheduler
Thread --> executed on a CUDA core
| Level | Size | Shared Resources | Hardware Mapping |
|---|---|---|---|
| Thread | 1 | Registers (up to 255 x 32-bit per thread) | One CUDA core |
| Warp | 32 threads | Lock-step execution (SIMT) | One warp scheduler |
| Block | Up to 1024 threads (32 warps) | Shared memory (up to 228 KB), __syncthreads() | One SM |
| Grid | Up to 2^31 - 1 blocks (x-dimension) | Global memory (80 GB HBM) | Entire GPU |
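The hierarchy determines how each thread finds its data. A minimal Python model of the standard 1-D CUDA indexing expression (`blockIdx.x * blockDim.x + threadIdx.x`) and the warp a thread belongs to:

```python
# Python model of 1-D CUDA thread indexing. In a real kernel these
# values come from the built-ins blockIdx, blockDim, threadIdx.

def global_thread_id(block_idx, block_dim, thread_idx):
    """global_id = blockIdx.x * blockDim.x + threadIdx.x"""
    return block_idx * block_dim + thread_idx

def warp_id(thread_idx):
    """Which warp (within its block) a thread belongs to."""
    return thread_idx // 32

# Thread 5 of block 3, with 256 threads per block:
print(global_thread_id(3, 256, 5))  # 773
print(warp_id(65))                  # thread 65 is in warp 2 of its block
```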
| Block Size | Threads | Warps | Notes |
|---|---|---|---|
| 32 | 32 | 1 | Minimum useful; wastes scheduler capacity |
| 128 | 128 | 4 | Good for register-heavy kernels |
| 256 | 256 | 8 | Common default; good balance |
| 512 | 512 | 16 | Good for shared-memory-heavy kernels |
| 1024 | 1024 | 32 | Maximum; limits registers per thread to ~64 |
Rules of thumb:
- Always use a multiple of 32 (warp size). Non-multiples waste execution slots.
- Use at least 128 threads per block to give the warp scheduler enough work.
- Use 256 as the default unless you have a reason to change it.
- Reduce block size if your kernel uses many registers or much shared memory (to increase occupancy by fitting more blocks per SM).
Because a warp executes in lockstep, a data-dependent branch that splits threads within a warp forces the hardware to run both paths serially (warp divergence):

// BAD: threads in the same warp diverge
if (threadIdx.x % 2 == 0) {
    do_something();       // half the warp active
} else {
    do_other_thing();     // other half active
}
// Total time: time(do_something) + time(do_other_thing)

// BETTER: diverge at warp boundaries
if ((threadIdx.x / 32) % 2 == 0) {
    do_something();       // entire warp active
} else {
    do_other_thing();     // entire warp active
}
// Total time: max(time(do_something), time(do_other_thing))
In ML kernels, warp divergence typically arises from boundary checks (e.g., last tile in a matrix) or data-dependent operations (e.g., sparse attention). Minimizing these branches is a key optimization.
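The cost asymmetry between the two patterns above can be captured in a tiny model (an illustrative sketch; the cycle counts are hypothetical):

```python
# Cost model for warp divergence. A diverged warp serializes its
# branch paths, paying their SUM; a warp-aligned branch pays only
# the path that warp takes (warps run concurrently, so overall
# kernel time is roughly the max across warps).

def diverged_warp_cycles(branch_cycles):
    return sum(branch_cycles)       # both paths execute, masked

def aligned_warp_cycles(branch_cycles, taken):
    return branch_cycles[taken]     # one path per warp

paths = [100, 60]  # hypothetical cycle costs of the two branches
print(diverged_warp_cycles(paths))    # 160: sum of both paths
print(aligned_warp_cycles(paths, 0))  # 100: just the taken path
```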
GPU Memory Architecture
+------------------------------------------------------------------+
| Global Memory (HBM3) |
| 80 GB, ~3350 GB/s bandwidth |
| ~300 cycle latency |
+------------------------------------------------------------------+
| | |
v v v
+--L2 Cache: 50 MB, ~12 TB/s bandwidth, ~30 cycles-----------------+
| | |
+---------------+ +---------------+ +---------------+
| SM 0 | | SM 1 | | SM 2 |
| Shared Memory | | Shared Memory | | Shared Memory |
| up to 228 KB | | up to 228 KB | | up to 228 KB |
| ~20-30 cycles | | ~20-30 cycles | | ~20-30 cycles |
| | | | | |
| Register File | | Register File | | Register File |
| 256 KB | | 256 KB | | 256 KB |
| ~1 cycle | | ~1 cycle | | ~1 cycle |
+---------------+ +---------------+ +---------------+
| Memory | Scope | Latency | Size | Bandwidth | ML Use |
|---|---|---|---|---|---|
| Registers | Thread | ~1 cycle | 256 KB/SM | ~20 TB/s | Intermediate values, loop variables |
| Shared Memory | Block | ~20-30 cycles | Up to 228 KB/SM | ~12 TB/s | Tile buffers for GEMM, reduction intermediates |
| L1 Cache | SM | ~30 cycles | Combined with shared | Automatic | Transparent caching of global reads |
| L2 Cache | Device | ~30 cycles | 50 MB (H100) | ~12 TB/s | Automatic caching layer |
| Global (HBM) | Device | ~300 cycles | 80 GB | ~3350 GB/s | Tensors, weights, activations |
| Constant Memory | Device | ~5 cycles (cached) | 64 KB | High (if cached) | Hyperparameters, lookup tables |
Typical ML kernels exploit this hierarchy as follows:
- Tiled GEMM: Load tiles of A and B into shared memory, compute the partial product, and accumulate across tiles. Each element is loaded once from global memory but reused T times (the tile dimension) in computation.
- FlashAttention: Process Q, K, V in tiles that fit in shared memory, computing softmax statistics incrementally.
- Reduction kernels: First reduce within shared memory (fast), then across blocks via global memory (slow but infrequent).
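The reuse claim in the tiled-GEMM bullet follows from a few lines of arithmetic (a sketch; T = 64 is an assumed tile edge, not a prescribed value):

```python
# Data reuse in tiled GEMM. Multiplying a T x T tile of A by a
# T x T tile of B performs 2*T^3 FLOPs (one multiply-add per
# inner-product term) while loading 2*T^2 elements from global
# memory, so each loaded element is reused about T times.

def gemm_tile_reuse(tile):
    flops = 2 * tile ** 3             # multiply-adds for the tile product
    elements_loaded = 2 * tile ** 2   # one tile of A + one tile of B
    return flops / elements_loaded    # FLOPs per loaded element = tile

print(gemm_tile_reuse(64))   # 64.0: each element reused ~64 times
```

Larger tiles mean more reuse per global load, which is why GEMM kernels make tiles as large as shared memory and registers allow.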
The shared memory vs. L1 cache split is configurable on most GPUs (e.g., cudaFuncSetAttribute(..., cudaFuncAttributePreferredSharedMemoryCarveout, 100) allocates all shared/L1 capacity to shared memory).
The Roofline Model
A kernel's arithmetic intensity (AI) is the number of FLOPs it performs per byte moved to or from memory. The maximum attainable performance is:

    Attainable FLOPS = min(Peak compute, AI x Memory bandwidth)
Operations with low AI are memory-bound (limited by how fast data can be loaded). Operations with high AI are compute-bound (limited by how fast the hardware can compute).
Performance (TFLOPS)
|
990 | ____________________________ Peak FP16 Compute (H100)
| /
| /
67 | /____________________________ Peak FP32 Compute
| /
| /
| / Memory-bound Compute-bound
| / regime regime
| /
| / Slope = Memory Bandwidth (3350 GB/s)
| /
| /
| /
+--------+------+--+--------------------->
1 ~20 ~295 Arithmetic Intensity (FLOPS/byte)
Ridge points (FP32, FP16)
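The roofline in code, using the H100 figures from the diagram above; the ridge points fall out of the same min() directly:

```python
# Roofline model: attainable throughput is the lesser of peak compute
# and AI x memory bandwidth. Constants are the H100 figures from the
# diagram above.

PEAK_FP32 = 67e12    # FLOPS (CUDA cores)
PEAK_FP16 = 990e12   # FLOPS (Tensor Cores)
HBM_BW = 3350e9      # bytes/s

def attainable_flops(ai, peak):
    """Roofline: min of the compute roof and the memory slope."""
    return min(peak, ai * HBM_BW)

# Ridge points: the AI where the memory slope meets each compute roof.
print(round(PEAK_FP32 / HBM_BW))   # ~20 FLOPS/byte (FP32)
print(round(PEAK_FP16 / HBM_BW))   # ~296 FLOPS/byte (FP16)

# An element-wise op at AI = 0.25 is deep in the memory-bound regime:
print(attainable_flops(0.25, PEAK_FP32))  # ~0.84 TFLOPS, nowhere near 67
```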
| Operation | AI (FLOPS/byte) | Bound | Why |
|---|---|---|---|
| Element-wise (ReLU, GELU) | 0.25 | Memory | 1 FLOP per 4 bytes loaded |
| Softmax | ~0.5 | Memory | Few FLOPs per element, multiple passes |
| Layer normalization | ~1 | Memory | Mean + variance + normalize |
| Reduction (sum, mean) | 0.25 | Memory | 1 add per 4 bytes |
| Batch norm | ~2 | Memory | Statistics + normalize |
| Attention (QK^T, softmax, PV) | ~d (head dim) | Depends on d | Compute grows with head dim |
| GEMM (N x N) | ~N/6 (FP32) | Compute (usually) | High reuse of loaded data |
| Conv2D | 10-100+ | Compute | High reuse via im2col or direct |
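Two rows of the table worked out explicitly (FP32; the element-wise row counts loaded bytes only, matching the table's convention):

```python
# Arithmetic intensity for two rows of the table above (FP32).

def elementwise_ai():
    # 1 FLOP per 4-byte element loaded (stores not counted, as in the table).
    return 1 / 4

def gemm_ai(n):
    # C = A @ B with N x N matrices: 2*N^3 FLOPs, and 3*N^2 elements
    # moved (read A, read B, write C) at 4 bytes each -> AI = N/6.
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * 4
    return flops / bytes_moved

print(elementwise_ai())   # 0.25 -> memory-bound
print(gemm_ai(4096))      # ~683 -> far past the ridge, compute-bound
```

This is why a large GEMM saturates the compute roof while element-wise ops never leave the memory slope, no matter how they are implemented.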
If your operation is compute-bound (e.g., a large GEMM), the fix is to use Tensor Cores: FP16/BF16 gives a ~15x speedup over FP32 CUDA cores, and FP8 ~30x. If it is memory-bound, more FLOPS will not help; the fix is to move fewer bytes, via kernel fusion or lower-precision storage.
Tensor Cores
| Generation | GPU | Tile Size | Precisions | Peak TFLOPS |
|---|---|---|---|---|
| 1st gen | V100 | 4x4x4 | FP16 | 125 |
| 3rd gen | A100 (Ampere) | 8x4x8 | FP16, BF16, TF32, INT8, FP64 | 312 (FP16) |
| 4th gen | H100 (Hopper) | 16x8x16 | FP16, BF16, TF32, FP8, INT8 | 990 (FP16), 1979 (FP8) |
| 5th gen | B200 (Blackwell) | 16x16x16 | FP16, BF16, TF32, FP8, FP4 | 2250 (FP16), 4500 (FP8) |
In PyTorch, you enable Tensor Cores by:
# Use FP16 or BF16 tensors
A = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
B = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
C = torch.matmul(A, B) # Uses Tensor Cores automatically
# Or use torch.autocast for mixed precision
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(input)
If you are using FP32 tensors on H100, you are leaving 15-30x performance on the table. Always use mixed precision unless you have a specific numerical reason to use FP32.