Memory Model

Understanding the GPU memory hierarchy is the single most important skill for writing fast CUDA kernels. The difference between a naive kernel and an optimized one is often 10-100x, and nearly all of that difference comes from how data is accessed. This chapter covers the memory hierarchy, coalescing rules, shared memory, bank conflicts, and occupancy.

Memory Hierarchy

| Memory | Scope | Size (H100) | Latency | Bandwidth | Read/Write |
|---|---|---|---|---|---|
| Registers | Per thread | ~255 x 32-bit per thread | ~1 cycle | ~20 TB/s | R/W |
| Shared memory | Per block | Up to 228 KB per SM | ~20-30 cycles | ~12 TB/s | R/W |
| L1 cache | Per SM | Combined with shared | ~30 cycles | Automatic | Read (mostly) |
| L2 cache | Per device | 50 MB | ~30-100 cycles | ~12 TB/s | R/W |
| Global memory (HBM) | Per device | 80 GB | ~300 cycles | ~3350 GB/s | R/W |
| Constant memory | Per device | 64 KB | ~5 cycles (cached) | High | Read only |
| Texture memory | Per device | Same as global | ~5 cycles (cached) | High (cached) | Read only |
**The optimization strategy is always the same:** move data from slow memory to fast memory, use it as many times as possible, then discard it. Specifically:
  1. Global -> Shared: Load a tile of data into shared memory. Cost: ~300 cycles per element.
  2. Shared -> Registers: Each thread loads its portion. Cost: ~20 cycles per element.
  3. Registers -> Compute: Perform operations. Cost: ~1 cycle per operation.
  4. Registers -> Shared -> Global: Write results back out.

The arithmetic intensity of your kernel determines which level matters most. For memory-bound kernels, minimizing global memory accesses is paramount. For compute-bound kernels (GEMM), maximizing register reuse is key.

Global Memory and Coalescing

Global memory (HBM) is the main GPU memory. Access happens in 128-byte transactions (or 32-byte L2 sectors). When 32 threads in a warp access consecutive addresses, the hardware coalesces them into a minimal number of transactions:


// COALESCED: threads access consecutive 4-byte elements (FAST)
// Thread 0 reads a[0], thread 1 reads a[1], ..., thread 31 reads a[31]
// -> One 128-byte transaction
float val = a[threadIdx.x + blockIdx.x * blockDim.x];

// STRIDED: threads access every N-th element (SLOW)
// Thread 0 reads a[0], thread 1 reads a[stride], thread 2 reads a[2*stride]...
// -> Up to 32 separate 128-byte transactions (one per thread)
float val = a[(threadIdx.x + blockIdx.x * blockDim.x) * stride];

// RANDOM: threads access arbitrary addresses (WORST)
// Each thread may trigger its own transaction
float val = a[indices[threadIdx.x]];

| Pattern | Example | Transactions per Warp (32 floats) | Efficiency |
|---|---|---|---|
| Coalesced (stride 1) | a[tid] | 1 (128 bytes) | 100% |
| Stride 2 | a[tid * 2] | 2 | 50% |
| Stride 32 | a[tid * 32] | 32 | 3% |
| Random | a[random[tid]] | Up to 32 | ~3-10% |
| Broadcast (all same) | a[0] | 1 | 100% (but no parallelism) |
**Tip:** Always access global memory in a coalesced pattern. For 2D data stored in row-major order, map `threadIdx.x` to the column index (the fastest-varying dimension), so adjacent threads read adjacent memory.
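A sketch of the row-major tip in practice (hypothetical kernel name): `threadIdx.x` indexes the column, so each warp touches one contiguous row segment.

```cuda
// Row-major 2D array: element (row, col) lives at data[row * width + col].
// Mapping threadIdx.x to the column makes adjacent threads access
// adjacent addresses -> coalesced transactions.
__global__ void scale_2d(float* data, int width, int height, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        data[row * width + col] *= s;  // warp covers consecutive columns
    }
}
// Swapping the roles (threadIdx.x -> row) would turn every warp access
// into a stride-`width` pattern: up to 32 transactions instead of 1.
```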

Shared Memory

Shared memory is fast, on-chip memory accessible by all threads in a block:


__global__ void shmem_example(float* input, float* output, int n) {
    __shared__ float tile[256]; // Shared by all threads in this block

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Load from global to shared
    if (tid < n) {
        tile[threadIdx.x] = input[tid];
    }
    __syncthreads(); // Barrier: wait for all threads to finish loading

    // Now all threads can read any element in the tile
    if (tid < n && threadIdx.x > 0) {
        output[tid] = tile[threadIdx.x] + tile[threadIdx.x - 1];
    }
}
**Warning:** Always call `__syncthreads()` after writing to shared memory and before reading. Without it, some threads may read stale data.

Bank Conflicts

Shared memory is divided into 32 banks. Consecutive 4-byte words go to consecutive banks. If two threads in a warp access the same bank (but different addresses), the accesses are serialized:


// No bank conflict: consecutive access
tile[threadIdx.x] // Thread k accesses bank k % 32

// Bank conflict: stride-2 access
tile[threadIdx.x * 2] // Thread 0 and 16 both access bank 0

// No conflict: broadcast (same address)
tile[0] // All threads read the same address (broadcast)
**Padding to avoid bank conflicts.** The classic fix: add one extra element per row:

// CONFLICT: 32 threads accessing column 0 of a 32-wide array
// Thread k accesses tile[k][0] -> bank = (k * 32) % 32 = 0 (all same bank!)
__shared__ float tile[32][32];

// NO CONFLICT: padding shifts each row by 1 bank
// Thread k accesses tile[k][0] -> bank = (k * 33) % 32 = k (all different banks!)
__shared__ float tile[32][33]; // Extra column for padding

This wastes 32 * 4 = 128 bytes of shared memory but eliminates all bank conflicts for column access patterns.
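The canonical place this padding matters is a shared-memory matrix transpose: threads write the tile by rows and read it back by columns, and the extra column keeps the column reads conflict-free. A sketch, assuming a square `n x n` matrix that is a multiple of the tile size:

```cuda
#define TILE 32

__global__ void transpose(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column: no bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced row write
    __syncthreads();

    // Swap the block coordinates, then read the tile transposed.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read,
        // conflict-free thanks to the padding
}
```

Both the global read and the global write stay coalesced; the transposition itself happens entirely inside shared memory.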

Registers

Registers are the fastest memory but also the most limited. Each thread can use up to 255 32-bit registers on modern GPUs. Using more registers per thread reduces the number of threads that can be active on an SM (occupancy), which reduces the GPU's ability to hide memory latency.


// Few registers: high occupancy, good for memory-bound kernels
__global__ void low_register_kernel(float* a, float* b, float* c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) c[tid] = a[tid] + b[tid];
    // Uses ~8 registers per thread -> high occupancy
}

// Many registers: low occupancy, good for compute-bound kernels
__global__ void high_register_kernel(float* A, float* B, float* C, int N) {
    // Each thread computes a 4x4 tile, accumulating in 16 register variables
    float acc[4][4] = {0}; // 16 float registers just for accumulation
    float a_frag[4];       // 4 registers for A fragment
    float b_frag[4];       // 4 registers for B fragment
    // ... tiling logic ...
    // Uses ~40+ registers per thread -> lower occupancy, but higher compute throughput
}

// Control register usage with launch bounds
__global__ void __launch_bounds__(256, 2) // 256 threads/block, min 2 blocks/SM
my_kernel(...) {
    // Compiler will limit registers to fit 2 blocks of 256 threads per SM
}

Occupancy

Occupancy is the ratio of active warps to the maximum warps per SM. Higher occupancy helps hide memory latency:

\text{Occupancy} = \frac{\text{Active warps per SM}}{\text{Max warps per SM}}

Occupancy is limited by three resources:

| Resource | Limit per SM (H100) | How It Limits Occupancy | Diagnosis |
|---|---|---|---|
| Threads | 2048 (64 warps) | More threads/block = fewer blocks fit | `ncu --set full` reports achieved occupancy |
| Registers | 65,536 x 32-bit | More registers/thread = fewer threads | `ncu` reports register usage per thread |
| Shared memory | 228 KB | More shared/block = fewer blocks fit | `ncu` reports shared memory per block |
**Occupancy vs. efficiency tradeoff.** Higher occupancy helps the GPU hide memory latency by having more warps to switch between. But lower occupancy can be faster if each thread does more useful work per memory access:

| Approach | Occupancy | Per-Thread Efficiency | Best For |
|---|---|---|---|
| Many threads, few registers | High (75-100%) | Low (simple computation) | Memory-bound kernels (elementwise, reduction) |
| Fewer threads, many registers | Low (25-50%) | High (complex computation) | Compute-bound kernels (tiled GEMM) |
| Balance | Medium (50-75%) | Medium | Most practical kernels |

Use Nsight Compute's occupancy calculator to find the sweet spot. Often, 50% occupancy with good register usage outperforms 100% occupancy with register spilling.
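You can also query the runtime directly instead of computing by hand. A sketch using the CUDA occupancy API (the trivial `my_kernel` is a placeholder; any `__global__` function works):

```cuda
#include <cstdio>

__global__ void my_kernel(float* a, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) a[tid] *= 2.0f;
}

int main() {
    int blockSize = 256, numBlocks = 0, device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Max resident blocks of my_kernel per SM at this block size,
    // given its actual register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, my_kernel,
                                                  blockSize,
                                                  /*dynamicSmemBytes=*/0);

    double occ = double(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * occ);
    return 0;
}
```

This reports the theoretical bound; the achieved occupancy `ncu` measures can be lower if the launch grid is too small to fill every SM.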