Optimization
The difference between a naive CUDA kernel and an optimized one can be 10-100x. This chapter covers the key optimization techniques: eliminating warp divergence, tiling for data reuse, parallel reduction, asynchronous execution with streams, and profiling with NVIDIA tools. Each technique addresses a specific performance bottleneck.
Warp Divergence
All 32 threads in a warp execute the same instruction. When threads within a warp take different branches, the hardware executes both paths serially, masking off the threads that did not take each path:
```cuda
// Divergent: threads in the same warp take different branches
// (tid is the global thread index: blockIdx.x * blockDim.x + threadIdx.x)
if (threadIdx.x % 2 == 0) {
    // Even threads execute this (odd threads are idle)
    a[tid] = expensive_function(a[tid]);
} else {
    // Odd threads execute this (even threads are idle)
    a[tid] = other_function(a[tid]);
}
// Total time: cost of BOTH branches (not one)

// Better: group work by warp boundaries (32 threads per warp)
if (threadIdx.x / 32 == 0) {
    // The entire first warp takes this path (no divergence)
    a[tid] = expensive_function(a[tid]);
} else {
    // All remaining warps take this path (no divergence)
    a[tid] = other_function(a[tid]);
}
// Total time: max(cost of branch A, cost of branch B)
// Each warp executes only ONE branch
```
Tiling for Matrix Multiplication
Naive matrix multiplication has poor data reuse: each element of A and B is read from global memory N times. Tiling loads sub-matrices into shared memory so each loaded element is reused TILE times:
```cuda
#define TILE 16

// Assumes N is a multiple of TILE (add bounds checks otherwise)
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Load one tile of A and one tile of B into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Compute partial dot product from shared memory
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * N + col] = sum;
}
```
The tiled version reads each input element from global memory N/TILE times instead of N times, reducing global memory traffic by a factor of TILE.
| Version | Global Memory Reads per Output | FLOPS per Read | Arithmetic Intensity |
|---|---|---|---|
| Naive | 2N reads per output element | 1 FLOP per read | 0.25 FLOPS/byte |
| Tiled (TILE x TILE tiles) | 2N/TILE reads per output element | TILE FLOPS per read | TILE/4 FLOPS/byte |
| Tiled (TILE = 16) | N/8 reads | 16 FLOPS per read | 4 FLOPS/byte |
| Tiled (TILE = 32) | N/16 reads | 32 FLOPS per read | 8 FLOPS/byte |
For a 1024x1024 matrix with TILE = 16, global memory reads decrease by 16x compared to naive. On H100, the FP32 ridge point (the arithmetic intensity where peak compute meets peak bandwidth) is ~20 FLOPS/byte, so even this tiled FP32 GEMM is memory-bound. Tensor Cores raise peak FLOPS, and FP16 inputs halve the bytes per read; high-performance GEMMs combine them with much larger tiles and register blocking to become compute-bound.
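A minimal host-side launch for the tiled kernel above, assuming N is a multiple of TILE and that d_A, d_B, d_C are device allocations made elsewhere (the pointer names are illustrative):

```cuda
// One thread per output element: each TILE x TILE block computes one tile of C
dim3 block(TILE, TILE);          // 16 x 16 = 256 threads per block
dim3 grid(N / TILE, N / TILE);   // one block per output tile
matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();         // wait for completion (e.g. before timing)
```

The block shape must match TILE exactly, since each thread loads one element of each shared-memory tile per iteration.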
Parallel Reduction
Summing an array in parallel requires combining partial results across threads. A tree reduction in shared memory halves the number of active threads at each step:
```cuda
__global__ void reduce_sum(float* input, float* output, int n) {
    __shared__ float sdata[256];   // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // One partial sum per block; a second pass (or the host) combines them
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

// Warp-level reduction (no shared memory or __syncthreads needed)
float val = ...;
for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
// Thread 0 of each warp now has the warp's sum
```
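Each launch of the shared-memory kernel produces one partial sum per block, so the kernel is launched repeatedly until a single value remains. A sketch of that host-side driver (buffer names are illustrative; both must be device allocations large enough for the first pass's block count):

```cuda
// Reduce n floats starting in d_in; the final sum ends up in d_in[0]
void reduce_full(float* d_in, float* d_partial, int n) {
    const int threads = 256;   // must match sdata[256] in the kernel
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;  // one partial sum per block
        reduce_sum<<<blocks, threads>>>(d_in, d_partial, n);
        // The partial sums become the next pass's input
        float* tmp = d_in; d_in = d_partial; d_partial = tmp;
        n = blocks;
    }
}
```

For 1M elements this takes three passes (1M -> 4096 -> 16 -> 1), so the launch overhead of the extra passes is negligible next to the first pass.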
CUDA Streams
Streams enable concurrent execution of kernels and memory copies. Operations in different streams may overlap; operations within a single stream run in issue order:
```cuda
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Work in different streams can execute concurrently;
// within each stream, the copy completes before its kernel runs
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
kernel1<<<grid, block, 0, stream1>>>(d_a);
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);
kernel2<<<grid, block, 0, stream2>>>(d_b);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```
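One caveat not shown above: cudaMemcpyAsync only overlaps with kernel execution when the host buffers are pinned (page-locked); copies from ordinary pageable memory are staged through a pinned buffer and behave effectively synchronously. A sketch of allocating pinned buffers for the example above:

```cuda
float *h_a, *h_b;
// Allocate pinned host memory so async copies can overlap with kernels
cudaMallocHost((void**)&h_a, bytes);
cudaMallocHost((void**)&h_b, bytes);
// ... fill h_a and h_b, then issue the cudaMemcpyAsync/kernel pairs as above ...
cudaFreeHost(h_a);
cudaFreeHost(h_b);
```

Pinned memory cannot be paged out, so allocate it sparingly; large pinned allocations reduce the memory available to the rest of the system.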
Profiling
```bash
# NVIDIA Nsight Compute (kernel-level profiling)
ncu --set full ./my_program
# Key metrics:
#   - SM Utilization (target: > 80%)
#   - Memory Throughput (% of peak)
#   - Achieved Occupancy
#   - Warp Stall Reasons

# NVIDIA Nsight Systems (system-level timeline)
nsys profile -o timeline ./my_program
nsys stats timeline.nsys-rep
```
| Metric | What It Tells You | Good Value | Poor Value |
|---|---|---|---|
| SM Utilization (%) | How busy the SMs are | > 80% | < 50% (low occupancy or idle GPU) |
| Memory Bandwidth (%) | Fraction of peak HBM bandwidth used | > 60% (memory-bound) | < 20% (not enough data flow) |
| Achieved Occupancy | Active warps / max warps per SM | > 50% | < 25% (too many registers or shared memory) |
| L2 Hit Rate | Cache effectiveness | > 50% | < 20% (poor data reuse) |
| Warp Stall: Memory | Stalled waiting for memory | < 30% | > 50% (memory-bound; need fusion or tiling) |
| Warp Stall: Sync | Stalled at __syncthreads | < 10% | > 20% (load imbalance across threads at barriers) |
| Inst Executed per Cycle | Instruction throughput | > 1.0 | < 0.5 (underutilized execution units) |
Optimization Checklist
| Priority | Optimization | Bottleneck Addressed | Expected Speedup |
|---|---|---|---|
| 1 | Coalesced memory access | Memory bandwidth waste | 2-10x |
| 2 | Use shared memory for data reuse | Global memory latency | 2-5x |
| 3 | Eliminate warp divergence | Execution efficiency | 1.5-2x |
| 4 | Avoid bank conflicts | Shared memory throughput | 1.2-2x |
| 5 | Use warp-level primitives (__shfl) | Reduction/communication overhead | 1.2-1.5x |
| 6 | Tune occupancy (registers vs threads) | Latency hiding | 1.1-1.5x |
| 7 | Use Tensor Cores (for GEMM) | Compute throughput | 8-16x |
| 8 | Async memory copy (streams) | Pipeline utilization | 1.5-2x |
| 9 | Launch configuration tuning | SM utilization | 1.1-1.3x |