CUDA and GPU Programming
This chapter covers GPU programming from first principles: writing CUDA kernels, understanding the memory model, optimization techniques, and higher-level tools like Triton.
- Your First Kernel -- Hello world, thread indexing, vector addition, compiling with nvcc
- Memory Model -- Global, shared, and register memory, coalescing, bank conflicts, occupancy
- Optimization -- Warp divergence, tiling, reduction, streams, profiling
- Triton -- OpenAI Triton, block-level programming, fused softmax, matmul kernel
- Custom PyTorch Ops -- C++ extensions, CUDA kernels from Python, torch.library
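As a taste of where the first section starts, here is a minimal sketch of the vector-addition kernel it builds toward: one thread per element, a bounds check for the ragged last block, and a launch configuration rounded up to cover all elements. Unified memory (`cudaMallocManaged`) is used here purely to keep the sketch short; the chapter also covers explicit host/device copies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element: global index = block offset + thread offset.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may run past n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    // Unified memory is accessible from both host and device.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                   // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compiled with `nvcc vec_add.cu -o vec_add`, this exercises the thread-indexing pattern that recurs throughout the chapter.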