CUDA and GPU Programming
This chapter covers GPU programming from first principles: writing CUDA kernels, understanding the memory model, applying optimization techniques, using higher-level tools such as Triton, and integrating custom ops into PyTorch.
- Your First Kernel: Hello world, thread indexing, vector addition, compiling with nvcc
- Memory Model: Global, shared, and register memory; coalescing; bank conflicts; occupancy
- Optimization: Warp divergence, tiling, reduction, streams, profiling
- Triton: OpenAI Triton, block-level programming, fused softmax, matmul kernel
- Custom PyTorch Ops: C++ extensions, CUDA kernels from Python, torch.library