CUDA and GPU Programming

This chapter covers GPU programming from first principles: writing CUDA kernels, understanding the memory model, applying optimization techniques, using higher-level tools such as Triton, and integrating custom ops into PyTorch.

Your First Kernel: Hello world, thread indexing, vector addition, compiling with nvcc
Memory Model: Global, shared, and register memory; coalescing; bank conflicts; occupancy
Optimization: Warp divergence, tiling, reduction, streams, profiling
Triton: OpenAI Triton, block-level programming, fused softmax, matmul kernel
Custom PyTorch Ops: C++ extensions, CUDA kernels from Python, torch.library