CUDA and GPU Programming

This chapter covers GPU programming from first principles: writing CUDA kernels, understanding the memory model, applying optimization techniques, using higher-level tools like Triton, and integrating custom ops into PyTorch.

  • Your First Kernel: Hello world, thread indexing, vector addition, compiling with nvcc
  • Memory Model: Global, shared, and register memory, coalescing, bank conflicts, occupancy
  • Optimization: Warp divergence, tiling, reduction, streams, profiling
  • Triton: OpenAI Triton, block-level programming, fused softmax, matmul kernel
  • Custom PyTorch Ops: C++ extensions, CUDA kernels from Python, torch.library