
Memory and Stack

Every running program uses memory organized into distinct regions with different lifetimes, access speeds, and purposes. Understanding this layout -- the stack for function calls, the heap for dynamic allocations, virtual memory for isolation, and cache lines for performance -- is essential for debugging crashes in C++/CUDA extensions, understanding why PyTorch allocators exist, and writing code that cooperates with the memory hierarchy rather than fighting it.

Process Memory Layout

Every process sees a flat 64-bit virtual address space. The OS maps this to physical memory through page tables, giving each process the illusion of owning all of memory:


High addresses (0x00007FFFFFFFFFFF)
+----------------------------+
| Stack                      |  <- grows downward (toward lower addresses)
| (function frames,          |     RSP points to current top
|  local variables)          |     default limit: 8 MB per thread
+----------------------------+
|             |              |
|             v              |
|       (unmapped gap)       |
|             ^              |
|             |              |
+----------------------------+
| Memory-mapped region       |  <- mmap, shared libraries, file mappings
+----------------------------+
|             ^              |
|             |              |
| Heap                       |  <- grows upward (toward higher addresses)
| (malloc, new, PyTorch      |     managed by allocator (glibc, jemalloc, etc.)
|  CPU tensor storage)       |
+----------------------------+
| BSS (.bss)                 |  <- uninitialized globals (zero-filled)
| Data (.data)               |  <- initialized globals
| Read-only data (.rodata)   |  <- string literals, constants
| Code (.text)               |  <- executable instructions
+----------------------------+
| (unmapped)                 |  <- NULL dereference caught here (SIGSEGV)
Low addresses (0x0000000000000000)

**Where ML data lives.** PyTorch CPU tensors are allocated on the heap via the CPU allocator (`malloc`/`posix_memalign`, or `jemalloc` in some builds). GPU tensors live in GPU virtual memory (a separate address space), allocated via `cudaMalloc`. Pinned host memory (`tensor.pin_memory()` or `pin_memory=True`) uses `cudaMallocHost`, which allocates page-locked memory on the CPU side that the GPU can access directly via DMA, enabling asynchronous transfers.
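On Linux, the kernel exposes this layout directly through `/proc/self/maps`. A quick sketch (Linux-specific) prints the stack and heap regions of the current Python process, which land at the high and low ends of the diagram above:

```python
# Linux-specific: /proc/self/maps lists every mapped region of this process.
with open("/proc/self/maps") as f:
    regions = f.read().splitlines()

# Show the named regions from the diagram: the thread stack and the heap.
for line in regions:
    if "[stack]" in line or "[heap]" in line:
        print(line)
```

Each line shows the region's address range, permissions (`r`, `w`, `x`), and backing file, so you can also spot the interpreter's `.text` segment and the shared libraries in the memory-mapped region.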

Stack Frames

Each function call creates a stack frame -- a region of the stack that holds the function's local variables, saved registers, and the return address. The stack grows downward (toward lower addresses):


High addresses
+----------------------+
| argument 7+          |  <- stack-passed arguments (7th and beyond)
+----------------------+
| return address       |  <- pushed by CALL instruction (8 bytes)
+----------------------+
| saved RBP            |  <- callee pushes old frame pointer
+----------------------+  <- RBP points here (frame pointer)
| local variable 1     |
| local variable 2     |
| saved registers      |  <- callee-saved registers (RBX, R12-R15)
| alignment padding    |
+----------------------+  <- RSP points here (stack pointer)
Low addresses

; ── Function prologue ──
push rbp ; Save caller's base pointer (RSP -= 8, [RSP] = RBP)
mov rbp, rsp ; Set up our base pointer (RBP = RSP)
sub rsp, 32 ; Allocate 32 bytes for locals (RSP -= 32)

; ── Function body ──
mov [rbp-8], rdi ; Store 1st argument in local variable
mov [rbp-16], rsi ; Store 2nd argument
mov rax, [rbp-8] ; Load local variable
add rax, [rbp-16] ; Add second local

; ── Function epilogue ──
mov rsp, rbp ; Deallocate locals (RSP = RBP)
pop rbp ; Restore caller's base pointer
ret ; Pop return address into RIP, jump back
; Equivalent shorthand: "leave" = mov rsp,rbp + pop rbp

**Frame pointer omission.** With `-fomit-frame-pointer` (the default at `-O2`), the compiler does not save/restore RBP, freeing it as a general-purpose register, and accesses locals relative to RSP instead. This makes stack frames smaller, but it makes debugging harder because `gdb` cannot always walk the stack without frame pointers. Use `-fno-omit-frame-pointer` when profiling with `perf` or debugging crashes.

**Debugging segfaults in C/C++ extensions.** When a PyTorch C++ extension crashes, `gdb`'s `backtrace` command walks the chain of saved RBP pointers to reconstruct the call stack. Each frame shows the function name, arguments, and source location. Common causes of stack-related crashes:

| Crash Type | Symptom | Common Cause |
|---|---|---|
| Stack overflow | SIGSEGV in deeply nested call | Unbounded recursion, very large local arrays |
| Stack buffer overflow | Corrupted return address | Writing past array bounds on the stack |
| Use after return | Reading garbage values | Returning a pointer to a local variable |
| Misaligned stack | SIGSEGV in SSE/AVX instruction | Stack not 16-byte aligned before CALL |
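The stack-overflow row can be observed from Python without crashing the interpreter: CPython raises a catchable `RecursionError` long before the native 8 MB stack is exhausted. A minimal sketch (the `resource` query is Unix-specific):

```python
import resource
import sys

# Query the kernel's per-thread stack size limit (the "8 MB" default above).
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack soft limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes")

# CPython converts what would be a native stack overflow (SIGSEGV) into
# a RecursionError well before the real limit is reached.
def recurse(n):
    return recurse(n + 1)

try:
    recurse(0)
except RecursionError:
    print("caught RecursionError near depth", sys.getrecursionlimit())
```

In a C extension there is no such guard: the same unbounded recursion runs off the end of the mapped stack region and the first write into the guard page delivers SIGSEGV.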

Heap Allocation

The heap provides dynamic memory with programmer-controlled lifetime. Unlike the stack, heap allocations survive beyond the function that created them:


// Stack allocation: fast, automatic, limited size
int stack_array[1000];                  // on the stack (~4 KB)
// Freed automatically when the function returns

// Heap allocation: slower, manual, virtually unlimited
int* heap_array = malloc(1000 * sizeof(int));  // on the heap
// ... use heap_array ...
free(heap_array);       // must free manually (memory leak if forgotten)
heap_array = NULL;      // defensive: prevent use-after-free

// C++ RAII: automatic cleanup via destructor
std::vector<int> vec(1000); // Heap-allocated, freed when vec goes out of scope

// PyTorch tensors: heap allocation via custom allocator
// torch::empty({1000}) calls c10::Allocator::raw_allocate()
// -> CPU: posix_memalign (64-byte aligned) or jemalloc
// -> CUDA: cudaMalloc (via caching allocator to amortize allocation cost)

| Property | Stack | Heap |
|---|---|---|
| Allocation cost | ~1 ns (move RSP) | ~100-1000 ns (allocator overhead) |
| Deallocation | Automatic (function return) | Manual (`free`/`delete`) or GC |
| Size limit | ~8 MB per thread (configurable) | Limited by available RAM |
| Fragmentation | None (LIFO discipline) | Yes (can waste memory over time) |
| Cache behavior | Excellent (recently used, contiguous) | Variable (depends on allocation pattern) |
| Thread safety | Per-thread (no contention) | Shared (allocator needs synchronization) |

**PyTorch's caching allocator.** Raw `cudaMalloc`/`cudaFree` calls take 1-10 ms because they synchronize the GPU. PyTorch's CUDA caching allocator calls `cudaMalloc` once for a large block, then sub-allocates from it without GPU synchronization. This is why `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` show different values -- "reserved" is what was obtained from `cudaMalloc`, "allocated" is what tensors are actually using.
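The reserved-vs-allocated split can be modeled with a toy free-list allocator. This is an illustrative sketch, not PyTorch's actual implementation; the class name and counters are invented for the example:

```python
class CachingAllocator:
    """Toy model of a caching allocator: reuse freed blocks, avoid device calls."""

    def __init__(self):
        self.free_blocks = {}  # size -> count of cached (freed but kept) blocks
        self.reserved = 0      # bytes ever obtained from the device ("cudaMalloc")
        self.allocated = 0     # bytes currently handed out to tensors

    def malloc(self, size):
        if self.free_blocks.get(size, 0) > 0:
            self.free_blocks[size] -= 1  # reuse a cached block: no device call
        else:
            self.reserved += size        # simulate an expensive cudaMalloc
        self.allocated += size

    def free(self, size):
        self.allocated -= size           # tensor released...
        self.free_blocks[size] = self.free_blocks.get(size, 0) + 1  # ...but cached

alloc = CachingAllocator()
alloc.malloc(1024)   # first allocation hits the "device"
alloc.free(1024)     # freed into the cache, still reserved
alloc.malloc(1024)   # reused: reserved does not grow
print(alloc.reserved, alloc.allocated)  # 1024 1024
```

The second `malloc` is satisfied from the cache, which is exactly why `memory_reserved()` can stay flat while tensors churn, and why it can exceed `memory_allocated()` after tensors are freed.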

Virtual Memory

Virtual memory provides three critical properties: isolation (each process gets its own address space), abstraction (programs use flat addresses without knowing physical layout), and overcommit (programs can map more memory than physically available).

| Concept | Description | ML Implication |
|---|---|---|
| Page | 4 KB unit of virtual-to-physical mapping | Large tensors span many pages; TLB misses add latency |
| Huge pages | 2 MB or 1 GB pages | Reduce TLB misses for large allocations; `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Page fault | Access to an unmapped page triggers an OS handler | First touch on `cudaMallocManaged` memory triggers migration |
| TLB | Cache of virtual-to-physical translations | TLB misses cost ~100 cycles; huge pages reduce them |
| `mmap` | Map files or anonymous memory into the address space | Used by data loaders for memory-mapped datasets |
| Copy-on-write | Pages shared until one process writes | `fork()` in DataLoader workers shares parent memory initially |

**CPU vs GPU virtual memory.** The CPU and GPU have separate virtual address spaces. `cudaMemcpy` copies data between them via PCIe or NVLink. CUDA Unified Memory (`cudaMallocManaged`) creates pages that can be accessed by both CPU and GPU -- the CUDA runtime migrates pages on demand via page faults, which simplifies programming but adds latency from fault handling and migration.
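The `mmap` row is easy to demonstrate from Python. A sketch (using a throwaway temp file as a stand-in for a dataset) that maps a file into the address space and reads from it with plain slicing, no explicit `read()` call:

```python
import mmap
import os
import tempfile

# Write a small "dataset" file, then map it into this process's address space.
fd, path = tempfile.mkstemp()
os.write(fd, b"sample-record-0\nsample-record-1\n")
os.close(fd)

with open(path, "rb") as f, \
     mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    # Pages are faulted in lazily on first access, not at mmap() time,
    # so mapping a huge dataset is cheap until you actually touch it.
    head = m[:15]
    print(head)  # b'sample-record-0'

os.remove(path)
```

This lazy, page-granular loading is why memory-mapped datasets start instantly and share pages across DataLoader worker processes.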

Cache Lines and Alignment

The CPU does not read individual bytes from memory. It reads cache lines of 64 bytes. Every memory access brings an entire 64-byte cache line into L1 cache. This has profound implications for data layout:


// ── Aligned access: one cache line fetch ──
float array[16];        // 16 * 4 = 64 bytes = exactly one cache line
float val = array[0];   // brings all 16 floats into cache

// ── Padded struct: wastes cache bandwidth ──
struct Bad {
    char a;     // 1 byte
                // 7 bytes padding (align double to 8-byte boundary)
    double b;   // 8 bytes
    char c;     // 1 byte
                // 7 bytes padding (align struct size to 8)
};              // total: 24 bytes (only 10 bytes are data = ~42% utilization)

// ── Reordered struct: group fields by size ──
struct Good {
    double b;   // 8 bytes
    char a;     // 1 byte
    char c;     // 1 byte
                // 6 bytes padding
};              // total: 16 bytes (10 bytes data = ~63% utilization)

| Pattern | Cache Lines Touched | Effective Bandwidth |
|---|---|---|
| Sequential array access | 1 per 16 floats | 100% (every byte used) |
| Stride-2 array access | 1 per 8 useful floats | 50% (half the bytes wasted) |
| Stride-16 array access | 1 per 1 useful float | ~6% (60 of 64 bytes wasted) |
| Random access | 1 per element | ~6% (worst case) |
| Array of structs (AoS, padded) | Depends on struct size | Often 30-60% |
| Struct of arrays (SoA) | 1 per 16 floats per field | ~100% per field |
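The stride figures in the table can be checked against NumPy's view strides; a small sketch:

```python
import numpy as np

a = np.zeros(1_000_000, dtype=np.float32)

seq = a[:]         # stride of 4 bytes: every byte of each 64-byte line is used
strided = a[::16]  # stride of 64 bytes: one useful float per fetched line

print(seq.strides, strided.strides)  # (4,) (64,)

# Useful bytes per 64-byte cache line for the stride-16 pattern:
print(4 / 64)  # 0.0625 -> the ~6% row in the table
```

Note that `a[::16]` is only a view: the wasted bandwidth is paid when the strided elements are actually read, one cache line per element.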

// ── C (GCC/Clang): force alignment for SIMD ──
float data[256] __attribute__((aligned(64)));     // 64-byte aligned (cache line)
float avx_data[256] __attribute__((aligned(32))); // 32-byte aligned (AVX)

// ── C11 standard alignment ──
#include <stdalign.h>
alignas(64) float data[256];

// ── C++17: aligned allocation (size must be a multiple of the alignment;
//    release with std::free) ──
auto* ptr = std::aligned_alloc(64, 256 * sizeof(float));

# ── Python/NumPy: check alignment ──
import numpy as np
a = np.empty(256, dtype=np.float32)
print(a.ctypes.data % 64 == 0)  # often True, but NumPy only guarantees 16-byte alignment

# ── PyTorch: CPU tensors are 64-byte aligned by the default allocator ──
import torch
t = torch.empty(256)
print(t.data_ptr() % 64 == 0)   # True
**Array of Structs (AoS) vs Struct of Arrays (SoA).** This is the most important data layout decision for performance:

// AoS: poor for SIMD, poor cache utilization if only accessing one field
struct Particle { float x, y, z, mass; };
Particle particles[1000];
// Accessing all x values: particles[0].x, particles[1].x, ...
// Memory layout: x0 y0 z0 m0 x1 y1 z1 m1 ...
// Stride-4 access pattern -> 25% cache utilization

// SoA: excellent for SIMD, excellent cache utilization per field
struct Particles {
float x[1000], y[1000], z[1000], mass[1000];
};
// Accessing all x values: sequential, one cache line per 16 floats
// Memory layout: x0 x1 x2 ... x999 y0 y1 y2 ...

PyTorch tensors use SoA-style contiguous storage, which is why contiguous tensors are faster than non-contiguous views.
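The same AoS-vs-SoA trade-off can be sketched in NumPy, using a (1000, 4) array as the AoS layout and a column copy as the SoA layout:

```python
import numpy as np

# AoS stand-in: each row is one particle (x, y, z, mass), float32.
aos = np.zeros((1000, 4), dtype=np.float32)

x_view = aos[:, 0]  # every 4th float: a strided, non-contiguous view
print(x_view.flags["C_CONTIGUOUS"], x_view.strides)  # False (16,)

# SoA stand-in: copy the x column into its own contiguous array.
soa_x = np.ascontiguousarray(x_view)
print(soa_x.flags["C_CONTIGUOUS"], soa_x.strides)    # True (4,)
```

The 16-byte stride of `x_view` is the stride-4-floats pattern from the AoS comment above (25% cache utilization); `soa_x` restores the sequential, one-line-per-16-floats access. `.contiguous()` on a PyTorch tensor view performs the same kind of copy.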

**False sharing.** When two threads write to different variables that happen to share the same cache line, the CPU's cache coherence protocol forces the line to bounce between cores. This can reduce multithreaded performance by 10-100x:

// BAD: counters[0] and counters[1] share a cache line
int counters[NUM_THREADS]; // 4 bytes apart -> same cache line
// Each thread writes counters[thread_id] -> false sharing

// GOOD: pad to cache line boundaries
struct alignas(64) PaddedCounter { int value; };
PaddedCounter counters[NUM_THREADS]; // 64 bytes apart -> separate cache lines