Memory and Stack
Every running program uses memory organized into distinct regions with different lifetimes, access speeds, and purposes. Understanding this layout -- the stack for function calls, the heap for dynamic allocations, virtual memory for isolation, and cache lines for performance -- is essential for debugging crashes in C++/CUDA extensions, understanding why PyTorch allocators exist, and writing code that cooperates with the memory hierarchy rather than fighting it.
Process Memory Layout
Every process sees a flat virtual address space (on x86-64, user space typically spans the lower 128 TB of the 48-bit canonical address range). The OS maps this to physical memory through page tables, giving each process the illusion of owning all of memory:
High addresses (0x00007FFFFFFFFFFF)
+---------------------------+
| Stack | <- grows downward (toward lower addresses)
| (function frames, | RSP points to current top
| local variables) | Default limit: 8 MB per thread
+---------------------------+
| | |
| v |
| (unmapped gap) |
| ^ |
| | |
+---------------------------+
| Memory-mapped region | <- mmap, shared libraries, file mappings
+---------------------------+
| ^ |
| | |
| Heap | <- grows upward (toward higher addresses)
| (malloc, new, PyTorch | managed by allocator (glibc, jemalloc, etc.)
| CPU tensor storage) |
+---------------------------+
| BSS (.bss) | <- uninitialized globals (zero-filled)
| Data (.data) | <- initialized globals
| Read-only data (.rodata) | <- string literals, constants
| Code (.text) | <- executable instructions
+---------------------------+
| (unmapped) | <- NULL dereference caught here (SIGSEGV)
Low addresses (0x0000000000000000)
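You can inspect this layout in a live process. The sketch below (Python, Linux-specific: `/proc/self/maps` lists every mapped region with its address range and a label such as `[stack]` or `[heap]`) also checks the default stack limit mentioned in the diagram:

```python
import resource

# Default stack limit for this process; commonly 8 MB (8388608 bytes) on Linux
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit:", soft)

# Linux-only: each line of /proc/self/maps is one mapped region;
# the last field labels special regions like [stack], [heap], [vdso]
try:
    with open("/proc/self/maps") as f:
        labels = {line.split()[-1] for line in f if len(line.split()) >= 6}
    print("[stack]" in labels)
except FileNotFoundError:
    pass  # not on Linux; /proc is unavailable
```

Running this on Linux shows the `[stack]` region near the top of the address range and shared libraries in the memory-mapped region, matching the diagram above.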
Stack Frames
Each function call creates a stack frame -- a region of the stack that holds the function's local variables, saved registers, and the return address. The stack grows downward (toward lower addresses):
High addresses
+--------------------+
| argument 7+ | <- overflow arguments (7th and beyond)
+--------------------+
| return address | <- pushed by CALL instruction (8 bytes)
+--------------------+
| saved RBP | <- callee pushes old frame pointer
+--------------------+ <- RBP points here (frame pointer)
| local variable 1 |
| local variable 2 |
| saved registers | <- callee-saved registers (RBX, R12-R15)
| alignment padding |
+--------------------+ <- RSP points here (stack pointer)
Low addresses
; ── Function prologue ──
push rbp ; Save caller's base pointer (RSP -= 8, [RSP] = RBP)
mov rbp, rsp ; Set up our base pointer (RBP = RSP)
sub rsp, 32 ; Allocate 32 bytes for locals (RSP -= 32)
; ── Function body ──
mov [rbp-8], rdi ; Store 1st argument in local variable
mov [rbp-16], rsi ; Store 2nd argument
mov rax, [rbp-8] ; Load local variable
add rax, [rbp-16] ; Add second local
; ── Function epilogue ──
mov rsp, rbp ; Deallocate locals (RSP = RBP)
pop rbp ; Restore caller's base pointer
ret ; Pop return address into RIP, jump back
; Equivalent shorthand: "leave" = mov rsp,rbp + pop rbp
| Crash Type | Symptom | Common Cause |
|---|---|---|
| Stack overflow | SIGSEGV in deeply nested call | Unbounded recursion, very large local arrays |
| Stack buffer overflow | Corrupted return address | Writing past array bounds on stack |
| Use after return | Reading garbage values | Returning pointer to local variable |
| Misaligned stack | SIGSEGV in SSE/AVX instruction | Stack not 16-byte aligned before CALL |
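The first row of the table is easy to provoke safely from Python, which installs its own recursion limit precisely so that unbounded recursion raises an exception instead of overflowing the C stack and dying with SIGSEGV (a sketch; the limit's exact value is implementation-defined):

```python
import sys

def recurse(n):
    # Each call pushes a new frame; with no base case this never stops
    return recurse(n + 1)

try:
    recurse(0)
except RecursionError as e:
    err = e  # Python's guard fired before the real stack overflowed

print(type(err).__name__)   # RecursionError
print(sys.getrecursionlimit() > 0)
```

Raising the limit with `sys.setrecursionlimit` past what the 8 MB stack can hold converts this into an actual SIGSEGV, which is the C-level behavior in the table.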
Heap Allocation
The heap provides dynamic memory with programmer-controlled lifetime. Unlike the stack, heap allocations survive beyond the function that created them:
// Stack allocation: fast, automatic, limited size
int array[1000]; // On the stack (~4 KB)
// Freed automatically when function returns
// Heap allocation: slower, manual, virtually unlimited
int* array = malloc(1000 * sizeof(int)); // On the heap
// ... use array ...
free(array); // Must free manually (memory leak if forgotten)
array = NULL; // Defensive: prevent use-after-free
// C++ RAII: automatic cleanup via destructor
std::vector<int> vec(1000); // Heap-allocated, freed when vec goes out of scope
// PyTorch tensors: heap allocation via custom allocator
// torch::empty({1000}) calls c10::Allocator::raw_allocate()
// -> CPU: posix_memalign (64-byte aligned) or jemalloc
// -> CUDA: cudaMalloc (via caching allocator to amortize allocation cost)
| Property | Stack | Heap |
|---|---|---|
| Allocation cost | ~1 ns (move RSP) | ~100-1000 ns (allocator overhead) |
| Deallocation | Automatic (function return) | Manual (free/delete) or GC |
| Size limit | ~8 MB per thread (configurable) | Limited by available RAM |
| Fragmentation | None (LIFO discipline) | Yes (can waste memory over time) |
| Cache behavior | Excellent (recently used, contiguous) | Variable (depends on allocation pattern) |
| Thread safety | Per-thread (no contention) | Shared (allocator needs synchronization) |
Virtual Memory
Virtual memory provides three critical properties: isolation (each process gets its own address space), abstraction (programs use flat addresses without knowing physical layout), and overcommit (programs can map more memory than physically available).
| Concept | Description | ML Implication |
|---|---|---|
| Page | 4 KB unit of virtual-to-physical mapping | Large tensors span many pages; TLB misses add latency |
| Huge pages | 2 MB or 1 GB pages | Reduce TLB misses for large allocations; Linux transparent huge pages can back large CPU tensor buffers |
| Page fault | Access to unmapped page triggers OS handler | First touch on cudaMallocManaged memory triggers migration |
| TLB | Cache of virtual-to-physical translations | TLB misses cost ~100 cycles; huge pages reduce them |
| mmap | Map files or anonymous memory into address space | Used by data loaders for memory-mapped datasets |
| Copy-on-write | Pages shared until one process writes | fork() in DataLoader workers shares parent memory initially |
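Two rows of the table, page faults on first touch and zero-filled anonymous pages, can be observed directly with an anonymous `mmap` (a sketch using Python's `mmap` module; page size varies by platform):

```python
import mmap

PAGE = mmap.PAGESIZE          # typically 4096 bytes (16384 on some ARM systems)
m = mmap.mmap(-1, 16 * PAGE)  # 16 pages of anonymous memory, mapped lazily

m[0] = 0x41                   # first touch: a page fault makes the kernel
                              # back this virtual page with a physical one
assert m[0] == 0x41
assert m[PAGE] == 0           # untouched pages read as zero-fill
m.close()
```

Creating the mapping is cheap regardless of size; the real cost is paid page by page on first write, which is the same mechanism behind overcommit and `cudaMallocManaged` first-touch migration.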
Cache Lines and Alignment
The CPU does not read individual bytes from memory. It reads cache lines of 64 bytes. Every memory access brings an entire 64-byte cache line into L1 cache. This has profound implications for data layout:
// ── Aligned access: one cache line fetch ──
float array[16]; // 16 * 4 = 64 bytes = exactly one cache line
float val = array[0]; // Brings all 16 floats into cache
// ── Misaligned struct: wastes cache bandwidth ──
struct Bad {
char a; // 1 byte
// 7 bytes padding (align double to 8-byte boundary)
double b; // 8 bytes
char c; // 1 byte
// 7 bytes padding (align struct size to 8)
}; // Total: 24 bytes (only 10 bytes are data = 42% utilization)
// ── Reordered struct: group fields by size ──
struct Good {
double b; // 8 bytes
char a; // 1 byte
char c; // 1 byte
// 6 bytes padding
}; // Total: 16 bytes (10 bytes data = 63% utilization)
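The padding above is not hypothetical: `ctypes` structures follow the platform C ABI, so they reproduce exactly the layout the C compiler would produce (sizes below assume a 64-bit ABI where `double` is 8-byte aligned):

```python
import ctypes

class Bad(ctypes.Structure):
    _fields_ = [("a", ctypes.c_char),    # 1 byte, then 7 bytes padding
                ("b", ctypes.c_double),  # 8 bytes, must be 8-byte aligned
                ("c", ctypes.c_char)]    # 1 byte, then 7 bytes tail padding

class Good(ctypes.Structure):
    _fields_ = [("b", ctypes.c_double),  # 8 bytes
                ("a", ctypes.c_char),    # 1 byte
                ("c", ctypes.c_char)]    # 1 byte, then 6 bytes tail padding

print(ctypes.sizeof(Bad), ctypes.sizeof(Good))  # 24 16 on x86-64 / ARM64
```

Simply reordering the fields by decreasing size recovers a third of the memory, with a matching improvement in how many structs fit per cache line.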
| Pattern | Cache Lines Touched | Effective Bandwidth |
|---|---|---|
| Sequential array access | 1 per 16 floats | 100% (every byte used) |
| Stride-2 array access | 1 per 8 useful floats | 50% (half the bytes wasted) |
| Stride-16 array access | 1 per 1 useful float | 6% (60 of 64 bytes wasted) |
| Random access | 1 per element | ~6% (worst case) |
| Array of structs (AoS, padded) | Depends on struct size | Often 30-60% |
| Struct of arrays (SoA) | 1 per 16 floats per field | ~100% per field |
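The utilization figures in the table follow from arithmetic alone: count the distinct 64-byte lines a pattern touches versus the bytes it actually uses. A small sketch (array size and strides chosen to match the table's float32 examples):

```python
LINE = 64   # cache line size in bytes
ELEM = 4    # sizeof(float32)

def utilization(stride, n=1024):
    """Fraction of fetched cache-line bytes actually used when reading
    every `stride`-th element of an n-element float32 array."""
    touched_lines = {(i * stride * ELEM) // LINE for i in range(n // stride)}
    used_bytes = (n // stride) * ELEM
    return used_bytes / (len(touched_lines) * LINE)

print(utilization(1))    # 1.0    -> sequential: every byte used
print(utilization(2))    # 0.5    -> half of each line wasted
print(utilization(16))   # 0.0625 -> one float per line, ~6%
```

Stride-16 touches just as many lines as sequential access while using one float from each, which is why it delivers roughly 6% of the memory bandwidth.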
// ── C: force alignment for SIMD ──
float data[256] __attribute__((aligned(64))); // 64-byte aligned (cache line)
float avx_data[256] __attribute__((aligned(32))); // 32-byte aligned (AVX)
// ── C11 standard alignment ──
#include <stdalign.h>
alignas(64) float data[256];
// ── C++17: aligned allocation ──
auto* ptr = std::aligned_alloc(64, 256 * sizeof(float)); // size must be a multiple of the alignment; release with std::free(ptr)
// ── Python/NumPy: check alignment ──
import numpy as np
a = np.empty(256, dtype=np.float32)
print(a.ctypes.data % 64 == 0) # Often True, but not guaranteed: NumPy only promises malloc's alignment (typically 16 bytes)
// ── PyTorch: tensors are 64-byte aligned by default ──
import torch
t = torch.empty(256)
print(t.data_ptr() % 64 == 0) # True
// AoS: poor for SIMD, poor cache utilization if only accessing one field
struct Particle { float x, y, z, mass; };
Particle particles[1000];
// Accessing all x values: particles[0].x, particles[1].x, ...
// Memory layout: x0 y0 z0 m0 x1 y1 z1 m1 ...
// Stride-4 access pattern -> 25% cache utilization
// SoA: excellent for SIMD, excellent cache utilization per field
struct Particles {
float x[1000], y[1000], z[1000], mass[1000];
};
// Accessing all x values: sequential, one cache line per 16 floats
// Memory layout: x0 x1 x2 ... x999 y0 y1 y2 ...
PyTorch stores each tensor as one dense contiguous buffer (SoA-style when you keep one tensor per field), which is why operations on contiguous tensors are faster than on non-contiguous views: a strided view uses only part of each cache line it fetches.
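NumPy makes the two layouts directly visible: a structured array is AoS (fields interleaved in memory), while separate plain arrays are SoA. The strides show the access pattern each one produces (a sketch mirroring the Particle example above):

```python
import numpy as np

# AoS: one record per particle, fields interleaved (x0 y0 z0 m0 x1 y1 ...)
aos = np.zeros(1000, dtype=[("x", np.float32), ("y", np.float32),
                            ("z", np.float32), ("mass", np.float32)])
x_view = aos["x"]
print(x_view.strides)   # (16,) -> 16-byte hops: 4 of every 16 bytes used (25%)

# SoA: one dense array per field (x0 x1 x2 ...)
soa_x = np.zeros(1000, dtype=np.float32)
print(soa_x.strides)    # (4,)  -> dense: one cache line per 16 floats (100%)
```

The AoS field view is a non-contiguous view in NumPy's terms, the same situation as a non-contiguous PyTorch tensor.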
// BAD: counters[0] and counters[1] share a cache line
int counters[NUM_THREADS]; // 4 bytes apart -> same cache line
// Each thread writes counters[thread_id] -> false sharing
// GOOD: pad to cache line boundaries
struct alignas(64) PaddedCounter { int value; };
PaddedCounter counters[NUM_THREADS]; // 64 bytes apart -> separate cache lines
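The padded layout can be verified from Python with `ctypes` (a sketch; here the padding is written out explicitly as a 60-byte filler rather than relying on `alignas`, so it works on any CPython version):

```python
import ctypes

class PaddedCounter(ctypes.Structure):
    # 4-byte counter + 60 bytes of explicit padding = one full cache line
    _fields_ = [("value", ctypes.c_int),
                ("pad", ctypes.c_char * 60)]

counters = (PaddedCounter * 4)()
addrs = [ctypes.addressof(c) for c in counters]
print(addrs[1] - addrs[0])   # 64: each counter owns its own cache line
```

With 64 bytes between counters, two threads incrementing adjacent counters no longer invalidate each other's cache line, eliminating the false sharing.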