Debugging
Debugging deep learning code is uniquely challenging. Unlike traditional software bugs that produce immediate errors, ML bugs often manifest as silently wrong results: the model trains, produces numbers, but those numbers are incorrect. Training might converge to a suboptimal loss, or the model might fail to generalize. This chapter covers the most common bugs, systematic debugging strategies, and the tools PyTorch provides for diagnosing issues.
The Debugging Hierarchy
When something goes wrong with training, work through this checklist in order:
| Priority | Category | Examples | How to Detect |
|---|---|---|---|
| 1 | Shape errors | Wrong broadcasting, incorrect transpose | Assertions, explicit shape checks |
| 2 | Device errors | CPU/GPU mismatch | RuntimeError messages |
| 3 | Data bugs | Wrong labels, incorrect normalization, data leakage | Inspect random samples, check label distribution |
| 4 | Numerical issues | NaN/Inf, gradient explosion/vanishing | Monitor loss, grad norms, detect_anomaly() |
| 5 | Hyperparameter issues | LR too high/low, wrong schedule | Loss curves, learning rate sweeps |
| 6 | Architecture bugs | Wrong activation, missing layer, incorrect connection | Overfit a single batch first |
| 7 | Training bugs | Not zeroing grads, eval() not called, wrong loss | Code review, unit tests |
Anomaly Detection
PyTorch can track where NaN/Inf values originate by recording the operation that produced each tensor:
# Enable anomaly detection (adds significant overhead, debug only)
torch.autograd.set_detect_anomaly(True)
# As a context manager (scoped to specific code)
with torch.autograd.detect_anomaly():
output = model(input)
loss = criterion(output, target)
loss.backward()
# If any operation produces NaN, PyTorch will raise an error
# with the EXACT operation name and a traceback to where it was created
# Example error output:
# RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
# The above operation was created at:
# File "model.py", line 42, in forward
# return F.log_softmax(logits, dim=-1)
Use it only to locate the source of NaN/Inf. Once found, disable it and fix the underlying issue. Do not leave it enabled in production training.
Shape Debugging
Shape mismatches are the most common bug in deep learning code. They often produce incorrect results without raising an error (due to broadcasting):
class MyAttention(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
assert d_model % n_heads == 0, f"d_model ({d_model}) must be divisible by n_heads ({n_heads})"
self.d_model = d_model
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.q_proj = nn.Linear(d_model, d_model)
self.k_proj = nn.Linear(d_model, d_model)
self.v_proj = nn.Linear(d_model, d_model)
def forward(self, x):
B, T, D = x.shape
assert D == self.d_model, f"Expected d_model={self.d_model}, got {D}"
q = self.q_proj(x)
k = self.k_proj(x)
v = self.v_proj(x)
# Reshape for multi-head attention
# (B, T, D) -> (B, T, H, D/H) -> (B, H, T, D/H)
q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
assert q.shape == (B, self.n_heads, T, self.head_dim), \
f"q shape mismatch: expected {(B, self.n_heads, T, self.head_dim)}, got {q.shape}"
attn = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
assert attn.shape == (B, self.n_heads, T, T), \
f"attn shape mismatch: expected {(B, self.n_heads, T, T)}, got {attn.shape}"
return attn
def debug_tensor(name, tensor):
"""Print comprehensive tensor diagnostics."""
stats = {
'shape': tensor.shape,
'dtype': tensor.dtype,
'device': tensor.device,
'min': tensor.min().item(),
'max': tensor.max().item(),
'mean': tensor.float().mean().item(),
'std': tensor.float().std().item(),
'nan': tensor.isnan().sum().item(),
'inf': tensor.isinf().sum().item(),
'requires_grad': tensor.requires_grad,
}
parts = [f"{name}:"]
parts.append(f" shape={stats['shape']}, dtype={stats['dtype']}, device={stats['device']}")
parts.append(f" range=[{stats['min']:.4f}, {stats['max']:.4f}], "
f"mean={stats['mean']:.4f}, std={stats['std']:.4f}")
if stats['nan'] > 0 or stats['inf'] > 0:
parts.append(f" WARNING: {stats['nan']} NaN, {stats['inf']} Inf values!")
print('\n'.join(parts))
# Usage in forward pass
def forward(self, x):
debug_tensor("input", x)
x = self.encoder(x)
debug_tensor("after_encoder", x)
x = self.classifier(x)
debug_tensor("logits", x)
return x
NaN Detection and Prevention
def check_for_nan(model, loss, step):
"""Check for NaN in loss, parameters, and gradients."""
issues = []
if torch.isnan(loss):
issues.append(f"NaN loss at step {step}")
if torch.isinf(loss):
issues.append(f"Inf loss at step {step} (value: {loss.item()})")
for name, param in model.named_parameters():
if torch.isnan(param).any():
issues.append(f"NaN in parameter: {name}")
if param.grad is not None:
if torch.isnan(param.grad).any():
issues.append(f"NaN in gradient: {name}")
if torch.isinf(param.grad).any():
issues.append(f"Inf in gradient: {name} (norm: {param.grad.norm():.2e})")
if issues:
for issue in issues:
print(f" [NaN CHECK] {issue}")
raise ValueError(f"Numerical issues detected at step {step}")
# Integrate into training loop
for step, (data, target) in enumerate(loader):
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
# Check BEFORE backward (catches forward-pass NaN)
if step % 10 == 0:
check_for_nan(model, loss, step)
loss.backward()
# Check AFTER backward (catches backward-pass NaN)
if step % 10 == 0:
for name, param in model.named_parameters():
if param.grad is not None and torch.isnan(param.grad).any():
print(f"NaN gradient in {name} at step {step}")
| Cause | How It Happens | Symptom | Fix |
|---|---|---|---|
| LR too high | Large weight update overshoots | Loss spikes then NaN | Reduce LR; add warmup |
| Division by zero | 1/x where x can be 0 | NaN in specific layer | Add epsilon: x / (y + 1e-8) |
| Log of zero/negative | log(prob) where prob <= 0 | NaN in loss | Clamp: torch.log(x.clamp(min=1e-8)) |
| Exp overflow | exp(x) where x > 88 (FP32) | Inf then NaN | Use log-space: log_softmax instead of softmax + log |
| FP16 overflow | Gradients > 65504 | NaN with mixed precision | Use BF16, or increase GradScaler initial scale |
| FP16 underflow | Gradients below the smallest normal () lose precision in the subnormal range; full flush-to-zero near | Gradients become 0, loss plateaus | Use BF16, or GradScaler |
| Unstable normalization | LayerNorm/BatchNorm with zero variance | NaN after norm layer | Increase epsilon in norm layers |
| Attention score overflow | produces large values | NaN after softmax | Use scaled dot-product attention, or clamp scores |
| Incorrect loss function | Using loss that expects different input | Persistent NaN | Verify loss function API (log-probs vs logits) |
# BAD: log can produce -inf or NaN
loss = -torch.log(probs)
# GOOD: clamp to prevent log(0)
loss = -torch.log(probs.clamp(min=1e-8))
# BETTER: use log_softmax (numerically stable log + softmax)
loss = F.cross_entropy(logits, targets) # Uses log_softmax internally
# BAD: softmax then log (two passes, less stable)
probs = F.softmax(logits, dim=-1)
loss = -torch.log(probs[range(B), targets])
# GOOD: log_softmax (one pass, numerically stable)
log_probs = F.log_softmax(logits, dim=-1)
loss = -log_probs[range(B), targets].mean()
# BAD: division can produce Inf
normalized = x / x.norm()
# GOOD: add epsilon to denominator
normalized = x / (x.norm() + 1e-8)
# BEST: use built-in normalized function
normalized = F.normalize(x, dim=-1) # Handles zero-norm internally
Symptom. A transformer trains fine for a few hundred steps, then the loss prints nan and never recovers. There is no Python exception: the model keeps running, but every subsequent loss is nan.
Step 1: localize the operation. Wrap one step in anomaly detection so PyTorch reports the exact backward op and the line where the offending tensor was created:
with torch.autograd.detect_anomaly():
output = model(input)
loss = criterion(output, target)
loss.backward()
The error points at LogBackward0 created inside a custom loss that computes -torch.log(probs).
Step 2: what is the underlying cause?
The custom loss takes log of a probability that can reach exactly 0 (for example, after an aggressive softmax saturates one logit). log(0) is -inf, and -inf propagated through the backward pass becomes nan in the gradients, which then corrupts the weights so that every later step produces nan. The forward loss looked finite at first only because the probability had not yet collapsed to zero.
Step 3: what is the fix?
Replace the manual log of a probability with a numerically stable formulation. Either clamp the argument, or better, fuse the log into the softmax with log_softmax / cross_entropy:
# Before: -torch.log(probs) can hit log(0) = -inf
# After: let PyTorch fuse log and softmax for stability
loss = F.cross_entropy(logits, targets)
After the change, rerun the overfit-one-batch test (below) to confirm the loss now descends smoothly instead of diverging to nan.
This is the canonical loop: anomaly detection turns a silent nan into a named operation and source line (the diagnosis), the NaN-causes table maps that operation to a root cause, and the defensive-coding patterns supply the fix.
Memory Debugging
import torch
# Basic memory reporting
def print_memory():
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max Alloc: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
# Track peak memory for a code block
torch.cuda.reset_peak_memory_stats()
# ... your code ...
peak = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak memory: {peak:.2f} GB")
# Memory snapshot (PyTorch 2.1+) to visualize memory over time
torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run training steps ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None) # Stop recording
# Visualize at: https://pytorch.org/memory_viz
| Component | Memory | Formula | Typical for 7B Model |
|---|---|---|---|
| Model parameters | bytes (FP32) or (BF16) | (see formula) | 14 GB (BF16) |
| Gradients | bytes (FP32) or (BF16) | Same as params | 14 GB (BF16) |
| Optimizer state (Adam) | bytes (m + v, FP32) | 2x parameters | 56 GB |
| Activations (for backward) | Depends on batch size and model | 10-50 GB | |
| Total | (sum of rows) | ~ + activations | 94+ GB |
# LEAK 1: Storing loss tensors (keeps entire computation graph in GPU memory)
losses = []
for batch in loader:
loss = model(batch)
losses.append(loss) # BAD: keeps graph alive
# FIX: losses.append(loss.item()) # .item() extracts Python scalar
# LEAK 2: Not clearing CUDA cache after large allocations
big_tensor = torch.randn(10000, 10000, device='cuda')
del big_tensor
# Memory is returned to PyTorch's cache but NOT to the OS
torch.cuda.empty_cache() # Returns memory to CUDA driver (rarely needed)
# LEAK 3: Holding references to intermediate tensors
class BadModel(nn.Module):
def forward(self, x):
self.last_hidden = self.encoder(x) # BAD: stored as attribute
return self.decoder(self.last_hidden)
# FIX: Don't store intermediates, or detach them:
# self.last_hidden = self.encoder(x).detach()
# LEAK 4: Forgetting to delete tensors in evaluation
@torch.no_grad()
def evaluate(model, loader):
all_preds = []
for batch in loader:
output = model(batch.cuda())
all_preds.append(output.cpu()) # Move to CPU to free GPU memory
return torch.cat(all_preds)
The "Overfit One Batch" Test
The single most useful debugging technique: verify that your model can perfectly fit a single batch of training data.
def overfit_one_batch(model, dataset, device='cuda', steps=1000, lr=1e-3):
"""Verify model can memorize a single batch.
If this fails, the model or training code has a bug."""
model = model.to(device).train()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
# Get one batch
loader = DataLoader(dataset, batch_size=32, shuffle=False)
data, target = next(iter(loader))
data, target = data.to(device), target.to(device)
for step in range(steps):
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
if step % 100 == 0:
acc = (output.argmax(1) == target).float().mean()
print(f"Step {step}: loss={loss.item():.4f}, acc={acc:.2%}")
# Should reach near-zero loss and ~100% accuracy
final_acc = (model(data).argmax(1) == target).float().mean()
assert final_acc > 0.99, f"Failed to overfit one batch! Final acc: {final_acc:.2%}"
print("PASSED: model can overfit one batch")
# Run this BEFORE full training to catch bugs early
overfit_one_batch(model, train_dataset)
| Symptom | Likely Cause |
|---|---|
| Loss does not decrease at all | Gradients not flowing (detach, wrong loss, frozen params) |
| Loss decreases but accuracy stays at random | Wrong loss function for the task |
| Loss goes to NaN immediately | Numerical issue (see NaN table above) |
| Loss decreases very slowly | LR too low, or model too small for the task |
| Loss oscillates wildly | LR too high |
| Accuracy saturates below 100% | Architecture cannot represent the mapping (too few params, wrong activation) |
Reproducibility
import torch
import numpy as np
import random
def set_seed(seed=42):
"""Set all random seeds for reproducibility."""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Make CuDNN deterministic (may reduce performance)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Enable deterministic algorithms (PyTorch 2.0+)
torch.use_deterministic_algorithms(True)
# Note: some operations do not have deterministic implementations
# and will raise an error. Set CUBLAS_WORKSPACE_CONFIG=:4096:8
set_seed(42)
| Source | How to Control | Performance Impact |
|---|---|---|
Python random | random.seed(seed) | None |
| NumPy | np.random.seed(seed) | None |
| PyTorch CPU | torch.manual_seed(seed) | None |
| PyTorch GPU | torch.cuda.manual_seed_all(seed) | None |
| CuDNN autotuner | torch.backends.cudnn.benchmark = False | Up to 2x slower (first few iterations) |
| CuDNN algorithms | torch.backends.cudnn.deterministic = True | Slight slowdown |
| Nondeterministic ops | torch.use_deterministic_algorithms(True) | Some ops slower or error |
| Data loading order | DataLoader(shuffle=False) or seeded generator | None |
| Multi-worker loading | worker_init_fn with deterministic seeds | None |
| Dropout | Controlled by PyTorch seed | None |
| Data augmentation | Controlled by NumPy/random seed | None |
In practice: Full determinism is useful for debugging but unnecessary (and slower) for production training. Use it to verify that a code change does not break training, then disable for actual runs.
Debugging Checklist
| Step | Action | Catches |
|---|---|---|
| 1 | Overfit one batch | Model/code bugs, wrong loss function |
| 2 | Check input data (visualize samples, verify labels) | Data pipeline bugs, wrong preprocessing |
| 3 | Verify shapes at each layer | Broadcasting bugs, wrong dimensions |
| 4 | Monitor loss, gradient norms, LR per step | Training instability, bad hyperparams |
| 5 | Compare against a known baseline | Architecture bugs, missing components |
| 6 | Train on a small subset first | Verify learning before spending compute |
| 7 | Check train vs val gap | Overfitting or underfitting |
| 8 | Profile with PyTorch profiler | Performance bottlenecks |