Tensors
Tensors are the fundamental data structure of PyTorch and deep learning. A tensor is a multi-dimensional array, a generalization of scalars (0D), vectors (1D), matrices (2D), and higher-dimensional arrays (3D+). Every operation in a neural network, from storing weights to computing gradients, operates on tensors. Understanding tensor creation, memory layout, dtypes, and broadcasting is essential for writing correct and efficient PyTorch code.
Creating Tensors
import torch
# From Python data
x = torch.tensor([1.0, 2.0, 3.0]) # 1D tensor from list
X = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32) # 2D tensor with explicit dtype
# Factory functions (create tensors with specific patterns)
zeros = torch.zeros(3, 4) # All zeros, shape (3, 4)
ones = torch.ones(3, 4) # All ones, shape (3, 4)
rand = torch.randn(3, 4) # Standard normal N(0,1)
rand_uniform = torch.rand(3, 4) # Uniform [0, 1)
arange = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
eye = torch.eye(3) # 3x3 identity matrix
empty = torch.empty(3, 4) # Uninitialized (fast, but contains garbage)
# "Like" functions: match dtype, device, and layout of an existing tensor
x_like = torch.zeros_like(X) # Same shape, dtype, device as X
x_new = X.new_zeros(5, 5) # Same dtype and device, different shape
x_rand = torch.randn_like(X) # Same shape, random values
# From NumPy (shares memory, zero-copy)
import numpy as np
np_array = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(np_array) # Shares memory with np_array
np_back = t.numpy() # Back to NumPy (only works on CPU tensors)
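Because torch.from_numpy shares the buffer instead of copying it, a write on either side is visible on the other. The following is a minimal sketch of that behavior, with torch.tensor(...) shown as the copying alternative; the variable names `shared` and `copied` are illustrative only.

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(np_array)   # zero-copy: same underlying buffer
copied = torch.tensor(np_array)       # torch.tensor always copies its input

np_array[0] = 100.0                   # write through the NumPy side

print(shared[0].item())   # 100.0: the shared tensor sees the change
print(copied[0].item())   # 1.0: the copy is unaffected
print(shared.dtype)       # torch.float64: NumPy's float64 dtype is preserved
```

Note that the shared tensor keeps NumPy's float64 dtype, so a conversion such as .float() is usually needed before mixing it with float32 model weights, and that conversion does copy.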
| Function | Distribution | When to Use |
|---|---|---|
| torch.randn(...) | N(0, 1) | Starting point; scale by desired std |
| torch.empty(...).uniform_(-a, a) | U(-a, a) | Kaiming uniform (default for nn.Linear) |
| nn.init.kaiming_normal_(w) | N(0, σ² = 2/fan_in) | ReLU layers |
| nn.init.xavier_normal_(w) | N(0, σ² = 2/(fan_in + fan_out)) | Sigmoid/Tanh layers |
| torch.zeros(...) | Constant 0 | Biases, residual branch output |
Using torch.empty() is fastest (no initialization), but the tensor contains whatever was previously in that memory, so always initialize before use.
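As a rough check of the initializers listed above, the sketch below fills an uninitialized weight in place and compares the sample standard deviation with the usual closed-form values. The sizes `fan_in` and `fan_out` are arbitrary values chosen for illustration, not taken from the text.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 256            # illustrative sizes (assumption)

w = torch.empty(fan_out, fan_in)      # uninitialized: contents are arbitrary
nn.init.kaiming_normal_(w, nonlinearity='relu')   # in-place, std = sqrt(2 / fan_in)
kaiming_std = w.std().item()

nn.init.xavier_normal_(w)             # in-place, std = sqrt(2 / (fan_in + fan_out))
xavier_std = w.std().item()

# Sample std next to the formula value for each initializer
print(kaiming_std, math.sqrt(2.0 / fan_in))
print(xavier_std, math.sqrt(2.0 / (fan_in + fan_out)))
```

The sample values will not match the formulas exactly, since they are estimates from a finite number of draws.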
Data Types (dtypes)
| Dtype | Bits | Significand | Exponent | Range | Use Case |
|---|---|---|---|---|---|
| torch.float32 (float) | 32 | 23 bits (~7 digits) | 8 bits | ±3.4e38 | Default for training; master weights |
| torch.float16 (half) | 16 | 10 bits (~3.3 digits) | 5 bits | ±65504 | Mixed precision (requires loss scaling) |
| torch.bfloat16 | 16 | 7 bits (~2.4 digits) | 8 bits | ±3.4e38 | Preferred for LLM training (no scaling needed) |
| torch.float64 (double) | 64 | 52 bits (~15 digits) | 11 bits | ±1.8e308 | Numerical testing, scientific computing |
| torch.float8_e4m3fn | 8 | 3 bits | 4 bits | ±448 | H100 inference/training (FP8) |
| torch.float8_e5m2 | 8 | 2 bits | 5 bits | ±57344 | FP8 gradients (wider range) |
| torch.int64 (long) | 64 | n/a | n/a | -2^63 to 2^63 - 1 | Indices, token IDs, labels |
| torch.int32 (int) | 32 | n/a | n/a | -2^31 to 2^31 - 1 | Indices (when int64 is wasteful) |
| torch.int8 | 8 | n/a | n/a | -128 to 127 | Quantized inference |
| torch.bool | 8 | n/a | n/a | True/False | Attention masks, conditions |
x = torch.randn(3, 3) # Default: float32
# Convert dtype (creates a copy with the new type)
x_half = x.half() # float32 -> float16
x_bf16 = x.bfloat16() # float32 -> bfloat16
x_back = x_half.float() # float16 -> float32 (does NOT recover lost precision)
# Specify the dtype at creation (avoids a separate conversion copy)
x = torch.randn(3, 3, dtype=torch.bfloat16) # Created directly in bfloat16
# Mixed precision: autocast handles conversions automatically
# (model and input are assumed to be an existing module and batch on the GPU)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    # GEMMs run in bfloat16 (Tensor Cores), reductions in float32
    output = model(input)
| Property | FP16 | BF16 |
|---|---|---|
| Precision | ~3.3 decimal digits | ~2.4 decimal digits |
| Dynamic range | ~6e-5 to 65504 (normal values) | ~1e-38 to 3.4e38 (same as FP32) |
| Overflow risk | High (gradients > 65504 overflow to Inf) | Essentially none |
| Requires loss scaling | Yes (GradScaler) | No |
| Tensor Core support | All Tensor Core GPUs | Ampere (A100) and newer |
BF16 trades precision for range. Since training gradients can span many orders of magnitude, the wider range of BF16 avoids the gradient overflow problem that makes FP16 training fragile. The slight precision loss (~1 decimal digit) rarely affects convergence.
Rule of thumb: Use BF16 on Ampere and newer GPUs. Fall back to FP16 with GradScaler only on older Tensor Core GPUs such as the V100, which lack BF16 Tensor Core support.
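The range-versus-precision trade-off can be inspected directly with torch.finfo. The sketch below runs on CPU; the probe values 70000.0 and 1.001 are arbitrary choices for illustration.

```python
import torch

# Largest finite value and machine epsilon for each floating dtype
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "max =", info.max, "eps =", info.eps)

big = torch.tensor(70000.0)               # float32 value above the FP16 maximum of 65504
as_fp16 = big.to(torch.float16)
as_bf16 = big.to(torch.bfloat16)
print(torch.isinf(as_fp16).item())        # True: overflows to inf in FP16
print(torch.isfinite(as_bf16).item())     # True: in range for BF16, though rounded

fine = torch.tensor(1.001)
print(fine.to(torch.float16).item())      # stays close to 1.001
print(fine.to(torch.bfloat16).item())     # 1.0: BF16 spacing near 1 is 2**-7
```

The first half shows why FP16 needs loss scaling (large values become inf), and the second half shows the precision BF16 gives up in exchange.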
Device Placement
# Create directly on GPU (preferred: avoids CPU -> GPU copy)
x = torch.randn(3, 3, device='cuda') # Default GPU (cuda:0)
x = torch.randn(3, 3, device='cuda:1') # Specific GPU
# Move existing tensor to GPU (copies data)
x_cpu = torch.randn(3, 3)
x_gpu = x_cpu.cuda() # .cuda() method
x_gpu = x_cpu.to('cuda:0') # .to() method (more flexible)
# Move back to CPU
x_cpu = x_gpu.cpu()
# Check device
print(x_gpu.device) # cuda:0
print(x_gpu.is_cuda) # True
# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(3, 3, device=device)
model = model.to(device)
# This raises RuntimeError: Expected all tensors to be on the same device
cpu_tensor = torch.randn(3)
gpu_tensor = torch.randn(3, device='cuda')
result = cpu_tensor + gpu_tensor # ERROR
# Fix: move to the same device
result = cpu_tensor.to(gpu_tensor.device) + gpu_tensor
When debugging device errors, check .device on all tensors involved in the operation.
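One way to make that check routine is a small helper that reports where each tensor lives. The functions `describe_devices` and `all_on_same_device` below are hypothetical helpers written for this sketch, not part of PyTorch.

```python
import torch

def describe_devices(**tensors):
    """Hypothetical debugging helper: map each tensor name to its device string."""
    return {name: str(t.device) for name, t in tensors.items()}

def all_on_same_device(*tensors):
    """Return True when every tensor reports the same device."""
    return len({t.device for t in tensors}) <= 1

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
a = torch.randn(3)                    # always created on the CPU
b = torch.randn(3, device=device)     # on the GPU when one is available

print(describe_devices(a=a, b=b))
if not all_on_same_device(a, b):
    a = a.to(b.device)                # move the stray tensor before combining
result = a + b
print(result.device)
```

On a CPU-only machine both tensors are already on the same device, so the move is skipped and the script still runs.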
# Allocate pinned memory (stays in physical RAM, faster GPU transfers)
x = torch.randn(1024, 1024, pin_memory=True)
x_gpu = x.cuda(non_blocking=True) # Async copy, overlaps with compute
# DataLoader with pinned memory (the most common use)
# (dataset is assumed to be an existing torch.utils.data.Dataset)
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=4)
for batch in loader:
    batch = batch.cuda(non_blocking=True) # Overlaps with next batch loading
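Setting pin_memory=True in the DataLoader and calling .cuda(non_blocking=True) is the standard pattern for overlapping host-to-device copies with GPU computation; the copy is only asynchronous when the source is in pinned memory. The gain depends on how much of each step is spent on data transfer; a 10-30% speedup is typical when input loading is a bottleneck.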
Memory Layout: Strides and Contiguity
Under the hood, every tensor is a view into a contiguous block of memory (the storage). The tensor's shape, strides, and offset determine which elements of the storage it accesses:
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
print(x.shape) # torch.Size([2, 3])
print(x.stride()) # (3, 1): row 0 to row 1 jumps 3 elements; col to col jumps 1
print(x.storage_offset()) # 0
print(x.is_contiguous()) # True
# Memory layout (row-major / C-order):
# Storage: [1, 2, 3, 4, 5, 6]
#           ^        ^
#           row 0    row 1
# Transpose changes strides, NOT data
y = x.T
print(y.shape) # torch.Size([3, 2])
print(y.stride()) # (1, 3): strides are swapped
print(y.is_contiguous()) # False (stride[0] < stride[1])
# y still points to the SAME storage: [1, 2, 3, 4, 5, 6]
# But y[0] = [1, 4] (stride 3), y[1] = [2, 5] (stride 3)
# Slicing creates a view with offset and modified strides
z = x[:, 1:] # Columns 1 and 2
print(z.stride()) # (3, 1): inherited from x, not the (2, 1) a fresh (2, 2) tensor would have
print(z.storage_offset()) # 1: starts at element 1 in storage
print(z.is_contiguous()) # False (contiguous (2, 2) needs stride (2, 1); stride[i] must equal the product of sizes after dim i)
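The contiguity rule quoted in the last comment (each stride equals the product of the sizes after that dimension) can be written out as a few lines of Python. `contiguous_strides` is a hypothetical helper for this sketch; PyTorch's own check additionally ignores the strides of size-1 dimensions, which this simplified version does not model.

```python
import torch

def contiguous_strides(shape):
    """Hypothetical helper: row-major strides (in elements) for a given shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)       # stride = product of all sizes to the right
        running *= size
    return tuple(reversed(strides))

x = torch.arange(6).reshape(2, 3)
views = {"x": x, "x.T": x.T, "x[:, 1:]": x[:, 1:]}

for name, t in views.items():
    expected = contiguous_strides(t.shape)
    # A tensor without size-1 dims is contiguous exactly when these two match
    print(name, tuple(t.shape), "actual:", t.stride(), "contiguous:", expected,
          "is_contiguous:", t.is_contiguous())
```

Only x matches its row-major strides; the transpose and the column slice keep strides inherited from x, which is why both report is_contiguous() as False.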
Views vs Copies
A view shares memory with the original tensor: modifying one modifies the other. A copy allocates new memory.
x = torch.randn(4, 4)
# VIEW operations (no data copy, shared memory)
y = x.view(2, 8) # Reshape (never copies; raises if the strides are incompatible, e.g. after a transpose)
y = x.reshape(2, 8) # Reshape (returns view if possible, copies if not)
y = x[0:2] # Slice
y = x.T # Transpose
y = x.unsqueeze(0) # Add dimension: (4,4) -> (1,4,4)
y = x.squeeze() # Remove size-1 dimensions
y = x.expand(3, 4, 4) # Broadcast to larger size (no copy)
y = x.permute(1, 0) # Reorder dimensions
# COPY operations (allocate new memory)
y = x.clone() # Explicit copy (stays in the autograd graph)
y = x.detach().clone() # Copy without autograd history
y = x.to(torch.float16) # Dtype conversion copies (returns self if x already has that dtype)
y = x.to('cuda') # Device transfer copies (returns self if x is already on that device)
# MAY-COPY operation (copies only when needed)
y = x.contiguous() # Copies if non-contiguous; no-op (returns self) if already contiguous
# Demonstrate that views share memory
x = torch.randn(4, 4)
y = x.view(2, 8)
y[0, 0] = 999.0
print(x[0, 0]) # 999.0: y is a view, so modifying y modifies x
# This is a feature, not a bug: views are essential for memory efficiency.
# But it can cause subtle bugs if you modify a view unexpectedly.
# Safe pattern: clone if you need independence
y = x.view(2, 8).clone()
y[0, 0] = 999.0
print(x[0, 0]) # Not 999: y is an independent copy
This example traces how shape, stride, and offset map a logical index to a physical storage location. The address formula is storage_index = offset + sum(index[i] * stride[i]). The first lookup is fully worked; the next two are marked "Your turn", with the answer shown so you can check your own reasoning.
x = torch.arange(6).reshape(2, 3)
# Storage: [0, 1, 2, 3, 4, 5], shape (2, 3), stride (3, 1), offset 0
# Worked: x[1, 2] lives at offset + 1*stride[0] + 2*stride[1]
# = 0 + 1*3 + 2*1 = 5 -> storage[5] = 5
print(x[1, 2]) # 5
y = x.T
# y has shape (3, 2), stride (1, 3), offset 0 (same storage)
# Your turn: y[2, 0] lives at 0 + 2*stride[0] + 0*stride[1]
# = 0 + 2*1 + 0*3 = 2 -> storage[2] = 2
print(y[2, 0]) # 2
z = x[:, 1:]
# z has shape (2, 2), stride (3, 1), offset 1
# Your turn: z[1, 1] lives at 1 + 1*stride[0] + 1*stride[1]
# = 1 + 1*3 + 1*1 = 5 -> storage[5] = 5
print(z[1, 1]) # 5
# Why z is non-contiguous: a fresh contiguous (2, 2) needs stride (2, 1),
# but z inherited stride (3, 1) from x, so its rows skip a storage slot.
print(z.is_contiguous()) # False
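The address formula can also be checked mechanically for every element rather than one index at a time. In this sketch `storage_index` is a hypothetical helper, and the flat list stands in for the storage (valid here because x itself is contiguous with offset 0).

```python
import itertools
import torch

def storage_index(t, index):
    """Hypothetical helper: flat storage position of t[index], in elements."""
    return t.storage_offset() + sum(i * s for i, s in zip(index, t.stride()))

x = torch.arange(6).reshape(2, 3)
flat = x.flatten().tolist()   # x is contiguous, so this matches storage order

for name, t in {"x": x, "x.T": x.T, "x[:, 1:]": x[:, 1:]}.items():
    for index in itertools.product(*(range(n) for n in t.shape)):
        pos = storage_index(t, index)
        assert flat[pos] == t[index].item()   # formula agrees with real indexing
    print(name, "ok")

print(storage_index(x[:, 1:], (1, 1)))   # 5, matching the worked example
```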
In practice, reshape() is the safest default. Use view() when you want an error if the tensor is unexpectedly non-contiguous (often a sign of a bug).
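The difference between the two calls shows up on a transposed tensor. The sketch below assumes PyTorch 2.0 or newer for untyped_storage(); `shares_storage` is a hypothetical helper, not a library function.

```python
import torch

def shares_storage(a, b):
    """Hypothetical helper: True when both tensors use the same underlying storage."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.arange(6).reshape(2, 3)
t = x.T                                      # non-contiguous view of x

print(shares_storage(x, x.reshape(3, 2)))    # True: contiguous input, reshape returns a view
print(shares_storage(x, t.reshape(6)))       # False: reshape had to copy

view_failed = False
try:
    t.view(6)                                # view never copies, so it refuses instead
except RuntimeError as err:
    view_failed = True
    print("view failed:", type(err).__name__)
```

The silent copy made by reshape() is convenient, but it also means writes to the result no longer reach x, which is the case where an explicit view() error is the more useful signal.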
Broadcasting
Broadcasting automatically expands tensor dimensions to make shapes compatible for elementwise operations. It is the mechanism behind seemingly impossible operations like adding a vector to a matrix:
# Broadcasting rules (applied right-to-left):
# 1. Align shapes from the right
# 2. For each dimension: sizes must be equal, OR one of them must be 1
# 3. Size-1 dimensions are "stretched" to match the other tensor
# Example: adding a bias vector to a batch of features
features = torch.randn(32, 10) # Shape: (32, 10), 32 samples, 10 features
bias = torch.randn(10) # Shape: (10,) -> treated as (1, 10) -> broadcast to (32, 10)
result = features + bias # Shape: (32, 10), bias added to each sample
# Example: batch outer product
a = torch.randn(32, 5, 1) # (32, 5, 1)
b = torch.randn(32, 1, 7) # (32, 1, 7)
outer = a * b # (32, 5, 7): outer product per batch element
# Example: attention mask broadcasting
# Q*K^T: (B, H, T, T) B=batch, H=heads, T=sequence length
# Mask: (B, 1, 1, T) broadcasts across H and query positions
scores = torch.randn(4, 8, 64, 64) # (B, H, T, T)
mask = torch.zeros(4, 1, 1, 64) # (B, 1, 1, T): additive mask (0 = keep, -inf = masked out)
masked_scores = scores + mask # (4, 8, 64, 64)
# Common broadcasting shapes in ML:
# (B, T, D) + (D,) -> per-feature bias
# (B, T, D) * (B, T, 1) -> per-position scaling (gating)
# (B, H, T, T) + (1, 1, T, T) -> attention bias shared across batch and heads
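The three broadcasting rules are short enough to sketch in plain Python and compare against torch.broadcast_shapes. `broadcast_shape` is a hypothetical helper for two shapes only, written for illustration.

```python
import torch

def broadcast_shape(a, b):
    """Sketch of the broadcasting rule for two shapes (hypothetical helper)."""
    result = []
    # Rule 1: walk both shapes from the right; a missing dimension counts as size 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        # Rule 2: sizes must be equal, or one of them must be 1.
        if da != db and da != 1 and db != 1:
            raise ValueError(f"incompatible sizes {da} and {db} at dim -{i}")
        # Rule 3: a size-1 dimension is stretched to the other size.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))

pairs = [((32, 10), (10,)), ((32, 5, 1), (32, 1, 7)), ((4, 8, 64, 64), (4, 1, 1, 64))]
for a, b in pairs:
    ours = broadcast_shape(a, b)
    assert ours == tuple(torch.broadcast_shapes(a, b))   # agrees with PyTorch
    print(a, b, "->", ours)
```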
# Pitfall: (N, 1) combined with (N,) silently broadcasts to (N, N)
# Intended: subtract a per-sample target from a per-sample prediction, element-wise
pred = torch.randn(32, 1) # (32, 1): one prediction per sample, kept as a column
target = torch.randn(32) # (32,): one target per sample
# This does NOT error! Right-aligned, (32,) is treated as (1, 32), so (32, 1) vs (1, 32)
# silently expands both operands to (32, 32) instead of the intended (32, 1).
loss = (pred - target) ** 2 # WRONG: silently broadcasts to (32, 32)!
loss = (pred - target.unsqueeze(1)) ** 2 # Correct: target -> (32, 1), result is (32, 1)
Defense: Always assert shapes at module boundaries:
assert logits.shape == (batch_size, num_classes), f"Got {logits.shape}"
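A slightly more reusable form of that assertion is sketched below, together with torch.broadcast_shapes, which predicts a result shape without doing the arithmetic. `expect_shape` is a hypothetical helper, and the tensor sizes are illustrative.

```python
import torch

def expect_shape(t, expected, name="tensor"):
    """Hypothetical helper: raise if t does not have exactly the expected shape."""
    if tuple(t.shape) != tuple(expected):
        raise ValueError(f"{name}: expected {tuple(expected)}, got {tuple(t.shape)}")
    return t

pred = torch.randn(32, 1)
target = torch.randn(32)

# Predict the broadcast result before computing anything
print(torch.broadcast_shapes(pred.shape, target.shape))   # torch.Size([32, 32])

caught = False
try:
    expect_shape(pred - target, (32, 1), name="residual")
except ValueError as err:
    caught = True
    print(err)   # the silent (32, 32) expansion becomes a loud failure

residual = expect_shape(pred - target.unsqueeze(1), (32, 1), name="residual")
```

Returning the tensor from the helper lets the check sit inline in an expression, so it costs one extra call rather than a separate statement.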
Essential Tensor Operations
| Category | Operations | Notes |
|---|---|---|
| Arithmetic | +, -, *, /, **, @ (matmul) | Elementwise except @ |
| Reduction | sum, mean, max, min, prod, norm | Specify dim to reduce along an axis |
| Comparison | ==, !=, >, <, >=, <= | Returns bool tensor |
| Indexing | [], index_select, gather, scatter_ | gather/scatter_ for advanced indexing |
| Shape | view, reshape, permute, transpose, squeeze, unsqueeze, expand | Views when possible |
| Concatenation | cat (along existing dim), stack (new dim) | cat does not add a dimension |
| Linear algebra | matmul, mm, bmm, linalg.svd, linalg.eig, linalg.solve | bmm for batched matmul; decompositions and solvers live in torch.linalg |
| In-place | add_, mul_, zero_, fill_, copy_ | Trailing _ = in-place; avoid with autograd |
x = torch.randn(3, requires_grad=True)
y = x.exp() # exp() saves its output y for the backward pass
y.add_(1) # In-place modification overwrites that saved value
y.sum().backward() # RuntimeError: "one of the variables needed for gradient
# computation has been modified by an inplace operation"
Rule: Avoid in-place operations on tensors that require gradients or are inputs to other operations in the computation graph. In-place ops are safe on tensors you create for temporary computation (e.g., modifying a buffer).
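To see the rule in action, the sketch below contrasts an out-of-place update with the in-place equivalent on a tensor that autograd has saved. It relies on exp() saving its output for the backward pass and reads the tensor's `_version` counter, an internal attribute used here only for illustration.

```python
import torch

x = torch.randn(3, requires_grad=True)

# Out-of-place: the output saved by exp() stays intact, so backward works.
y = x.exp()
z = y + 1
z.sum().backward()
print(x.grad is not None)        # True

# In-place: add_ overwrites the tensor that exp() saved for its backward pass.
x.grad = None
y = x.exp()
print(y._version)                # 0: not modified since creation
y.add_(1)
print(y._version)                # 1: the version counter recorded the write

failed = False
try:
    y.sum().backward()
except RuntimeError as err:
    failed = True
    print("backward failed:", str(err)[:60])
```

Autograd compares the version recorded when the tensor was saved with the current one, which is how it detects the modification instead of silently computing a wrong gradient.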