The Singular Value Decomposition (SVD) is arguably the most important factorization in all of applied mathematics. It generalizes eigendecomposition to rectangular matrices, provides the optimal low-rank approximation to any matrix, and is the mathematical engine behind PCA, latent semantic analysis, recommender systems, and LoRA. Every data scientist who has reduced dimensionality, every engineer who has compressed a model, and every researcher who has analyzed a weight matrix has used the SVD -- whether they knew it or not.
**Every** matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily square) can be factored as:
$$A = U \Sigma V^\top$$
where:
- $U \in \mathbb{R}^{m \times m}$ is orthogonal ($U^\top U = I$); its columns are the **left singular vectors**
- $V \in \mathbb{R}^{n \times n}$ is orthogonal ($V^\top V = I$); its columns are the **right singular vectors**
- $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ called **singular values**, where $r = \operatorname{rank}(A)$
This decomposition always exists and the singular values are unique (though U and V may not be unique when singular values are repeated).
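The factorization is easy to verify numerically; a minimal NumPy sketch (note that `np.linalg.svd` returns $V^\top$, not $V$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                 # a rectangular matrix

U, s, Vt = np.linalg.svd(A)                 # U: 5x5, s: (3,), Vt: 3x3
Sigma = np.zeros_like(A)                    # rebuild the 5x3 diagonal Σ
Sigma[:3, :3] = np.diag(s)

assert np.allclose(U @ Sigma @ Vt, A)       # A = U Σ Vᵀ
assert np.allclose(U.T @ U, np.eye(5))      # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))    # V is orthogonal
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)  # σ₁ ≥ σ₂ ≥ ⋯ ≥ 0
```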
**Connection to eigendecomposition.** The singular values and vectors are derived from the eigendecompositions of $A^\top A$ and $AA^\top$:

- $A^\top A = V \Sigma^\top \Sigma V^\top$: the right singular vectors $V$ are eigenvectors of $A^\top A$, and the $\sigma_i^2$ are its eigenvalues
- $A A^\top = U \Sigma \Sigma^\top U^\top$: the left singular vectors $U$ are eigenvectors of $A A^\top$
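Both identities can be checked directly (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of AᵀA (sorted descending) equal the squared singular values
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(evals, s**2)

# Each right singular vector satisfies (AᵀA) vᵢ = σᵢ² vᵢ
for i in range(4):
    v = Vt[i]
    assert np.allclose(A.T @ A @ v, s[i]**2 * v)
```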
**Geometric interpretation.** Every linear transformation $x \mapsto Ax$ is equivalent to three steps:
1. **Rotate** (by $V^\top$): align the input with the principal axes of the transformation
2. **Scale** (by $\Sigma$): stretch or compress along each axis by $\sigma_i$
3. **Rotate** (by $U$): rotate the scaled result into the output space
This reveals the deep structure: every linear map is a rotation--scale--rotation. The singular values tell you how much each dimension is stretched or compressed.
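The rotate--scale--rotate picture can be verified step by step (a small NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(2, 2))
U, s, Vt = np.linalg.svd(A)

x = rng.normal(size=2)
step1 = Vt @ x            # rotate into the principal axes
step2 = s * step1         # scale each axis by σᵢ
step3 = U @ step2         # rotate into the output space
assert np.allclose(step3, A @ x)   # the three steps equal Ax
```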
**SVD of an image.** A grayscale image of size $m \times n$ can be decomposed via SVD. The rank-$k$ approximation $A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$ uses only $k(m + n + 1)$ numbers instead of $mn$. For a $1000 \times 1000$ image with $k = 50$: storage drops from $10^6$ to $\sim 10^5$ (10x compression) while retaining $\sum_{i=1}^{50}\sigma_i^2 / \sum_i \sigma_i^2$ of the image's energy.
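A sketch of this compression on a synthetic low-rank "image" (assuming only NumPy; a real image would simply be loaded as a 2-D grayscale array):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic "image": smooth low-rank structure plus a little noise
m, n, k = 100, 80, 10
img = np.outer(np.sin(np.linspace(0, 3, m)), np.cos(np.linspace(0, 2, n))) \
      + 0.01 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(img, full_matrices=False)
img_k = U[:, :k] * s[:k] @ Vt[:k]       # rank-k approximation Σ σᵢ uᵢ vᵢᵀ

storage_full = m * n                     # mn numbers
storage_k = k * (m + n + 1)              # k(m + n + 1) numbers
energy = (s[:k]**2).sum() / (s**2).sum()
assert storage_k < storage_full
assert energy > 0.99                     # nearly all energy in the top components
```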
The truncated SVD $A_k = U_k \Sigma_k V_k^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ is the **best rank-$k$ approximation** to $A$ in both the Frobenius and spectral norms (Eckart--Young):

$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$$

No other rank-$k$ matrix achieves a smaller error in either norm.
**Proof sketch (Frobenius).** Write $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$. The squared Frobenius norm is $\|A\|_F^2 = \sum_i \sigma_i^2$ (since the $u_i v_i^\top$ are orthonormal in the Frobenius inner product). The error of any rank-$k$ matrix $B$ satisfies $\|A - B\|_F^2 \ge \sum_{i=k+1}^{r} \sigma_i^2$ (by the min--max characterization of singular values). The truncated SVD achieves this bound with equality. $\square$
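Numerically, the truncated SVD hits both error bounds exactly, while a random rank-$k$ matrix does no better (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] * s[:k] @ Vt[:k]          # truncated SVD

# Eckart--Young: the errors equal the tail singular values exactly
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])                 # spectral
assert np.isclose(np.linalg.norm(A - A_k, 'fro'),
                  np.sqrt((s[k:]**2).sum()))                        # Frobenius

# An arbitrary rank-k matrix cannot beat the truncated SVD
B = rng.normal(size=(8, k)) @ rng.normal(size=(k, 6))
assert np.linalg.norm(A - B, 'fro') >= np.linalg.norm(A - A_k, 'fro')
```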
**When is low-rank approximation effective?** The approximation quality depends on the **singular value spectrum**:
- **Rapid decay** ($\sigma_i \propto i^{-\alpha}$ for $\alpha > 1$): a small $k$ captures most of the energy. Natural images, text embeddings, and weight matrices often exhibit this.
- **Slow decay** ($\sigma_i \approx \sigma_1$ for many $i$): the matrix is "effectively full rank" and low-rank approximation loses substantial information. Random matrices have this property.
- The **energy fraction** $\mathrm{EVR}(k) = \sum_{i=1}^{k} \sigma_i^2 / \sum_i \sigma_i^2$ quantifies how well $A_k$ approximates $A$.
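The contrast between the two regimes is easy to demonstrate (a sketch; the constructed matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Rapid decay: build a matrix with σᵢ ∝ i⁻² via random orthogonal factors
Q1, _ = np.linalg.qr(rng.normal(size=(n, n)))
Q2, _ = np.linalg.qr(rng.normal(size=(n, n)))
s_fast = np.arange(1, n + 1) ** -2.0
A_fast = Q1 * s_fast @ Q2.T

# Slow decay: an i.i.d. Gaussian matrix is "effectively full rank"
A_slow = rng.normal(size=(n, n))

def evr(A, k):
    """Energy fraction captured by the top k singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s[:k]**2).sum() / (s**2).sum()

assert evr(A_fast, 10) > 0.99    # 10 components capture nearly everything
assert evr(A_slow, 10) < 0.25    # 10 components capture little
```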
**LoRA as approximate truncated SVD** [@hu2022lora]. LoRA parameterizes the weight update as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$. This constrains $\Delta W$ to have rank at most $r$. Unlike truncated SVD, LoRA does not require computing the SVD of the full weight matrix -- it learns the best rank-$r$ update directly via gradient descent. The Eckart--Young theorem guarantees that if the true optimal $\Delta W^*$ has rapidly decaying singular values, a small $r$ suffices.
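A minimal sketch of the LoRA parameterization (illustrative shapes, not the paper's training code; the factor $A$ is named `A_lora` to avoid clashing with the data matrix $A$ above):

```python
import numpy as np

d, r = 1024, 8
rng = np.random.default_rng(6)

W = rng.normal(size=(d, d))                 # frozen pretrained weight
B = np.zeros((d, r))                        # LoRA init: B = 0, so ΔW starts at 0
A_lora = rng.normal(size=(r, d)) * 0.01     # small random init for A

delta_W = B @ A_lora                        # ΔW = BA has rank ≤ r
assert np.linalg.matrix_rank(delta_W) <= r

# Trainable parameters: 2dr instead of d²
full_params, lora_params = d * d, 2 * d * r
assert lora_params / full_params < 0.02     # under 2% of a full update
```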
**PCA** finds the directions of maximum variance in a dataset $X \in \mathbb{R}^{m \times n}$ (assumed centered: $\bar{x} = 0$).
**Algorithm (via SVD):**

1. Center the data: $X \leftarrow X - \mathbf{1}\bar{x}^\top$
2. Compute the SVD: $X = U \Sigma V^\top$
3. The principal components are the columns of $V$ (the right singular vectors)
4. The variance explained by component $i$ is $\sigma_i^2 / (m-1)$
5. Project the data onto the top $k$ components: $Z = X V_k \in \mathbb{R}^{m \times k}$
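The five steps in NumPy (a minimal sketch on synthetic correlated data):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
m = X.shape[0]

Xc = X - X.mean(axis=0)                             # 1. center
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # 2. SVD
V = Vt.T                                            # 3. principal components
var = s**2 / (m - 1)                                # 4. variance per component
Z = Xc @ V[:, :2]                                   # 5. project onto k = 2

# Sanity checks: components are orthonormal, projected variances match σᵢ²/(m-1)
assert np.allclose(V.T @ V, np.eye(5))
assert np.allclose(Z.var(axis=0, ddof=1), var[:2])
```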
Equivalently, PCA eigendecomposes the sample covariance matrix:

$$C = \frac{1}{m-1} X^\top X = V \, \frac{\Sigma^\top \Sigma}{m-1} \, V^\top$$

The eigenvectors of $C$ are the right singular vectors of $X$, and the eigenvalues of $C$ are $\sigma_i^2 / (m-1)$.
**PCA is optimal in two equivalent senses:**
1. **Maximum variance:** the first $k$ principal components capture more variance than any other $k$-dimensional linear projection (by Eckart--Young).
2. **Minimum reconstruction error:** the projection $Z = X V_k$ followed by reconstruction $\hat{X} = Z V_k^\top$ minimizes $\|X - \hat{X}\|_F^2$ over all rank-$k$ linear projections.

These two views are equivalent because the total variance $\operatorname{tr}(C) = \sum_i \sigma_i^2 / (m-1)$ is fixed, so maximizing captured variance is the same as minimizing lost variance (the reconstruction error).
The explained variance ratio for $k$ components is

$$\mathrm{EVR}(k) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

A common heuristic is to choose $k$ such that $\mathrm{EVR}(k) \ge 0.95$ (95% of the variance).
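A sketch of this heuristic on synthetic data with four dominant directions (`np.searchsorted` on the cumulative EVR finds the smallest such $k$):

```python
import numpy as np

rng = np.random.default_rng(8)
# Four strong signal directions plus small isotropic noise
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 30)) \
    + 0.1 * rng.normal(size=(200, 30))
Xc = X - X.mean(axis=0)

s = np.linalg.svd(Xc, compute_uv=False)
evr_cum = np.cumsum(s**2) / (s**2).sum()
k = int(np.searchsorted(evr_cum, 0.95)) + 1   # smallest k with EVR(k) ≥ 0.95

assert evr_cum[k - 1] >= 0.95
assert k <= 5                                 # the 4 signal directions dominate
```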
**PCA on MNIST.** The MNIST dataset has 784 pixel features per image ($28 \times 28$). PCA reveals that $\sim 95\%$ of the variance is captured by the first $\sim 150$ components, meaning the data lives in a roughly 150-dimensional subspace of $\mathbb{R}^{784}$. The first two principal components often separate different digits visually, though the separation is imperfect (nonlinear methods like t-SNE or UMAP do better for visualization).
**PCA computation step by step.** Given data matrix $X \in \mathbb{R}^{4 \times 3}$ (4 samples, 3 features):
1. **Center:** $\bar{x} = \frac{1}{4}\sum_i x_i$, then $\tilde{X} = X - \mathbf{1}\bar{x}^\top$
2. **Covariance:** $C = \frac{1}{3}\tilde{X}^\top \tilde{X} \in \mathbb{R}^{3 \times 3}$
3. **Eigendecompose:** $C = V \Lambda V^\top$ with $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge 0$
4. **Project to $k = 2$:** $Z = \tilde{X} V_{:,1:2} \in \mathbb{R}^{4 \times 2}$
5. **Reconstruct:** $\hat{X} = Z V_{:,1:2}^\top + \mathbf{1}\bar{x}^\top$
6. **Error:** $\|\tilde{X} - Z V_{:,1:2}^\top\|_F^2 = 3\lambda_3$ (the discarded variance, since $\sigma_3^2 = (m-1)\lambda_3$)
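The same steps on a concrete $4 \times 3$ matrix (the numbers are illustrative):

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1]])              # 4 samples, 3 features

xbar = X.mean(axis=0)                        # 1. center
Xt = X - xbar

C = Xt.T @ Xt / 3                            # 2. covariance (m - 1 = 3)

lam, V = np.linalg.eigh(C)                   # 3. eigendecompose (ascending)
lam, V = lam[::-1], V[:, ::-1]               #    reorder descending

Z = Xt @ V[:, :2]                            # 4. project to k = 2
X_hat = Z @ V[:, :2].T + xbar                # 5. reconstruct

err = np.linalg.norm(Xt - Z @ V[:, :2].T, 'fro')**2   # 6. error
assert np.isclose(err, 3 * lam[2])           # discarded variance (m-1)·λ₃
```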
In practice, use `torch.pca_lowrank(X, q=k)` or `sklearn.decomposition.PCA(n_components=k)`, which compute the truncated SVD directly without forming the covariance matrix (more numerically stable, and more efficient when the number of features is large).
| Application | Technique | Notes |
| --- | --- | --- |
| Denoising | Discard small singular values (soft/hard thresholding) | Optimal threshold: $\sigma > \sigma_{\mathrm{med}} \sqrt{2 \log n}$ |
| LoRA | Low-rank weight updates $\Delta W = BA$ | Rank $r = 4$--$64$ typical |
| Recommender systems | Matrix factorization $R \approx U \Sigma V^\top$ | Netflix Prize approach |
| Word embeddings | SVD of PMI matrix | GloVe is equivalent to weighted SVD |
| Image compression | Truncated SVD per channel | Lossy compression with controllable quality |
| Pseudoinverse | $A^+ = V \Sigma^+ U^\top$ | Numerically stable least squares |
| Spectral clustering | Eigenvectors of graph Laplacian | SVD of normalized adjacency |
| Whitening | $X_w = X V \Sigma^{-1}$ | Decorrelate and normalize features |
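As one worked entry from this list of applications, the pseudoinverse construction $A^+ = V \Sigma^+ U^\top$ can be checked against NumPy's built-ins (a sketch for a full-rank $A$):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(10, 3))                # tall, full column rank
b = rng.normal(size=10)

# Pseudoinverse via SVD: invert only the nonzero singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T        # A⁺ = V Σ⁺ Uᵀ

x = A_pinv @ b                              # least-squares solution to Ax ≈ b
assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```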
**Randomized SVD.** For large matrices ($m, n \gg k$), computing the full SVD is $O(mn \min(m,n))$, which is prohibitive. Randomized algorithms [@halko2011finding] compute an approximate rank-$k$ SVD in $O(mnk)$ time:
1. Generate a random matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ (with oversampling $p \approx 5$--$10$)
2. Form $Y = A\Omega \in \mathbb{R}^{m \times (k+p)}$ (one pass over $A$)
3. Compute the QR factorization $Y = QR$
4. Form $B = Q^\top A \in \mathbb{R}^{(k+p) \times n}$ (second pass over $A$)
5. Compute the SVD of the small matrix $B = \hat{U} \Sigma V^\top$
6. Set $U = Q\hat{U}$
This is implemented in `sklearn.decomposition.TruncatedSVD` and `torch.svd_lowrank`.