SVD and PCA

The Singular Value Decomposition (SVD) is arguably the most important factorization in all of applied mathematics. It generalizes eigendecomposition to rectangular matrices, provides the optimal low-rank approximation to any matrix, and is the mathematical engine behind PCA, latent semantic analysis, recommender systems, and LoRA. Every data scientist who has reduced dimensionality, every engineer who has compressed a model, and every researcher who has analyzed a weight matrix has used the SVD -- whether they knew it or not.

Singular Value Decomposition

**Every** matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily square) can be factored as:

$$A = U \Sigma V^\top$$

where:

  • $U \in \mathbb{R}^{m \times m}$ is orthogonal ($U^\top U = I$); its columns are the left singular vectors
  • $V \in \mathbb{R}^{n \times n}$ is orthogonal ($V^\top V = I$); its columns are the right singular vectors
  • $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ called singular values, where $r = \text{rank}(A)$

This decomposition always exists and the singular values are unique (though $U$ and $V$ may not be unique when singular values are repeated).
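As a quick sanity check of the definition, the sketch below (a random rectangular matrix, chosen only for illustration) verifies existence, orthogonality of the factors, and the ordering of the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # any rectangular matrix works

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # U: 5x5, s: (3,), Vt: 3x3

# Embed the singular values into a rectangular "diagonal" Sigma (5x3).
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)

assert np.allclose(U @ Sigma @ Vt, A)        # A = U Sigma V^T exactly
assert np.allclose(U.T @ U, np.eye(5))       # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))     # V orthogonal
assert np.all(s[:-1] >= s[1:])               # sigma_1 >= sigma_2 >= ...
```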

**Connection to eigendecomposition.** The singular values and vectors are derived from the eigendecompositions of $A^\top A$ and $AA^\top$:

  • $A^\top A = V \Sigma^\top \Sigma V^\top$: the right singular vectors $V$ are eigenvectors of $A^\top A$, and $\sigma_i^2$ are its eigenvalues
  • $AA^\top = U \Sigma \Sigma^\top U^\top$: the left singular vectors $U$ are eigenvectors of $AA^\top$

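This link can be checked numerically: the eigenvalues of $A^\top A$ should equal the squared singular values, and its eigenvectors should match the right singular vectors up to sign (the test matrix below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

_, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)           # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# Eigenvalues of A^T A are sigma_i^2.
assert np.allclose(eigvals, s**2)

# Each eigenvector agrees with the corresponding right singular vector
# up to a sign flip: |<v_i, v'_i>| = 1.
assert np.allclose(np.abs(np.sum(eigvecs * Vt.T, axis=0)), 1.0)
```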
**Geometric interpretation.** Every linear transformation $x \mapsto Ax$ is equivalent to three steps:
  1. Rotate (by $V^\top$): align with the principal axes of the transformation
  2. Scale (by $\Sigma$): stretch or compress along each axis by $\sigma_i$
  3. Rotate (by $U$): rotate the scaled result into the output space

This reveals the deep structure: every linear map is a rotation--scale--rotation. The singular values tell you how much each dimension is stretched or compressed.

**SVD of an image.** A grayscale image of size $m \times n$ can be decomposed via SVD. The rank-$k$ approximation $A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$ uses only $k(m + n + 1)$ numbers instead of $mn$. For a $1000 \times 1000$ image with $k = 50$: storage drops from $10^6$ to $\sim 10^5$ (10x compression) while retaining $\sum_{i=1}^{50}\sigma_i^2 / \sum_i \sigma_i^2$ of the image's energy.
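The storage and energy accounting above can be sketched directly; the "image" here is synthetic (a smooth gradient plus noise standing in for real pixel data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 1000)
img = np.outer(x, x) + 0.01 * rng.standard_normal((1000, 1000))  # 1000x1000 "image"

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 50
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # rank-k approximation

storage_full = img.size                           # mn = 10^6 numbers
storage_k = k * (1000 + 1000 + 1)                 # k(m + n + 1) ~ 10^5
energy = np.sum(s[:k]**2) / np.sum(s**2)          # retained energy fraction

assert storage_k < storage_full / 9               # roughly 10x compression
```

Because this synthetic image is dominated by a rank-1 gradient, `energy` is close to 1 even at $k = 50$; real photographs decay more slowly but still compress well.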

Compact and Truncated Forms

| Form | Dimensions | Storage | When to use |
|---|---|---|---|
| Full SVD | $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$ | $m^2 + mn + n^2$ | Theoretical analysis |
| Compact SVD | $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r \in \mathbb{R}^{r \times r}$, $V_r \in \mathbb{R}^{n \times r}$ | $(m+n)r + r$ | When $r \ll \min(m,n)$ |
| Truncated SVD | $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, $V_k \in \mathbb{R}^{n \times k}$ | $(m+n)k + k$ | Dimensionality reduction |

Low-Rank Approximation

The truncated SVD $A_k = U_k \Sigma_k V_k^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ is the **best rank-$k$ approximation** to $A$ in both the Frobenius and spectral norms:

$$A_k = \arg\min_{\text{rank}(B) \leq k} \|A - B\|_F, \qquad \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$$

$$A_k = \arg\min_{\text{rank}(B) \leq k} \|A - B\|_2, \qquad \|A - A_k\|_2 = \sigma_{k+1}$$

No other rank-$k$ matrix achieves a smaller error in either norm.

**Proof sketch (Frobenius).** Write $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$. The squared Frobenius norm is $\|A\|_F^2 = \sum_i \sigma_i^2$ (since the $u_i v_i^\top$ are orthonormal in the Frobenius inner product). The error of any rank-$k$ approximation satisfies $\|A - B\|_F^2 \geq \sum_{i=k+1}^r \sigma_i^2$ (by the min-max characterization of singular values). The truncated SVD achieves this bound with equality. $\square$
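Both error formulas are easy to verify numerically on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # truncated SVD

fro_err = np.linalg.norm(A - A_k, 'fro')         # Frobenius norm of the error
spec_err = np.linalg.norm(A - A_k, 2)            # spectral (operator 2-) norm

assert np.isclose(fro_err, np.sqrt(np.sum(s[k:]**2)))  # sqrt of discarded sigma_i^2
assert np.isclose(spec_err, s[k])                      # sigma_{k+1} (0-indexed s[k])
```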

**When is low-rank approximation effective?** The approximation quality depends on the **singular value spectrum**:
  • Rapid decay ($\sigma_i \propto i^{-\alpha}$ for $\alpha > 1$): a small $k$ captures most of the energy. Natural images, text embeddings, and weight matrices often exhibit this.
  • Slow decay ($\sigma_i \approx \sigma_1$ for many $i$): the matrix is "effectively full rank" and low-rank approximation loses substantial information. Random matrices have this property.
  • Energy fraction $\text{EVR}(k) = \sum_{i=1}^k \sigma_i^2 / \sum_i \sigma_i^2$ quantifies how well $A_k$ approximates $A$.

**LoRA as approximate truncated SVD** [@hu2022lora]. LoRA parameterizes the weight update as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$. This constrains $\Delta W$ to have rank at most $r$. Unlike truncated SVD, LoRA does not require computing the SVD of the full weight matrix -- it learns the best rank-$r$ update directly via gradient descent. The Eckart--Young theorem guarantees that if the true optimal $\Delta W^*$ has rapidly decaying singular values, a small $r$ suffices.
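A minimal sketch of the rank constraint in the LoRA parameterization (the sizes $d = 64$, $r = 4$ and the initialization scale are illustrative, not prescribed by the paper):

```python
import numpy as np

d, r = 64, 4
rng = np.random.default_rng(4)
B = rng.standard_normal((d, r)) * 0.01   # one factor kept near zero at init
A = rng.standard_normal((r, d))

delta_W = B @ A                          # the low-rank weight update

# The product of a d x r and an r x d matrix has rank at most r.
assert np.linalg.matrix_rank(delta_W) == r

# Parameter savings versus a dense d x d update.
params_full = d * d                      # 4096
params_lora = B.size + A.size            # 2 * d * r = 512
assert params_lora < params_full
```

In training, `B` and `A` would be the only learnable parameters, optimized by gradient descent rather than obtained from an SVD.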

Principal Component Analysis

**PCA** finds the directions of maximum variance in a dataset $X \in \mathbb{R}^{m \times n}$ (assumed centered: $\bar{x} = 0$).

Algorithm (via SVD):

  1. Center the data: $X \leftarrow X - \mathbf{1}\bar{x}^\top$
  2. Compute the SVD: $X = U \Sigma V^\top$
  3. The principal components are the columns of $V$ (right singular vectors)
  4. The variance explained by component $i$ is $\sigma_i^2 / (m - 1)$
  5. Project data onto $k$ components: $Z = X V_k \in \mathbb{R}^{m \times k}$
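The five steps above can be sketched directly in numpy (the data here is synthetic, generated only to exercise the algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 10))  # 100 samples
m = X.shape[0]

X = X - X.mean(axis=0)                             # 1. center
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # 2. SVD
V = Vt.T                                           # 3. PCs = columns of V
var_explained = s**2 / (m - 1)                     # 4. variance per component

k = 3
Z = X @ V[:, :k]                                   # 5. project onto k components

# Consistency check: the projected coordinates equal U_k Sigma_k.
assert np.allclose(Z, U[:, :k] * s[:k])
```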

Equivalently, PCA eigendecomposes the sample covariance matrix:

$$C = \frac{1}{m-1}X^\top X = V \frac{\Sigma^2}{m-1} V^\top$$

The eigenvectors of $C$ are the right singular vectors of $X$, and the eigenvalues of $C$ are $\sigma_i^2/(m-1)$.

**PCA is optimal in two equivalent senses:**
  1. Maximum variance: The first $k$ principal components capture more variance than any other $k$-dimensional linear projection (by Eckart--Young).
  2. Minimum reconstruction error: The projection $Z = XV_k$ followed by reconstruction $\hat{X} = ZV_k^\top$ minimizes $\|X - \hat{X}\|_F^2$ over all rank-$k$ linear projections.

These two views are equivalent because the total variance $\text{tr}(C) = \sum_i \sigma_i^2/(m-1)$ is fixed, so maximizing captured variance is the same as minimizing lost variance (reconstruction error).

Explained variance ratio for $k$ components:

$$\text{EVR}(k) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

A common heuristic is to choose $k$ such that $\text{EVR}(k) \geq 0.95$ (95% of the variance).
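Applying that heuristic is a one-liner over the cumulative spectrum; the data below (geometrically decaying feature scales) is synthetic, chosen so the spectrum decays:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 50)) * (0.8 ** np.arange(50))  # decaying scales
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)
evr = np.cumsum(s**2) / np.sum(s**2)               # EVR(1), EVR(2), ...

# Smallest k with EVR(k) >= 0.95 (evr is nondecreasing, so searchsorted works).
k = int(np.searchsorted(evr, 0.95) + 1)
assert evr[k - 1] >= 0.95
```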

**PCA on MNIST.** The MNIST dataset has 784 pixel features per image ($28 \times 28$). PCA reveals that $\sim 95\%$ of the variance is captured by the first $\sim 150$ components, meaning the data lives in a roughly 150-dimensional subspace of $\mathbb{R}^{784}$. The first two principal components often separate different digits visually, though the separation is imperfect (nonlinear methods like t-SNE or UMAP do better for visualization).

**PCA computation step by step.** Given data matrix $X \in \mathbb{R}^{4 \times 3}$ (4 samples, 3 features):
  1. Center: $\bar{x} = \frac{1}{4}\sum_i x_i$, then $\tilde{X} = X - \mathbf{1}\bar{x}^\top$
  2. Covariance: $C = \frac{1}{3}\tilde{X}^\top \tilde{X} \in \mathbb{R}^{3 \times 3}$
  3. Eigendecompose: $C = V \Lambda V^\top$ with $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq 0$
  4. Project to $k=2$: $Z = \tilde{X} V_{:,1:2} \in \mathbb{R}^{4 \times 2}$
  5. Reconstruct: $\hat{X} = Z V_{:,1:2}^\top + \mathbf{1}\bar{x}^\top$
  6. Error: $\|\tilde{X} - Z V_{:,1:2}^\top\|_F^2 = \sigma_3^2 = 3\lambda_3$ (the discarded variance, scaled by $m - 1 = 3$)
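The six steps above can be run end to end; the $4 \times 3$ matrix is arbitrary, chosen only for illustration:

```python
import numpy as np

X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 2.0, 0.0],
              [3.0, 1.0, 2.0]])

xbar = X.mean(axis=0)                        # 1. center
Xc = X - xbar
C = (Xc.T @ Xc) / 3                          # 2. covariance (m - 1 = 3)
lam, V = np.linalg.eigh(C)                   # 3. eigendecompose (ascending)
lam, V = lam[::-1], V[:, ::-1]               # sort eigenvalues descending

Z = Xc @ V[:, :2]                            # 4. project to k = 2
X_hat = Z @ V[:, :2].T + xbar                # 5. reconstruct

err = np.linalg.norm(Xc - Z @ V[:, :2].T, 'fro')**2   # 6. squared error
assert np.isclose(err, 3 * lam[2])           # = sigma_3^2 = (m - 1) * lambda_3
```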

In practice, use `torch.pca_lowrank(X, q=k)` or `sklearn.decomposition.PCA(n_components=k)`, which compute the truncated SVD directly without forming the covariance matrix (more numerically stable, and efficient when samples far outnumber features, $m \gg n$).

PCA Limitations and Extensions

| Limitation of PCA | Alternative | Key Idea |
|---|---|---|
| Only captures linear structure | Kernel PCA | Apply PCA in kernel-induced feature space |
| Sensitive to outliers | Robust PCA | Decompose $M = L + S$ (low-rank + sparse) |
| Assumes Gaussian-like data | ICA | Find statistically independent (not just uncorrelated) components |
| Variance $\neq$ information | Autoencoders | Learn nonlinear low-dimensional representations |
| Global structure only | t-SNE, UMAP | Preserve local neighborhood structure for visualization |

Applications of SVD in ML

| Application | How SVD/PCA Is Used | Computational Note |
|---|---|---|
| Dimensionality reduction | Project features onto top-$k$ PCs | Randomized SVD for large matrices |
| Data visualization | Project to 2D/3D via PCA | Often followed by t-SNE/UMAP |
| Noise reduction | Discard small singular values (soft/hard thresholding) | Optimal threshold: $\sigma > \sigma_{\text{med}} \sqrt{2\log n}$ |
| LoRA | Low-rank weight updates $\Delta W = BA$ | Rank $r = 4$--$64$ typical |
| Recommender systems | Matrix factorization $R \approx U \Sigma V^\top$ | Netflix Prize approach |
| Word embeddings | SVD of PMI matrix | GloVe is equivalent to weighted SVD |
| Image compression | Truncated SVD per channel | Lossy compression with controllable quality |
| Pseudoinverse | $A^+ = V \Sigma^+ U^\top$ | Numerically stable least squares |
| Spectral clustering | Eigenvectors of graph Laplacian | SVD of normalized adjacency |
| Whitening | $X_w = X V \Sigma^{-1}$ | Decorrelate and normalize features |

**Randomized SVD.** For large matrices ($m, n \gg k$), computing the full SVD is $O(mn \min(m,n))$, which is prohibitive. Randomized algorithms [@halko2011finding] compute an approximate rank-$k$ SVD in $O(mnk)$ time:
  1. Generate a random matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ (with oversampling $p \approx 5$--$10$)
  2. Form $Y = A\Omega \in \mathbb{R}^{m \times (k+p)}$ (one pass over $A$)
  3. Compute QR: $Y = QR$
  4. Form $B = Q^\top A \in \mathbb{R}^{(k+p) \times n}$ (second pass over $A$)
  5. Compute SVD of the small matrix $B = \hat{U}\Sigma V^\top$
  6. Set $U = Q\hat{U}$
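The six steps above translate almost line for line into numpy; this is a sketch, not a production implementation (no power iterations, fixed oversampling):

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate rank-k SVD of A via random range sampling."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))    # 1. random test matrix
    Y = A @ Omega                              # 2. range sample (first pass)
    Q, _ = np.linalg.qr(Y)                     # 3. orthonormal basis for range(Y)
    B = Q.T @ A                                # 4. small matrix (second pass)
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)  # 5. small SVD
    U = Q @ U_hat                              # 6. lift back to R^m
    return U[:, :k], s[:k], Vt[:k]

# On an exactly rank-5 matrix the recovery is essentially exact.
rng = np.random.default_rng(7)
A = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 200))
U, s, Vt = randomized_svd(A, k=5)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```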

This is implemented in `sklearn.decomposition.TruncatedSVD` and `torch.svd_lowrank`.

Notation Summary

| Symbol | Meaning |
|---|---|
| $U$ | Left singular vectors (orthonormal columns) |
| $\Sigma$ | Diagonal matrix of singular values |
| $V$ | Right singular vectors (orthonormal columns) |
| $\sigma_i$ | $i$-th singular value ($\sigma_1 \geq \sigma_2 \geq \cdots$) |
| $u_i, v_i$ | $i$-th left/right singular vector |
| $r$ | Rank of the matrix |
| $k$ | Number of retained components |
| $A_k$ | Best rank-$k$ approximation |
| $C$ | Sample covariance matrix |
| EVR | Explained variance ratio |
| $A^+$ | Moore--Penrose pseudoinverse |