SVD and PCA
The Singular Value Decomposition (SVD) is arguably the most important factorization in all of applied mathematics. It generalizes eigendecomposition to rectangular matrices, provides the optimal low-rank approximation to any matrix, and is the mathematical engine behind PCA, latent semantic analysis, recommender systems, and LoRA. Every data scientist who has reduced dimensionality, every engineer who has compressed a model, and every researcher who has analyzed a weight matrix has used the SVD, whether they knew it or not.
Singular Value Decomposition
where:
- is orthogonal (), its columns are the left singular vectors
- is orthogonal (), its columns are the right singular vectors
- is diagonal with non-negative entries called singular values, where
This decomposition always exists and the singular values are unique (though and may not be unique when singular values are repeated).
Connection to eigendecomposition. The singular values and vectors are derived from the eigendecompositions of and :
- : the right singular vectors are eigenvectors of , and are its eigenvalues
- : the left singular vectors are eigenvectors of
- Rotate (by ): align with the principal axes of the transformation
- Scale (by ): stretch or compress along each axis by
- Rotate (by ): rotate the scaled result into the output space
This reveals the deep structure: every linear map is a rotation, then a scale, then a rotation. The singular values tell you how much each dimension is stretched or compressed.
Compact and Truncated Forms
In the compact and truncated forms, , , , and have orthonormal columns but are not square orthogonal matrices (since they are not square).
| Form | Dimensions | Storage | When to use |
|---|---|---|---|
| Full SVD | , , | Theoretical analysis | |
| Compact SVD | , , | When | |
| Truncated SVD | , , | Dimensionality reduction |
Low-Rank Approximation
No other rank- matrix achieves a smaller error in either norm.
Proof sketch (Frobenius). Write . The squared Frobenius norm is (since the are orthonormal in the Frobenius inner product). The error of any rank- approximation satisfies (by the min-max characterization of singular values). The truncated SVD achieves this bound with equality.
- Rapid decay ( for ): a small captures most of the energy. Natural images, text embeddings, and weight matrices often exhibit this.
- Slow decay ( for many ): the matrix is "effectively full rank" and low-rank approximation loses substantial information. Random matrices have this property.
- Energy fraction quantifies how well approximates .
Principal Component Analysis
Algorithm (via SVD):
- Center the data:
- Compute the SVD:
- The principal components are the columns of (right singular vectors)
- The variance explained by component is
- Project data onto components:
Equivalently, PCA eigendecomposes the sample covariance matrix: The eigenvectors of are the right singular vectors of , and the eigenvalues of are .
- Maximum variance: The first principal components capture more variance than any other -dimensional linear projection (by Eckart-Young).
- Minimum reconstruction error: The projection followed by reconstruction minimizes over all rank- linear projections.
These two views are equivalent because the total variance is fixed, so maximizing captured variance is the same as minimizing lost variance (reconstruction error).
Explained variance ratio for components:
A common heuristic is to choose such that (95% of the variance).
- Center. The column means are , so subtracting gives
Note that features 1 and 3 are now identical, so the data really lives in a 2-dimensional subspace.
- Covariance. :
- Eigendecompose. with eigenvalues , , and eigenvectors
The total variance is , so and : the first two components capture everything.
- Project to . , whose rows are :
- Reconstruct. recovers exactly.
- Error. : because , the discarded direction held no variance and the reconstruction is lossless.
In practice, use torch.pca_lowrank(X, q=k) or sklearn.decomposition.PCA(n_components=k), which compute the truncated SVD directly without forming the covariance matrix (more numerically stable and efficient for ).
PCA Limitations and Extensions
| Limitation of PCA | Alternative | Key Idea |
|---|---|---|
| Only captures linear structure | Kernel PCA | Apply PCA in kernel-induced feature space |
| Sensitive to outliers | Robust PCA | Decompose (low-rank + sparse) |
| Assumes Gaussian-like data | ICA | Find statistically independent (not just uncorrelated) components |
| Variance information | Autoencoders | Learn nonlinear low-dimensional representations |
| Global structure only | t-SNE, UMAP | Preserve local neighborhood structure for visualization |
Applications of SVD in ML
| Application | How SVD/PCA Is Used | Computational Note |
|---|---|---|
| Dimensionality reduction | Project features onto top- PCs | Randomized SVD for large matrices |
| Data visualization | Project to 2D/3D via PCA | Often followed by t-SNE/UMAP |
| Noise reduction | Discard small singular values (soft/hard thresholding) | Threshold tuned to the noise level (e.g. via singular-value hard thresholding) |
| LoRA | Low-rank weight updates | Rank to typical |
| Recommender systems | Matrix factorization | Netflix Prize approach |
| Word embeddings | SVD of PMI matrix | GloVe is equivalent to weighted SVD |
| Image compression | Truncated SVD per channel | Lossy compression with controllable quality |
| Pseudoinverse | Numerically stable least squares | |
| Spectral clustering | Eigenvectors of graph Laplacian | SVD of normalized adjacency |
| Whitening | Decorrelate and normalize features |
- Generate a random matrix (with oversampling to )
- Form (one pass over )
- Compute QR:
- Form (second pass over )
- Compute SVD of the small matrix
- Set
This is implemented in sklearn.decomposition.TruncatedSVD and torch.svd_lowrank.
Notation Summary
| Symbol | Meaning |
|---|---|
| Left singular vectors (orthonormal columns) | |
| Diagonal matrix of singular values | |
| Right singular vectors (orthonormal columns) | |
| -th singular value () | |
| -th left/right singular vector | |
| Rank of the matrix | |
| Number of retained components | |
| Best rank- approximation | |
| Sample covariance matrix | |
| EVR | Explained variance ratio |
| Moore-Penrose pseudoinverse |