Eigendecomposition
Eigendecomposition reveals the intrinsic structure of a linear transformation by finding the directions along which the transformation acts as pure scaling. These directions (eigenvectors) and scaling factors (eigenvalues) appear everywhere in ML: the principal components of PCA are eigenvectors of the covariance matrix, the convergence rate of gradient descent depends on eigenvalues of the Hessian, and the connectivity of a graph is encoded in eigenvalues of its Laplacian. This chapter develops eigendecomposition from first principles, proves the spectral theorem for symmetric matrices, and connects these ideas to optimization and learning.
Eigenvalues and Eigenvectors
where is the corresponding eigenvalue. The matrix acts on by pure scaling: it stretches by factor and reverses direction if .
Eigenvalues are found by solving the characteristic equation:
This is a degree- polynomial in , so has exactly eigenvalues (counted with algebraic multiplicity, possibly complex).
For : . For : .
Adjust the matrix entries to see how a matrix transforms the unit square. The blue and red vectors show where the standard basis vectors get mapped. Light blue grid lines reveal how the entire plane deforms.
The dashed circle shows the unit circle; the blue ellipse shows its image under . Eigenvectors (red, green) are the directions where acts as pure scaling. Adjust the matrix to see how eigenvectors and eigenvalues change. When the discriminant is negative, the eigenvalues become complex and the transformation involves rotation.
Eigendecomposition
where is the matrix of eigenvectors (columns) and is the diagonal matrix of eigenvalues. Not every matrix is diagonalizable (e.g., ), but symmetric matrices always are.
Why this is useful:
- Matrix powers: , since
- Matrix exponential:
- Matrix functions: Any function
- Stability analysis: A linear dynamical system converges iff all
Take , whose true eigenpairs are with and with . Starting from :
Step 1 (full guidance). , with , so and .
Step 2 (faded). Now apply to , take the norm, and renormalize: , , giving and .
Step 3 (your turn). Repeating once more yields and . The iterate is already converging to and the Rayleigh estimate to .
The error contracts at rate per step (the ratio of the two largest eigenvalue magnitudes), which is why three iterations suffice here. Power iteration is the basis of PageRank and is used internally in many SVD algorithms.
Spectral Theorem
- All eigenvalues are real:
- Eigenvectors corresponding to distinct eigenvalues are orthogonal:
- has a complete set of orthonormal eigenvectors, giving the spectral decomposition:
where is orthogonal () and .
Proof sketch (eigenvalues are real). For symmetric with eigenvalue and eigenvector (possibly complex): . Also . Therefore , so .
- Covariance matrices : eigenvalues are variances along principal directions; eigenvectors are the principal components (PCA)
- Hessians : eigenvalues give curvature along each eigenvector direction; the largest eigenvalue is the Lipschitz constant of the gradient (the smoothness constant)
- Kernel matrices : eigenvalues determine the kernel's effective dimensionality; Mercer's theorem guarantees PSD kernels have non-negative eigenvalues
- Graph Laplacians (where is the adjacency matrix): eigenvalues encode graph connectivity; the second-smallest eigenvalue (Fiedler value) measures algebraic connectivity. (Here and already denote the Cholesky factor and the generic input matrix, so we write and ; the Graph Neural Networks chapter, where no such clash arises, uses the standard .)
Positive Definite Matrices
| Positive Definite () | Positive Semi-Definite () |
|---|---|
| All eigenvalues | All eigenvalues |
| All leading minors | All principal minors |
| Unique Cholesky: | Cholesky exists (may have zeros on diagonal) |
| for full-rank | for some |
| exists and is PD | (pseudoinverse) is PSD |
The minors test is asymmetric between the two columns. For PD, Sylvester's criterion needs only the leading principal minors to be positive. For PSD, this is not enough: all principal minors (every diagonal submatrix, not just the leading ones) must be nonnegative. For example, has both leading minors equal to yet is not PSD.
| Eigenvalue Structure | Classification |
|---|---|
| All | Local minimum |
| All | Local maximum |
| Mixed signs | Saddle point |
| Some | Degenerate (need higher-order analysis) |
In high-dimensional loss landscapes (e.g., parameters), the probability that a random critical point is a local minimum (all eigenvalues positive) is astronomically small. Most critical points are saddle points, which helps explain why gradient descent does not get stuck in local minima: it slides off saddle points along the directions of negative curvature (Dauphin et al., 2014).
Condition Number
A perturbation in the right-hand side causes a relative error bounded by .
| Condition | GD Convergence | Numerical Effect | |
|---|---|---|---|
| Well-conditioned | to | Fast, direct path to minimum | Results accurate to |
| Moderately ill-conditioned | to | Oscillation, slow progress | Lose 3 to 6 digits of accuracy |
| Severely ill-conditioned | Effectively stuck | Results may be meaningless in FP64 |
Rayleigh Quotient
The Rayleigh quotient satisfies for all , with equality at the corresponding eigenvectors. The extremal eigenvalues are:
Gershgorin Circle Theorem
Each disc is centered at the diagonal entry with radius equal to the sum of absolute values of the off-diagonal entries in row .
Notation Summary
| Symbol | Meaning |
|---|---|
| Eigenvalue(s) | |
| Eigenvector(s) | |
| Matrix of eigenvectors (columns) | |
| Diagonal matrix of eigenvalues | |
| Orthogonal matrix of eigenvectors () | |
| Condition number | |
| Rayleigh quotient | |
| PD, PSD | Positive definite, positive semi-definite |
| is positive definite | |
| is positive semi-definite | |
| Hessian matrix | |
| Gradient Lipschitz (smoothness) constant | |
| Cholesky factor () | |
| Graph Laplacian ( adjacency, degree matrix) |