Eigendecomposition reveals the intrinsic structure of a linear transformation by finding the directions along which the transformation acts as pure scaling. These directions (eigenvectors) and scaling factors (eigenvalues) appear everywhere in ML: the principal components of PCA are eigenvectors of the covariance matrix, the convergence rate of gradient descent depends on eigenvalues of the Hessian, and the connectivity of a graph is encoded in eigenvalues of its Laplacian. This chapter develops eigendecomposition from first principles, proves the spectral theorem for symmetric matrices, and connects these ideas to optimization and learning.
## Eigenvalues and Eigenvectors
An **eigenvector** $v \neq 0$ of a square matrix $A \in \mathbb{R}^{n \times n}$ satisfies:
$$Av = \lambda v$$

where $\lambda \in \mathbb{C}$ is the corresponding **eigenvalue**. The matrix $A$ acts on $v$ by pure scaling: it stretches $v$ by a factor of $|\lambda|$ and reverses its direction if $\lambda < 0$.
Eigenvalues are found by solving the characteristic equation:
$$\det(A - \lambda I) = 0$$

This is a degree-$n$ polynomial in $\lambda$, so $A$ has exactly $n$ eigenvalues (counted with algebraic multiplicity, possibly complex).
**2x2 eigendecomposition.** For $A = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$:
$$\det(A - \lambda I) = (3 - \lambda)(2 - \lambda) = 0 \implies \lambda_1 = 3, \; \lambda_2 = 2$$

For $\lambda_1 = 3$: $(A - 3I)v = 0 \implies v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$. For $\lambda_2 = 2$: $(A - 2I)v = 0 \implies v_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$.
**What eigenvalues tell you:**
- $|\lambda_i|$ = how much $A$ stretches along the $i$-th eigenvector direction
- $\text{sign}(\lambda_i)$ = whether $A$ reverses direction along that eigenvector
- $\sum_i \lambda_i = \text{tr}(A)$ = total scaling (sum of stretches)
- $\prod_i \lambda_i = \det(A)$ = volume scaling factor
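The worked $2 \times 2$ example and the trace/determinant identities above can be checked numerically. A minimal sketch using NumPy (not part of the original text):

```python
import numpy as np

# The 2x2 example from the text: A = [[3, 1], [0, 2]]
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# Eigenvalues match the characteristic-equation roots {3, 2}
assert np.allclose(sorted(eigvals), [2.0, 3.0])

# Trace = sum of eigenvalues, determinant = product of eigenvalues
assert np.isclose(np.trace(A), eigvals.sum())
assert np.isclose(np.linalg.det(A), eigvals.prod())

# Each eigenvector column v satisfies A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```

Note that `np.linalg.eig` returns unit-norm eigenvectors, so the second eigenvector appears as $(-1, 1)^\top / \sqrt{2}$ up to sign.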
## Eigendecomposition
If $A \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors, it can be decomposed as:
$$A = V \Lambda V^{-1}$$

where $V = [v_1 \mid v_2 \mid \cdots \mid v_n]$ is the matrix of eigenvectors (as columns) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of eigenvalues. Not every matrix is diagonalizable (e.g., $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$), but symmetric matrices always are.
Why this is useful:
- Matrix powers: $A^k = V \Lambda^k V^{-1}$, since $\Lambda^k = \text{diag}(\lambda_1^k, \ldots, \lambda_n^k)$
- Matrix exponential: $e^A = V e^\Lambda V^{-1} = V \, \text{diag}(e^{\lambda_1}, \ldots, e^{\lambda_n}) \, V^{-1}$
- Matrix functions: more generally, $f(A) = V \, \text{diag}(f(\lambda_1), \ldots, f(\lambda_n)) \, V^{-1}$
- Stability analysis: the linear dynamical system $x_{t+1} = A x_t$ converges iff all $|\lambda_i| < 1$
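The matrix-power and matrix-exponential identities can be verified directly. A sketch using NumPy, assuming a generic (hence almost surely diagonalizable) random matrix; the Taylor-series check of $e^A$ is my own verification device, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))      # generic real matrix, diagonalizable w.p. 1

lam, V = np.linalg.eig(A)            # A = V diag(lam) V^{-1}
V_inv = np.linalg.inv(V)

# Matrix power: A^5 = V diag(lam^5) V^{-1}
# (V * lam**5 scales column j of V by lam_j^5, i.e. V @ diag(lam**5))
A5 = ((V * lam**5) @ V_inv).real
assert np.allclose(A5, np.linalg.matrix_power(A, 5))

# Matrix exponential: e^A = V diag(e^lam) V^{-1}
expA = ((V * np.exp(lam)) @ V_inv).real

# Cross-check e^A against the truncated Taylor series sum_k A^k / k!
S, term = np.eye(4), np.eye(4)
for k in range(1, 30):
    term = term @ A / k
    S += term
assert np.allclose(expA, S)
```

Taking `.real` is safe here: a real matrix can have complex-conjugate eigenvalue pairs, but the reassembled products are real up to roundoff.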
**Power iteration.** The dominant eigenvector (corresponding to the largest $|\lambda|$) can be found by repeatedly multiplying by $A$ and normalizing:
$$v^{(t+1)} = \frac{A v^{(t)}}{\|A v^{(t)}\|}$$

This converges at rate $|\lambda_2 / \lambda_1|$ (the ratio of the two largest eigenvalue magnitudes). Power iteration is the basis of PageRank and is used internally in many SVD algorithms.
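Power iteration is short enough to implement directly. A minimal sketch in NumPy (the helper name `power_iteration` is my own):

```python
import numpy as np

def power_iteration(A, num_iters=500, seed=0):
    """Estimate the dominant eigenpair of A by repeated multiply-and-normalize."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)          # v^(t+1) = A v^(t) / ||A v^(t)||
    lam = v @ A @ v                        # Rayleigh quotient (v is unit norm)
    return lam, v

# Symmetric test matrix; eigenvalues are (7 +/- sqrt(5)) / 2
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
assert np.allclose(A @ v, lam * v, atol=1e-8)  # (lam, v) is an eigenpair
```

Here $|\lambda_2/\lambda_1| \approx 0.52$, so the error shrinks by roughly half per iteration.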
## Spectral Theorem
If $A \in \mathbb{R}^{n \times n}$ is **symmetric** ($A = A^\top$), then:
- All eigenvalues are real: $\lambda_i \in \mathbb{R}$
- Eigenvectors corresponding to distinct eigenvalues are orthogonal: $\lambda_i \neq \lambda_j \implies v_i^\top v_j = 0$
- $A$ has a complete set of orthonormal eigenvectors, giving the **spectral decomposition**:

$$A = Q \Lambda Q^\top = \sum_{i=1}^n \lambda_i q_i q_i^\top$$

where $Q = [q_1 \mid \cdots \mid q_n]$ is orthogonal ($Q^\top Q = I$) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$.
**Proof sketch (eigenvalues are real).** For symmetric $A$ with eigenvalue $\lambda$ and eigenvector $v$ (possibly complex): $\bar{v}^\top A v = \bar{v}^\top (\lambda v) = \lambda \|v\|^2$. Also $\bar{v}^\top A v = (A^\top \bar{v})^\top v = (A \bar{v})^\top v = (\bar{\lambda} \bar{v})^\top v = \bar{\lambda} \|v\|^2$. Therefore $\lambda = \bar{\lambda}$, so $\lambda \in \mathbb{R}$. $\square$
The spectral theorem is foundational for ML because the key matrices are symmetric:
- Covariance matrices $\Sigma$: eigenvalues are variances along principal directions; eigenvectors are the principal components (PCA)
- Hessians $\nabla^2 \mathcal{L}$: eigenvalues give curvature along each eigenvector direction; the largest eigenvalue is the Lipschitz constant $L$ of the gradient
- Kernel matrices $K$: eigenvalues determine the kernel's effective dimensionality; Mercer's theorem guarantees PSD kernels have non-negative eigenvalues
- Graph Laplacians $L = D - A$: eigenvalues encode graph connectivity; the second-smallest eigenvalue (the Fiedler value) measures algebraic connectivity
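The three claims of the spectral theorem can all be checked on a random symmetric matrix. A sketch using NumPy's `eigh` (the routine specialized for symmetric/Hermitian matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                      # symmetrize to get A = A^T

lam, Q = np.linalg.eigh(A)             # eigh guarantees real lam, orthonormal Q

# 1. Eigenvalues of a real symmetric matrix are real (eigh returns float dtype)
assert lam.dtype == np.float64

# 2. Eigenvectors are orthonormal: Q^T Q = I
assert np.allclose(Q.T @ Q, np.eye(5), atol=1e-10)

# 3. Rank-one expansion: A = sum_i lam_i q_i q_i^T
A_rebuilt = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
assert np.allclose(A, A_rebuilt)
```

Using `eigh` rather than `eig` on symmetric matrices is both faster and numerically safer, since it exploits the structure the spectral theorem guarantees.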
## Positive Definite Matrices
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is:
- **Positive definite (PD):** $x^\top A x > 0$ for all $x \neq 0$
- **Positive semi-definite (PSD):** $x^\top A x \geq 0$ for all $x$
- **Negative definite:** $x^\top A x < 0$ for all $x \neq 0$
- **Indefinite:** $x^\top A x$ takes both positive and negative values
| Positive Definite ($A \succ 0$) | Positive Semi-Definite ($A \succeq 0$) |
|---|---|
| All eigenvalues $\lambda_i > 0$ | All eigenvalues $\lambda_i \geq 0$ |
| All leading principal minors $> 0$ | All principal minors $\geq 0$ |
| Unique Cholesky: $A = LL^\top$ | Cholesky exists (may have zeros on the diagonal) |
| $A = R^\top R$ for full-rank $R$ | $A = R^\top R$ for some $R$ |
| $A^{-1}$ exists and is PD | $A^+$ (pseudoinverse) is PSD |
**PSD matrices in ML:**
- **Covariance matrices** are always PSD (and PD if data spans the full space)
- $X^\top X$ is PSD for any $X$ (since $z^\top X^\top X z = \|Xz\|^2 \geq 0$)
- **Kernel matrices** are PSD by Mercer's theorem
- **Fisher information matrices** are PSD
- The **Hessian** at a local minimum is PSD ($\nabla^2 \mathcal{L} \succeq 0$)
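In practice, the cheapest PD test is attempting a Cholesky factorization, which succeeds exactly when the matrix is positive definite. A sketch illustrating this and the $X^\top X$ fact above (the helper name `is_positive_definite` is my own):

```python
import numpy as np

def is_positive_definite(A):
    """A symmetric matrix is PD iff its Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))

G = X.T @ X                    # Gram matrix: PSD for any X, PD when X has full column rank
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)   # PSD: no negative eigenvalues
assert is_positive_definite(G)                   # 100 samples in 5 dims: full rank

# Indefinite matrix (eigenvalues 3 and -1): not PD, so Cholesky fails
assert not is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]]))
```

This Cholesky test costs $O(n^3/3)$ flops and avoids computing any eigenvalues.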
**Critical point classification via the Hessian.** At a critical point ($\nabla \mathcal{L} = 0$), the nature of the critical point is determined by the eigenvalues of the Hessian $H = \nabla^2 \mathcal{L}$:
| Eigenvalue Structure | Classification |
|---|---|
| All $\lambda_i > 0$ | Local minimum |
| All $\lambda_i < 0$ | Local maximum |
| Mixed signs | Saddle point |
| Some $\lambda_i = 0$ | Degenerate (needs higher-order analysis) |
In high-dimensional loss landscapes (e.g., $P = 10^9$ parameters), the probability that a random critical point is a local minimum (all $P$ eigenvalues positive) is astronomically small. Most critical points are saddle points, which helps explain why gradient descent does not get stuck in local minima -- it slides off saddle points along the directions of negative curvature (Dauphin et al., 2014).
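The classification table above translates directly into code. A sketch (the helper name `classify_critical_point` is my own; the test functions are $f = x^2 + y^2$ and $f = x^2 - y^2$ at the origin):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian H."""
    lam = np.linalg.eigvalsh(H)
    if np.any(np.abs(lam) < tol):
        return "degenerate"            # zero curvature: need higher-order analysis
    if np.all(lam > 0):
        return "local minimum"
    if np.all(lam < 0):
        return "local maximum"
    return "saddle point"              # mixed signs

# f(x, y) = x^2 + y^2 at (0, 0): Hessian = diag(2, 2)
assert classify_critical_point(np.diag([2.0, 2.0])) == "local minimum"
# f(x, y) = x^2 - y^2 at (0, 0): Hessian = diag(2, -2)
assert classify_critical_point(np.diag([2.0, -2.0])) == "saddle point"
```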
## Condition Number
The **condition number** of a matrix $A$ (with respect to a norm) measures the sensitivity of $Ax = b$ to perturbations. For the 2-norm:
$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{\lambda_{\max}}{\lambda_{\min}} \quad \text{(for symmetric PD)}$$

A perturbation $\delta b$ in the right-hand side causes a relative error bounded by $\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|}$.
| Condition | $\kappa(A)$ | GD Convergence | Numerical Effect |
|---|---|---|---|
| Well-conditioned | $\sim 1$--$10$ | Fast, direct path to minimum | Results accurate to $\sim \epsilon_{\text{mach}}$ |
| Moderately ill-conditioned | $10^3$--$10^6$ | Oscillation, slow progress | Lose 3--6 digits of accuracy |
| Severely ill-conditioned | $> 10^{16}$ | Effectively stuck | Results may be meaningless in FP64 |
**Preconditioning.** The convergence rate of gradient descent on a quadratic $f(x) = \frac{1}{2}x^\top Ax - b^\top x$ depends on the condition number $\kappa$ of $A$: GD converges as $\left(\frac{\kappa - 1}{\kappa + 1}\right)^t$. A **preconditioner** $M \approx A^{-1}$ transforms the system to $MAx = Mb$, reducing the effective condition number to $\kappa(MA) \approx 1$. Adam's per-parameter learning rates act as a diagonal preconditioner: dividing by $\sqrt{\hat{v}_t}$ approximates the inverse diagonal of the Hessian, reducing the effective condition number.
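The effect of conditioning on gradient descent can be seen on a tiny diagonal quadratic. A sketch under my own toy setup ($\kappa = 100$, step size $1/\lambda_{\max}$; the exact preconditioner $M = A^{-1}$ is idealized, chosen to make $\kappa(MA) = 1$):

```python
import numpy as np

# f(x) = 1/2 x^T A x with an ill-conditioned diagonal Hessian (kappa = 100)
A = np.diag([100.0, 1.0])
eta = 1.0 / 100.0                     # step size 1/lambda_max keeps GD stable

x = np.array([1.0, 1.0])
for _ in range(200):
    x = x - eta * (A @ x)             # plain gradient step (grad f = A x)
slow_err = np.linalg.norm(x)          # flat direction shrinks only by 0.99 per step

# Exact preconditioner M = A^{-1}: effective condition number is 1
M = np.diag(1.0 / np.diag(A))
x = np.array([1.0, 1.0])
for _ in range(200):
    x = x - (M @ (A @ x))             # preconditioned step: MA = I, one-step convergence
fast_err = np.linalg.norm(x)

assert fast_err < 1e-12 < slow_err    # preconditioning removes the slow direction
```

After 200 plain steps the low-curvature coordinate has only decayed to $0.99^{200} \approx 0.13$, while the preconditioned iteration lands on the minimum immediately; Adam's $1/\sqrt{\hat{v}_t}$ scaling is a cheap diagonal approximation of the same idea.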
## Rayleigh Quotient
For a symmetric matrix $A$ and nonzero vector $x$, the **Rayleigh quotient** is:
$$R(x) = \frac{x^\top A x}{x^\top x}$$

The Rayleigh quotient satisfies $\lambda_{\min} \leq R(x) \leq \lambda_{\max}$ for all $x \neq 0$, with equality at the corresponding eigenvectors. The extremal eigenvalues are:

$$\lambda_{\max} = \max_{x \neq 0} R(x), \qquad \lambda_{\min} = \min_{x \neq 0} R(x)$$
PCA can be formulated as maximizing the Rayleigh quotient: the first principal component is $v_1 = \arg\max_{\|v\|=1} v^\top \Sigma v = \arg\max_{\|v\|=1} R_\Sigma(v)$, which is the eigenvector of $\Sigma$ with the largest eigenvalue (direction of maximum variance).
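Both the Rayleigh-quotient bounds and the PCA connection are easy to check on synthetic data. A sketch using anisotropic Gaussian data of my own construction (most variance deliberately placed along the first axis):

```python
import numpy as np

rng = np.random.default_rng(3)
# 2D data with std 5 along axis 0 and std 1 along axis 1
X = rng.standard_normal((1000, 2)) * np.array([5.0, 1.0])
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)               # sample covariance matrix

def rayleigh(A, x):
    return x @ A @ x / (x @ x)

lam, Q = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
v1 = Q[:, -1]                          # first principal component

# The Rayleigh quotient is maximized at the top eigenvector...
assert np.isclose(rayleigh(Sigma, v1), lam[-1])
# ...and stays within [lambda_min, lambda_max] for every direction
for _ in range(100):
    x = rng.standard_normal(2)
    assert lam[0] - 1e-10 <= rayleigh(Sigma, x) <= lam[-1] + 1e-10

# The top PC aligns with the high-variance axis (1, 0), up to sign
assert abs(v1 @ np.array([1.0, 0.0])) > 0.99
```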
## Gershgorin Circle Theorem
Every eigenvalue of a matrix $A \in \mathbb{R}^{n \times n}$ lies within at least one **Gershgorin disc**:
$$\lambda \in \bigcup_{i=1}^n D(A_{ii}, R_i) \quad \text{where} \quad R_i = \sum_{j \neq i} |A_{ij}|$$

Each disc $D(A_{ii}, R_i)$ is centered at the diagonal entry $A_{ii}$ with radius equal to the sum of absolute values of the off-diagonal entries in row $i$.
**Gershgorin in ML.** This theorem provides cheap eigenvalue bounds without computing the eigendecomposition. For a diagonally dominant matrix (where $|A_{ii}| > R_i$ for all $i$), Gershgorin guarantees all eigenvalues have the same sign as the diagonal entries -- hence the matrix is positive definite if all diagonal entries are positive and the matrix is diagonally dominant. This is useful for verifying that a kernel matrix or regularized Hessian is PD without computing eigenvalues.
## Notation Summary
| Symbol | Meaning |
|---|---|
| $\lambda, \lambda_i$ | Eigenvalue(s) |
| $v, v_i$ | Eigenvector(s) |
| $V$ | Matrix of eigenvectors (columns) |
| $\Lambda$ | Diagonal matrix of eigenvalues |
| $Q$ | Orthogonal matrix of eigenvectors ($Q^\top Q = I$) |
| $\kappa(A)$ | Condition number |
| $R(x)$ | Rayleigh quotient |
| PD, PSD | Positive definite, positive semi-definite |
| $A \succ 0$ | $A$ is positive definite |
| $A \succeq 0$ | $A$ is positive semi-definite |
| $H$ | Hessian matrix |
| $L$ | Cholesky factor ($A = LL^\top$) |