Eigendecomposition

Eigendecomposition reveals the intrinsic structure of a linear transformation by finding the directions along which the transformation acts as pure scaling. These directions (eigenvectors) and scaling factors (eigenvalues) appear everywhere in ML: the principal components of PCA are eigenvectors of the covariance matrix, the convergence rate of gradient descent depends on eigenvalues of the Hessian, and the connectivity of a graph is encoded in eigenvalues of its Laplacian. This chapter develops eigendecomposition from first principles, proves the spectral theorem for symmetric matrices, and connects these ideas to optimization and learning.

Eigenvalues and Eigenvectors

An **eigenvector** $v \neq 0$ of a square matrix $A \in \mathbb{R}^{n \times n}$ satisfies:

$$Av = \lambda v$$

where $\lambda \in \mathbb{C}$ is the corresponding eigenvalue. The matrix $A$ acts on $v$ by pure scaling: it stretches $v$ by factor $|\lambda|$ and reverses direction if $\lambda < 0$.

Eigenvalues are found by solving the characteristic equation:

$$\det(A - \lambda I) = 0$$

This is a degree-$n$ polynomial in $\lambda$, so $A$ has exactly $n$ eigenvalues (counted with algebraic multiplicity, possibly complex).

**2x2 eigendecomposition.** For $A = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$:

$$\det(A - \lambda I) = (3 - \lambda)(2 - \lambda) = 0 \implies \lambda_1 = 3, \; \lambda_2 = 2$$

For $\lambda_1 = 3$: $(A - 3I)v = 0 \implies v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$. For $\lambda_2 = 2$: $(A - 2I)v = 0 \implies v_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$.

**What eigenvalues tell you:**

  • $|\lambda_i|$ = how much $A$ stretches along the $i$-th eigenvector direction
  • $\text{sign}(\lambda_i)$ = whether $A$ reverses direction along that eigenvector
  • $\sum_i \lambda_i = \text{tr}(A)$ = total scaling (sum of stretches)
  • $\prod_i \lambda_i = \det(A)$ = volume scaling factor
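These facts are easy to verify numerically; a quick NumPy check on the 2x2 example above (variable names here are my own):

```python
import numpy as np

# The upper-triangular example from this section
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)

# Eigenvalues are 3 and 2 (np.linalg.eig does not sort them)
print(sorted(eigvals.round(6)))  # [2.0, 3.0]

# Sum of eigenvalues = trace, product = determinant
print(np.isclose(eigvals.sum(), np.trace(A)))        # True
print(np.isclose(eigvals.prod(), np.linalg.det(A)))  # True
```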

Eigendecomposition

If $A \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors, it can be decomposed as:

$$A = V \Lambda V^{-1}$$

where $V = [v_1 \mid v_2 \mid \cdots \mid v_n]$ is the matrix of eigenvectors (columns) and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix of eigenvalues. Not every matrix is diagonalizable (e.g., $\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$), but symmetric matrices always are.

Why this is useful:

  1. Matrix powers: $A^k = V \Lambda^k V^{-1}$, since $\Lambda^k = \text{diag}(\lambda_1^k, \dots, \lambda_n^k)$
  2. Matrix exponential: $e^A = V e^\Lambda V^{-1} = V \text{diag}(e^{\lambda_1}, \dots, e^{\lambda_n}) V^{-1}$
  3. Matrix functions: $f(A) = V \text{diag}(f(\lambda_1), \dots, f(\lambda_n)) V^{-1}$ for any function $f$ defined on the eigenvalues
  4. Stability analysis: the linear dynamical system $x_{t+1} = Ax_t$ converges to zero iff all $|\lambda_i| < 1$

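The matrix-power identity can be checked numerically; a minimal sketch (variable names are mine):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigvals, V = np.linalg.eig(A)

# A^5 via repeated multiplication vs. V Λ^5 V^{-1}
direct = np.linalg.matrix_power(A, 5)
via_eig = V @ np.diag(eigvals**5) @ np.linalg.inv(V)

print(np.allclose(direct, via_eig))  # True
```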
**Power iteration.** The dominant eigenvector (corresponding to the largest $|\lambda|$) can be found by repeatedly multiplying by $A$ and normalizing:

$$v^{(t+1)} = \frac{A v^{(t)}}{\|A v^{(t)}\|}$$

This converges at rate $|\lambda_2 / \lambda_1|$ (the ratio of the two largest eigenvalue magnitudes). Power iteration is the basis of PageRank and is used internally in many SVD algorithms.
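A minimal power-iteration sketch (the function name and the symmetric test matrix are my own choices):

```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Estimate the dominant eigenpair by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    return v, v @ A @ v  # Rayleigh quotient estimates the eigenvalue

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v, lam = power_iteration(A)
# Largest eigenvalue of A is (5 + sqrt(5)) / 2 ≈ 3.618
print(round(lam, 3))  # 3.618
```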

Spectral Theorem

If $A \in \mathbb{R}^{n \times n}$ is **symmetric** ($A = A^\top$), then:
  1. All eigenvalues are real: $\lambda_i \in \mathbb{R}$
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal: $\lambda_i \neq \lambda_j \implies v_i^\top v_j = 0$
  3. $A$ has a complete set of orthonormal eigenvectors, giving the spectral decomposition:

$$A = Q \Lambda Q^\top = \sum_{i=1}^{n} \lambda_i q_i q_i^\top$$

where $Q = [q_1 \mid \cdots \mid q_n]$ is orthogonal ($Q^\top Q = I$) and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_n)$.

Proof sketch (eigenvalues are real). For symmetric $A$ with eigenvalue $\lambda$ and eigenvector $v$ (possibly complex): $\bar{v}^\top A v = \bar{v}^\top (\lambda v) = \lambda \|v\|^2$. Also $\bar{v}^\top A v = (A^\top \bar{v})^\top v = (A\bar{v})^\top v = (\bar{\lambda} \bar{v})^\top v = \bar{\lambda} \|v\|^2$. Therefore $\lambda = \bar{\lambda}$, so $\lambda \in \mathbb{R}$. $\square$
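All three claims can be verified with `np.linalg.eigh`, which exploits symmetry; a sketch on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # symmetrize

eigvals, Q = np.linalg.eigh(A)  # real eigenvalues, orthonormal eigenvectors

print(np.all(np.isreal(eigvals)))                  # True: real spectrum
print(np.allclose(Q.T @ Q, np.eye(4)))             # True: Q is orthogonal
print(np.allclose(Q @ np.diag(eigvals) @ Q.T, A))  # True: A = Q Λ Qᵀ

# Rank-one expansion: A = sum_i λ_i q_i q_iᵀ
expansion = sum(l * np.outer(q, q) for l, q in zip(eigvals, Q.T))
print(np.allclose(expansion, A))                   # True
```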

The spectral theorem is foundational for ML because the key matrices are symmetric:
  • Covariance matrices $\Sigma$: eigenvalues are variances along principal directions; eigenvectors are the principal components (PCA)
  • Hessians $\nabla^2 \mathcal{L}$: eigenvalues give curvature along each eigenvector direction; the largest eigenvalue is the Lipschitz constant $L$ of the gradient
  • Kernel matrices $K$: eigenvalues determine the kernel's effective dimensionality; Mercer's theorem guarantees PSD kernels have non-negative eigenvalues
  • Graph Laplacians $L = D - A$: eigenvalues encode graph connectivity; the second-smallest eigenvalue (the Fiedler value) measures algebraic connectivity

Positive Definite Matrices

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is:

  • **Positive definite (PD):** $x^\top A x > 0$ for all $x \neq 0$
  • **Positive semi-definite (PSD):** $x^\top A x \geq 0$ for all $x$
  • **Negative definite:** $x^\top A x < 0$ for all $x \neq 0$
  • **Indefinite:** $x^\top A x$ takes both positive and negative values
| Positive Definite ($A \succ 0$) | Positive Semi-Definite ($A \succeq 0$) |
| --- | --- |
| All eigenvalues $\lambda_i > 0$ | All eigenvalues $\lambda_i \geq 0$ |
| All leading principal minors $> 0$ | All principal minors $\geq 0$ |
| Unique Cholesky: $A = LL^\top$ | Cholesky exists (may have zeros on diagonal) |
| $A = R^\top R$ for full-rank $R$ | $A = R^\top R$ for some $R$ |
| $A^{-1}$ exists and is PD | $A^+$ (pseudoinverse) is PSD |
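The Cholesky characterization gives a cheap PD test in practice; a minimal sketch (the helper name is mine):

```python
import numpy as np

def is_positive_definite(A):
    """A symmetric matrix is PD iff its Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# XᵀX with full-column-rank X is PD
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
print(is_positive_definite(X.T @ X))  # True

# An indefinite symmetric matrix (eigenvalues 3 and -1) fails the test
print(is_positive_definite(np.array([[1.0, 2.0],
                                     [2.0, 1.0]])))  # False
```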
**PSD matrices in ML:**

  • **Covariance matrices** are always PSD (and PD if the data spans the full space)
  • $X^\top X$ is PSD for any $X$ (since $z^\top X^\top X z = \|Xz\|^2 \geq 0$)
  • **Kernel matrices** are PSD by Mercer's theorem
  • **Fisher information matrices** are PSD
  • The **Hessian** at a local minimum is PSD ($\nabla^2 \mathcal{L} \succeq 0$)

**Critical point classification via the Hessian.** At a critical point ($\nabla \mathcal{L} = 0$), the nature of the critical point is determined by the eigenvalues of the Hessian $H = \nabla^2 \mathcal{L}$:
| Eigenvalue Structure | Classification |
| --- | --- |
| All $\lambda_i > 0$ | Local minimum |
| All $\lambda_i < 0$ | Local maximum |
| Mixed signs | Saddle point |
| Some $\lambda_i = 0$ | Degenerate (need higher-order analysis) |

In high-dimensional loss landscapes (e.g., $P = 10^9$ parameters), the probability that a random critical point is a local minimum (all $P$ eigenvalues positive) is astronomically small. Most critical points are saddle points, which helps explain why gradient descent does not get stuck in local minima -- it slides off saddle points along the directions of negative curvature (Dauphin et al., 2014).
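The classification table translates directly into code; a sketch using `np.linalg.eigvalsh` (function name and tolerance are my own choices):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.any(np.abs(eigvals) < tol):
        return "degenerate"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

print(classify_critical_point(np.diag([2.0, 5.0])))   # local minimum
print(classify_critical_point(np.diag([2.0, -5.0])))  # saddle point
print(classify_critical_point(np.diag([0.0, 1.0])))   # degenerate
```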

Condition Number

The **condition number** of a matrix $A$ (with respect to a norm) measures the sensitivity of $Ax = b$ to perturbations. For the 2-norm:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{\lambda_{\max}}{\lambda_{\min}} \quad \text{(for symmetric PD)}$$

A perturbation $\delta b$ in the right-hand side causes a relative error bounded by $\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|}$.
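The bound can be checked on a small ill-conditioned system (the matrix and perturbation here are illustrative):

```python
import numpy as np

A = np.diag([1.0, 1000.0])  # symmetric PD, κ(A) = 1000
b = np.array([1.0, 1.0])
x = np.linalg.solve(A, b)

kappa = np.linalg.cond(A)
print(np.isclose(kappa, 1000.0))  # True

# Perturb the right-hand side and compare relative errors
db = np.array([1e-6, 0.0])
x_pert = np.linalg.solve(A, b + db)

rel_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
rel_b = np.linalg.norm(db) / np.linalg.norm(b)
print(rel_x <= kappa * rel_b)  # True: the bound holds
```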

| Condition | $\kappa(A)$ | GD Convergence | Numerical Effect |
| --- | --- | --- | --- |
| Well-conditioned | $\sim 1$--$10$ | Fast, direct path to minimum | Results accurate to $\sim \epsilon_{\text{mach}}$ |
| Moderately ill-conditioned | $10^3$--$10^6$ | Oscillation, slow progress | Lose 3--6 digits of accuracy |
| Severely ill-conditioned | $> 10^{16}$ | Effectively stuck | Results may be meaningless in FP64 |
**Preconditioning.** The convergence rate of gradient descent on a quadratic $f(x) = \frac{1}{2}x^\top Ax - b^\top x$ depends on the condition number $\kappa$ of $A$: GD converges as $\left(\frac{\kappa - 1}{\kappa + 1}\right)^t$. A **preconditioner** $M \approx A^{-1}$ transforms the system to $MAx = Mb$, reducing the effective condition number to $\kappa(MA) \approx 1$. Adam's per-parameter learning rates act as a diagonal preconditioner: dividing by $\sqrt{\hat{v}_t}$ approximates the inverse diagonal of the Hessian, reducing the effective condition number.
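A small experiment showing the $\kappa$-dependence of gradient descent on a quadratic (step sizes use the optimal $2/(\lambda_{\min} + \lambda_{\max})$ for a quadratic; all names are illustrative):

```python
import numpy as np

def gd_quadratic(A, b, steps, lr):
    """Gradient descent on f(x) = 0.5 xᵀAx - bᵀx (gradient: Ax - b)."""
    x = np.zeros_like(b)
    for _ in range(steps):
        x = x - lr * (A @ x - b)
    return x

b = np.array([1.0, 1.0])
A_good = np.diag([1.0, 2.0])    # κ = 2
A_bad = np.diag([1.0, 100.0])   # κ = 100

x_good = gd_quadratic(A_good, b, 50, 2 / (1 + 2))
x_bad = gd_quadratic(A_bad, b, 50, 2 / (1 + 100))

err_good = np.linalg.norm(x_good - np.linalg.solve(A_good, b))
err_bad = np.linalg.norm(x_bad - np.linalg.solve(A_bad, b))
print(err_good < 1e-12)  # True: κ = 2 converges in 50 steps
print(err_bad > 0.1)     # True: κ = 100 is still far from the minimum
```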

Rayleigh Quotient

For a symmetric matrix $A$ and nonzero vector $x$, the **Rayleigh quotient** is:

$$R(x) = \frac{x^\top A x}{x^\top x}$$

The Rayleigh quotient satisfies $\lambda_{\min} \leq R(x) \leq \lambda_{\max}$ for all $x \neq 0$, with equality at the corresponding eigenvectors. The extremal eigenvalues are:

$$\lambda_{\max} = \max_{x \neq 0} R(x), \qquad \lambda_{\min} = \min_{x \neq 0} R(x)$$

PCA can be formulated as maximizing the Rayleigh quotient: the first principal component is $v_1 = \arg\max_{\|v\|=1} v^\top \Sigma v = \arg\max_{\|v\|=1} R_\Sigma(v)$, which is the eigenvector of $\Sigma$ with the largest eigenvalue (direction of maximum variance).
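A sketch connecting the Rayleigh quotient to the top eigenpair of a covariance matrix (the data-generation details are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data (the mixing matrix is arbitrary)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 1.5],
                                              [0.0, 0.5]])
Sigma = np.cov(X, rowvar=False)

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

eigvals, Q = np.linalg.eigh(Sigma)  # eigh sorts eigenvalues ascending
v1 = Q[:, -1]                       # first principal component

# R is maximized at v1, where it equals λ_max
print(np.isclose(rayleigh(Sigma, v1), eigvals[-1]))  # True
x = rng.standard_normal(2)
print(rayleigh(Sigma, x) <= eigvals[-1] + 1e-12)     # True
```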

Gershgorin Circle Theorem

Every eigenvalue of a matrix $A \in \mathbb{R}^{n \times n}$ lies within at least one **Gershgorin disc**:

$$\lambda \in \bigcup_{i=1}^{n} D(A_{ii}, R_i) \quad \text{where} \quad R_i = \sum_{j \neq i} |A_{ij}|$$

Each disc $D(A_{ii}, R_i)$ is centered at the diagonal entry $A_{ii}$ with radius equal to the sum of absolute values of the off-diagonal entries in row $i$.

**Gershgorin in ML.** This theorem provides cheap eigenvalue bounds without computing the eigendecomposition. For a strictly diagonally dominant matrix (where $|A_{ii}| > R_i$ for all $i$), no disc contains zero, so every eigenvalue shares the sign of its diagonal entry -- hence a symmetric, strictly diagonally dominant matrix with positive diagonal entries is positive definite. This is useful for verifying that a kernel matrix or regularized Hessian is PD without computing eigenvalues.
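Computing the discs takes one pass over the matrix; a sketch (the helper name is mine):

```python
import numpy as np

def gershgorin_discs(A):
    """Return (center, radius) for each Gershgorin disc."""
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return list(zip(np.diag(A), radii))

# Strictly diagonally dominant with positive diagonal => PD by Gershgorin
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 2.0],
              [0.5, 2.0, 6.0]])

discs = gershgorin_discs(A)
eigvals = np.linalg.eigvalsh(A)

# Every eigenvalue lies in at least one disc
print(all(any(abs(l - c) <= r for c, r in discs) for l in eigvals))  # True
# All disc left edges are positive, so A is PD with no eigendecomposition
print(all(c - r > 0 for c, r in discs))  # True
```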

Notation Summary

| Symbol | Meaning |
| --- | --- |
| $\lambda, \lambda_i$ | Eigenvalue(s) |
| $v, v_i$ | Eigenvector(s) |
| $V$ | Matrix of eigenvectors (columns) |
| $\Lambda$ | Diagonal matrix of eigenvalues |
| $Q$ | Orthogonal matrix of eigenvectors ($Q^\top Q = I$) |
| $\kappa(A)$ | Condition number |
| $R(x)$ | Rayleigh quotient |
| PD, PSD | Positive definite, positive semi-definite |
| $A \succ 0$ | $A$ is positive definite |
| $A \succeq 0$ | $A$ is positive semi-definite |
| $H$ | Hessian matrix |
| $L$ | Cholesky factor ($A = LL^\top$) |