Matrix Operations
Matrices are not just arrays of numbers -- they are representations of linear transformations. Every operation on a matrix (multiplication, inversion, transposition, decomposition) has both an algebraic definition and a geometric interpretation. Understanding both is essential: the algebra tells you how to compute, and the geometry tells you what the computation means. This chapter covers the core operations on matrices that underlie all of computational linear algebra and, by extension, all of deep learning.
Matrix Multiplication
For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $C = AB \in \mathbb{R}^{m \times p}$ has entries $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$. This requires $2mnp$ FLOPs ($mnp$ multiplications and $mnp$ additions). Matrix multiplication is not commutative ($AB \neq BA$ in general) but is associative ($(AB)C = A(BC)$) and distributive ($A(B + C) = AB + AC$).
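These properties are easy to check numerically; a minimal NumPy sketch (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

# Associativity: (AB)C == A(BC)
assert np.allclose((A @ B) @ C, A @ (B @ C))

# Distributivity: A(B + B2) == AB + AB2
B2 = rng.standard_normal((4, 5))
assert np.allclose(A @ (B + B2), A @ B + A @ B2)

# Non-commutativity: for square matrices, ST != TS in general
S = rng.standard_normal((3, 3))
T = rng.standard_normal((3, 3))
assert not np.allclose(S @ T, T @ S)
```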
Three Views of Matrix Multiplication
Understanding matrix multiplication from multiple perspectives is essential for ML:
| View | Description | Formula |
|---|---|---|
| Entry-wise | $c_{ij}$ is the dot product of row $i$ of $A$ and column $j$ of $B$ | $c_{ij} = \sum_k a_{ik} b_{kj}$ |
| Column-wise | Column $j$ of $C$ is a linear combination of columns of $A$ | $C_{:,j} = A B_{:,j} = \sum_k b_{kj} A_{:,k}$ |
| Outer product | $AB$ is a sum of rank-1 matrices | $AB = \sum_k A_{:,k} B_{k,:}$ |
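All three views compute the same product, which a short NumPy check makes concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = A @ B

# Entry-wise: c_ij is the dot product of row i of A and column j of B
entry = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(3)])

# Column-wise: column j of C is A applied to column j of B
cols = np.stack([A @ B[:, j] for j in range(2)], axis=1)

# Outer product: sum over the inner index k of rank-1 matrices
outer = sum(np.outer(A[:, k], B[k, :]) for k in range(4))

assert np.allclose(entry, C) and np.allclose(cols, C) and np.allclose(outer, C)
```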
In attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ exercises all three views:
- $QK^\top$ (entry-wise view): entry $(i, j)$ is the dot-product similarity between query $q_i$ and key $k_j$
- $PV$ with $P = \mathrm{softmax}(QK^\top / \sqrt{d_k})$ (column-wise view): each output row is a weighted average of value vectors, where the weights are the attention probabilities
- $PV = \sum_j P_{:,j}\, v_j^\top$ (outer product view): the output at position $i$ is built from rank-1 contributions $p_{ij} v_j$ from each value vector
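A minimal single-head attention sketch in NumPy (dimensions and the `softmax` helper are illustrative, not from a specific library):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d_k, d_v = 5, 8, 4                      # sequence length and head dims (arbitrary)
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = Q @ K.T / np.sqrt(d_k)                 # entry-wise view: similarity scores
P = softmax(S, axis=-1)                    # each row is a probability distribution
out = P @ V                                # rows are weighted averages of V's rows

assert np.allclose(P.sum(axis=-1), 1.0)

# Outer product view: same output as a sum of rank-1 contributions
out2 = sum(np.outer(P[:, j], V[j]) for j in range(n))
assert np.allclose(out, out2)
```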
Transpose and Symmetry
Key properties:
- $(A^\top)^\top = A$
- $(AB)^\top = B^\top A^\top$
- $(A + B)^\top = A^\top + B^\top$
- $A$ is symmetric if $A = A^\top$; for any $A$, both $A^\top A$ and $A A^\top$ are symmetric (and positive semidefinite)
Inverse and Linear Systems
Key properties: $(AB)^{-1} = B^{-1} A^{-1}$, $(A^\top)^{-1} = (A^{-1})^\top$, $(A^{-1})^{-1} = A$.
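These identities can be verified numerically (the diagonal shift just keeps the random matrices well conditioned):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # diagonally dominant, hence invertible
B = rng.standard_normal((4, 4)) + 4 * np.eye(4)

# (AB)^{-1} = B^{-1} A^{-1}  -- note the reversed order
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# (A^T)^{-1} = (A^{-1})^T
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)
```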
| Factorization | Cost | When to use |
|---|---|---|
| LU decomposition ($PA = LU$) | $\approx \tfrac{2}{3} n^3$ FLOPs | General square systems |
| Cholesky ($A = LL^\top$) | $\approx \tfrac{1}{3} n^3$ FLOPs | Symmetric positive definite (covariance, kernel) |
| QR decomposition ($A = QR$) | $\approx \tfrac{4}{3} n^3$ FLOPs | Least squares, better numerical stability |
| Iterative (CG, GMRES) | $O(\mathrm{nnz})$ per iteration | Large sparse systems |
Cholesky is about twice as fast as LU and numerically more stable for SPD matrices. Always prefer solving via a factorization over forming $A^{-1}$ explicitly: it is both cheaper and more accurate.
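A sketch of the two approaches in NumPy (note: `np.linalg.solve` does not exploit triangularity; in practice `scipy.linalg.cho_solve` or `solve_triangular` would be used for the triangular solves):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)      # SPD by construction
b = rng.standard_normal(n)

# Discouraged: form the inverse explicitly, then multiply
x_inv = np.linalg.inv(A) @ b

# Preferred: factor once (A = L L^T), then two triangular solves
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)        # forward substitution:  L y = b
x = np.linalg.solve(L.T, y)      # back substitution:     L^T x = y

assert np.allclose(x, x_inv)
assert np.allclose(A @ x, b)
```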
If $A = U \Sigma V^\top$ (SVD), then the Moore--Penrose pseudoinverse is $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$ (and transposes the shape).
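Building $A^+$ from the SVD and comparing against NumPy's built-in, for a tall full-column-rank matrix where $A^+ b$ is the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))                 # tall, full column rank (generically)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T          # Sigma^+: invert the nonzero singular values

assert np.allclose(A_pinv, np.linalg.pinv(A))

# A^+ b gives the least-squares solution of the overdetermined system Ax = b
b = rng.standard_normal(5)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(A_pinv @ b, x_ls)
```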
Rank
Properties:
- $\mathrm{rank}(A) \le \min(m, n)$ for $A \in \mathbb{R}^{m \times n}$
- $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$
- $\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$
- $\mathrm{rank}(A)$ equals the number of nonzero singular values

Low rank is pervasive in ML:
- Weight updates during fine-tuning have low intrinsic rank (Hu et al., 2022). LoRA exploits this by parameterizing $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$, reducing parameters from $dk$ to $r(d + k)$.
- Attention matrices often have rapidly decaying singular values, motivating low-rank attention approximations.
- Embeddings with $k$ latent factors naturally produce rank-$k$ matrices (e.g., matrix factorization in recommender systems).
Trace
Key properties:
- Cyclic permutation: $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ (but $\mathrm{tr}(ABC) \neq \mathrm{tr}(ACB)$ in general)
- Sum of eigenvalues: $\mathrm{tr}(A) = \sum_i \lambda_i$
- Frobenius inner product: $\langle A, B \rangle_F = \mathrm{tr}(A^\top B)$
- Frobenius norm: $\|A\|_F^2 = \mathrm{tr}(A^\top A)$
- Linearity: $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ and $\mathrm{tr}(cA) = c \, \mathrm{tr}(A)$
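The properties above, checked numerically:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Cyclic permutation
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# Sum of eigenvalues (the imaginary parts of complex eigenvalues cancel)
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)

# Frobenius norm via the trace
assert np.isclose(np.linalg.norm(A, "fro") ** 2, np.trace(A.T @ A))
```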
Determinant
Key properties:
- $\det(AB) = \det(A)\det(B)$, $\det(A^\top) = \det(A)$, $\det(A^{-1}) = 1/\det(A)$, and $\det(cA) = c^n \det(A)$ for $A \in \mathbb{R}^{n \times n}$
- $\det(A) = \prod_{i=1}^{n} \lambda_i$ (product of eigenvalues)
Computing $\det(A)$ directly is numerically unstable (the determinant can overflow or underflow). Instead, for an SPD matrix, use the Cholesky factorization $A = LL^\top$:

$$\log \det(A) = 2 \sum_{i=1}^{n} \log \ell_{ii}$$

This is both numerically stable and efficient ($O(n^3)$ for the Cholesky factorization plus $O(n)$ for the sum).
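A sketch of the log-determinant via Cholesky, cross-checked against NumPy's `slogdet` (which uses an LU-based approach):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
G = rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)      # SPD; det(A) itself would overflow at this size

L = np.linalg.cholesky(A)
logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log det(A) = 2 sum_i log l_ii

sign, logdet_ref = np.linalg.slogdet(A)
assert sign == 1.0                           # SPD matrices have positive determinant
assert np.isclose(logdet, logdet_ref)
```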
Associativity means the grouping of a product is a free choice, and the choice matters enormously for cost:

- $(AB)v$ costs $2n^3 + 2n^2$ FLOPs
- $A(Bv)$ costs $4n^2$ FLOPs

For $A, B \in \mathbb{R}^{n \times n}$, $v \in \mathbb{R}^n$, $n = 1000$: $(AB)v$ costs $\approx 2 \times 10^9$ FLOPs, while $A(Bv)$ costs $\approx 4 \times 10^6$ -- a 500x difference. In ML, this arises when computing $X^\top W v$ for a data matrix $X$, weight matrix $W$, and a vector $v$ (e.g., in Hessian-vector products): always compute $Wv$ first, then $X^\top (Wv)$.
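Both groupings give the same answer; only the FLOP count differs:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
v = rng.standard_normal(n)

x_slow = (A @ B) @ v     # ~2n^3 FLOPs: materializes the n x n product AB first
x_fast = A @ (B @ v)     # ~4n^2 FLOPs: only two matrix-vector products

assert np.allclose(x_slow, x_fast)
```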
Special Matrix Types
| Type | Definition | Properties | ML Example |
|---|---|---|---|
| Diagonal | $a_{ij} = 0$ for $i \neq j$ | $D_1 D_2$ is diagonal; $D^{-1} = \mathrm{diag}(1/d_{ii})$ | Scaling, $\Lambda$ in eigendecomp |
| Orthogonal | $Q^\top Q = Q Q^\top = I$ | Preserves norms and angles; $Q^{-1} = Q^\top$ | $U, V$ in SVD; rotation matrices |
| Symmetric | $A = A^\top$ | Real eigenvalues; orthogonal eigenvectors | Covariance, Hessian, kernel matrices |
| SPD | Symmetric, $x^\top A x > 0$ for all $x \neq 0$ | Unique Cholesky; defines an inner product | Covariance, Fisher information |
| Sparse | Most entries are zero | Efficient storage and multiplication | Adjacency matrices, attention masks |
| Toeplitz | Constant along diagonals | Matmul via FFT in $O(n \log n)$ | 1D convolution |
| Block diagonal | Diagonal blocks, zeros elsewhere | Operations decompose per block | Multi-head attention parameters |
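The Toeplitz/convolution connection from the table can be made concrete: sliding a kernel over a signal is exactly multiplication by a Toeplitz matrix (one whose rows are shifted copies of the kernel):

```python
import numpy as np

rng = np.random.default_rng(10)
kernel = rng.standard_normal(3)
x = rng.standard_normal(8)

# Build the Toeplitz matrix whose row i computes sum_k kernel[k] * x[i + k]
# (a "valid" correlation; each row is the previous one shifted right by one)
n_out = len(x) - len(kernel) + 1
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i + len(kernel)] = kernel

assert np.allclose(T @ x, np.correlate(x, kernel, mode="valid"))
```

The matrix is constant along its diagonals, so it never needs to be materialized; FFT-based methods exploit exactly this structure.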
Notation Summary
| Symbol | Meaning |
|---|---|
| $A^\top$ | Transpose of $A$ |
| $A^{-1}$ | Inverse of $A$ |
| $A^+$ | Moore--Penrose pseudoinverse |
| $I$ or $I_n$ | $n \times n$ identity matrix |
| $\mathrm{rank}(A)$ | Rank (dimension of column space) |
| $\mathrm{tr}(A)$ | Trace (sum of diagonal elements) |
| $\det(A)$ | Determinant |
| $\sigma_i$ | Singular values |
| $\lambda_i$ | Eigenvalues |
| $\|A\|_F$ | Frobenius norm |
| $\|A\|_2$ | Spectral norm ($\sigma_{\max}$) |
| $L$, $U$ | Lower/upper triangular (LU decomposition) |
| $Q$, $R$ | Orthogonal/upper triangular (QR decomposition) |