Linear algebra is the mathematical foundation of machine learning. Every operation in a neural network -- matrix multiplications in linear layers, dot products in attention, projections in dimensionality reduction -- is a linear algebra operation. The objects are vectors (representing data points, embeddings, gradients) and matrices (representing transformations, weight matrices, covariance structures). This chapter introduces these objects, the operations defined on them, and the geometric intuitions that make them powerful tools for ML.
A **vector space** $V$ over $\mathbb{R}$ is a set equipped with two operations -- vector addition and scalar multiplication -- satisfying the following axioms:
- **Closure:** $u + v \in V$ and $\alpha v \in V$ for all $u, v \in V$, $\alpha \in \mathbb{R}$
- **Associativity:** $(u + v) + w = u + (v + w)$
- **Commutativity:** $u + v = v + u$
- **Additive identity:** there exists $0 \in V$ such that $v + 0 = v$
- **Additive inverse:** for each $v$, there exists $-v$ such that $v + (-v) = 0$
- **Scalar distributivity:** $\alpha(u + v) = \alpha u + \alpha v$
- **Vector distributivity:** $(\alpha + \beta)v = \alpha v + \beta v$
- **Scalar associativity:** $\alpha(\beta v) = (\alpha\beta)v$
- **Scalar identity:** $1 \cdot v = v$
The standard example is $\mathbb{R}^n$ -- the space of $n$-tuples of real numbers.
In machine learning, a data point is a vector $x \in \mathbb{R}^n$ where $n$ is the number of features. A dataset of $m$ points is a matrix $X \in \mathbb{R}^{m \times n}$ where each row is a data point.
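These conventions map directly onto NumPy arrays. The sketch below (toy values, purely illustrative) stores a dataset as a row-per-point matrix and checks two of the vector-space axioms numerically:

```python
import numpy as np

# A toy dataset: m = 4 points with n = 3 features each.
# Each row of X is one data point (a vector in R^3).
X = np.array([
    [1.0, 2.0, 0.5],
    [0.0, 1.0, 1.5],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
m, n = X.shape  # (4, 3)

# Vector-space operations act componentwise on arrays:
u, v = X[0], X[1]
assert np.allclose(u + v, v + u)                       # commutativity
assert np.allclose(2.0 * (u + v), 2.0 * u + 2.0 * v)   # scalar distributivity
```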
A **subspace** $W \subseteq V$ is a subset that is itself a vector space under the same operations. Equivalently, $W$ is a subspace iff it is closed under addition and scalar multiplication: for all $u, w \in W$ and $\alpha \in \mathbb{R}$, $u + w \in W$ and $\alpha w \in W$. Every subspace contains the zero vector.
**Column space as subspace.** The column space of a matrix $A \in \mathbb{R}^{m \times n}$ is $\text{col}(A) = \{Ax : x \in \mathbb{R}^n\}$, the set of all linear combinations of the columns of $A$. This is a subspace of $\mathbb{R}^m$ with dimension equal to $\text{rank}(A)$. In linear regression, the predicted values $\hat{y} = X\hat{\theta}$ always lie in the column space of $X$.
The **dot product** (Euclidean inner product) of two vectors $x, y \in \mathbb{R}^n$ is:
$$x \cdot y = x^\top y = \sum_{i=1}^n x_i y_i = \|x\| \|y\| \cos\theta$$
where $\theta$ is the angle between them. More generally, an **inner product** $\langle \cdot, \cdot \rangle$ is any function $V \times V \to \mathbb{R}$ that is:
- **Bilinear:** linear in each argument
- **Symmetric:** $\langle x, y \rangle = \langle y, x \rangle$
- **Positive definite:** $\langle x, x \rangle > 0$ for $x \neq 0$
The dot product is the most fundamental operation in ML:
- **Cosine similarity:** $\cos\theta = \frac{x^\top y}{\|x\|\|y\|}$ measures angular similarity between embeddings. Used for retrieval, clustering, and nearest-neighbor search.
- **Attention scores:** $\text{score}(q, k) = q^\top k / \sqrt{d}$ is a scaled dot product between query and key vectors.
- **Kernel methods:** replace $x^\top y$ with a kernel function $k(x, y)$ that implicitly computes the dot product in a higher-dimensional feature space.
- **Linear layers:** $y = Wx + b$ computes one dot product per output dimension, between each row of $W$ and the input.
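A minimal NumPy sketch of these uses of the dot product (random illustrative vectors; the dimension `d` and layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)   # query vector
k = rng.normal(size=d)   # key vector

# The dot product three equivalent ways.
dot = q @ k
assert np.isclose(dot, np.dot(q, k))
assert np.isclose(dot, np.sum(q * k))

# Cosine similarity: the dot product of the normalized vectors.
cos = dot / (np.linalg.norm(q) * np.linalg.norm(k))

# Scaled dot-product attention score.
score = dot / np.sqrt(d)

# A linear layer y = Wx + b: row i of W dotted with x gives y_i.
W = rng.normal(size=(4, d))
b = rng.normal(size=4)
x = rng.normal(size=d)
y = W @ x + b
assert np.isclose(y[0], W[0] @ x + b[0])
```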
For a positive definite matrix $M$, the **Mahalanobis inner product** is $\langle x, y \rangle_M = x^\top M y$, and the induced norm is $\|x\|_M = \sqrt{x^\top M x}$. When $M = \Sigma^{-1}$ (inverse covariance), the Mahalanobis distance $\|x - \mu\|_{\Sigma^{-1}}$ measures how many "standard deviations" $x$ is from $\mu$, accounting for correlations between features.
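A small sketch of the Mahalanobis distance, assuming a hypothetical 2-feature covariance matrix `Sigma` with correlated features:

```python
import numpy as np

# Hypothetical 2-feature Gaussian with correlated features.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])          # positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Distance ||x - mu||_{Sigma^{-1}} = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

x = np.array([1.0, 1.0])
d_mah = mahalanobis(x, mu, Sigma_inv)
d_euc = np.linalg.norm(x - mu)
# x lies along a high-variance direction of Sigma, so its Mahalanobis
# distance is smaller than its Euclidean distance.
```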
A **norm** $\|\cdot\|: V \to \mathbb{R}_{\geq 0}$ measures vector magnitude. It must satisfy:
- **Positive definiteness:** $\|x\| = 0 \iff x = 0$
- **Homogeneity:** $\|\alpha x\| = |\alpha| \|x\|$
- **Triangle inequality:** $\|x + y\| \leq \|x\| + \|y\|$

The $L^p$ norm is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \geq 1$.
| Norm | Formula | Unit Ball Shape | ML Application |
|------|---------|-----------------|----------------|
| $L^0$ (counting) | $\sum_i \mathbb{1}[x_i \neq 0]$ | Cross-polytope corners | Sparsity (not a true norm) |
| $L^1$ (Manhattan) | $\sum_i \lvert x_i \rvert$ | Diamond (cross-polytope) | Lasso/L1 regularization, sparsity |
| $L^2$ (Euclidean) | $\sqrt{\sum_i x_i^2}$ | Circle (sphere) | Ridge/L2 regularization, distances |
| $L^\infty$ (max) | $\max_i \lvert x_i \rvert$ | Square (hypercube) | Adversarial perturbation bounds |
| Frobenius | $\sqrt{\sum_{i,j} a_{ij}^2}$ | -- | Matrix regularization, $\|A\|_F^2 = \text{tr}(A^\top A)$ |
| Spectral | $\sigma_{\max}(A)$ | -- | Spectral normalization, Lipschitz bounds |
**Norm equivalence.** In finite dimensions, all norms are equivalent up to constant factors: there exist $c, C > 0$ such that $c\|x\|_a \leq \|x\|_b \leq C\|x\|_a$. In $\mathbb{R}^n$: $\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\|x\|_2$ and $\|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\|x\|_\infty$. This means convergence in one norm implies convergence in all norms (in finite dimensions).
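The equivalence bounds above can be checked numerically (random illustrative vector):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = rng.normal(size=n)

l1 = np.linalg.norm(x, 1)
l2 = np.linalg.norm(x, 2)
linf = np.linalg.norm(x, np.inf)
l0 = np.count_nonzero(x)   # "L0": number of nonzeros, not a true norm

# Finite-dimensional norm equivalence bounds from the text:
assert l2 <= l1 <= np.sqrt(n) * l2
assert linf <= l2 <= np.sqrt(n) * linf
```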
Vectors $x$ and $y$ are **orthogonal** ($x \perp y$) if $x^\top y = 0$. A set of vectors $\{v_1, \dots, v_k\}$ is:
- **Orthogonal** if $v_i^\top v_j = 0$ for all $i \neq j$
- **Orthonormal** if additionally $\|v_i\| = 1$, i.e., $v_i^\top v_j = \delta_{ij}$ (Kronecker delta)
The **orthogonal projection** of $b$ onto a vector $a$ is the closest point to $b$ on the line spanned by $a$:
$$\text{proj}_a(b) = \frac{a^\top b}{a^\top a}\, a = \frac{a^\top b}{\|a\|^2}\, a$$

The residual $b - \text{proj}_a(b)$ is orthogonal to $a$ (this is the defining property of orthogonal projection).

More generally, the projection onto a subspace spanned by the columns of $A$:

- If $A$ has orthonormal columns ($A^\top A = I$): $\text{proj}(b) = AA^\top b$
- For general $A$: $\text{proj}(b) = A(A^\top A)^{-1} A^\top b$

The matrix $P = A(A^\top A)^{-1} A^\top$ is the **projection matrix**. It satisfies $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric).
**Linear regression as projection.** The normal equation $\hat{\theta} = (X^\top X)^{-1} X^\top y$ computes the parameters such that $\hat{y} = X\hat{\theta}$ is the orthogonal projection of $y$ onto the column space of $X$. The residual $y - \hat{y}$ is orthogonal to every column of $X$, meaning $X^\top(y - X\hat{\theta}) = 0$. This is the *least-squares* solution -- it minimizes $\|y - X\theta\|^2$.
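A sketch of this correspondence in NumPy, fitting least squares on random illustrative data and verifying the projection-matrix properties:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Least-squares fit (equivalent to the normal equations for full-rank X).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta

# The projection matrix P = X (X^T X)^{-1} X^T reproduces y_hat.
P = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(P @ y, y_hat)
assert np.allclose(P @ P, P)    # idempotent
assert np.allclose(P, P.T)      # symmetric

# Residual is orthogonal to every column of X: X^T (y - y_hat) = 0.
assert np.allclose(X.T @ (y - y_hat), 0.0)
```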
**Gram--Schmidt process.** Given linearly independent vectors $\{a_1, \ldots, a_k\}$, Gram--Schmidt produces orthonormal vectors $\{q_1, \ldots, q_k\}$ by iteratively subtracting projections:
$$\tilde{q}_j = a_j - \sum_{i=1}^{j-1} (q_i^\top a_j)\, q_i, \qquad q_j = \frac{\tilde{q}_j}{\|\tilde{q}_j\|}$$

This is the foundation of the QR decomposition ($A = QR$), which is used for solving linear systems, eigenvalue computation, and orthogonalizing weight matrices.
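A minimal sketch of classical Gram--Schmidt (not the numerically robust variant; in practice one would call `np.linalg.qr`):

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt on the columns of A (assumed linearly independent)."""
    m, k = A.shape
    Q = np.zeros((m, k))
    for j in range(k):
        q = A[:, j].copy()
        for i in range(j):
            q -= (Q[:, i] @ A[:, j]) * Q[:, i]   # subtract projection onto q_i
        Q[:, j] = q / np.linalg.norm(q)          # normalize
    return Q

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))
Q = gram_schmidt(A)

assert np.allclose(Q.T @ Q, np.eye(3))   # orthonormal columns
# Q spans the same subspace as A: projecting A onto col(Q) recovers A.
assert np.allclose(Q @ (Q.T @ A), A)
```

Classical Gram--Schmidt can lose orthogonality for ill-conditioned inputs; modified Gram--Schmidt or Householder QR (what `np.linalg.qr` uses) is preferred numerically.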
Vectors $\{v_1, \dots, v_k\}$ are **linearly independent** if $\sum_{i=1}^k c_i v_i = 0$ implies $c_1 = c_2 = \cdots = c_k = 0$. Equivalently, no vector in the set can be written as a linear combination of the others.
A **basis** for a vector space $V$ is a maximal linearly independent set that spans $V$. Every vector in $V$ can be uniquely written as a linear combination of basis vectors. The number of vectors in any basis is the **dimension** $\dim(V)$.
The **rank** of a matrix $A \in \mathbb{R}^{m \times n}$ equals the dimension of its column space (equivalently, its row space). Key facts:
- $\text{rank}(A) \leq \min(m, n)$
- $A$ is **full rank** iff $\text{rank}(A) = \min(m, n)$
- For square matrices: $A$ is invertible $\iff A$ is full rank $\iff \det(A) \neq 0$
- **Rank-nullity theorem:** $\text{rank}(A) + \text{nullity}(A) = n$, where $\text{nullity}(A) = \dim(\ker(A))$
The rank of a weight matrix $W \in \mathbb{R}^{m \times n}$ tells you the effective dimensionality of the transformation. LoRA [@hu2022lora] exploits the empirical observation that fine-tuning weight updates $\Delta W$ have low intrinsic rank: a rank-$r$ factorization $\Delta W = BA$ with $r \ll \min(m,n)$ captures most of the adaptation signal, reducing trainable parameters from $mn$ to $(m+n)r$.
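An illustrative sketch of the parameter savings (sizes are arbitrary; `delta_W` here is a random rank-$r$ matrix, not an actual fine-tuning update):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 64, 64, 4            # illustrative sizes; r << min(m, n)

# A rank-r update built from two small factors, as in LoRA: delta_W = B A.
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
delta_W = B @ A

assert np.linalg.matrix_rank(delta_W) == r

full_params = m * n            # parameters in a dense update
lora_params = (m + n) * r      # parameters in the factorized update
assert lora_params < full_params
```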
For $A \in \mathbb{R}^{m \times n}$ with rank $r$:
- **Column space** $\text{col}(A) \subseteq \mathbb{R}^m$: dimension $r$. The range of $A$: $\{Ax : x \in \mathbb{R}^n\}$
- **Row space** $\text{row}(A) = \text{col}(A^\top) \subseteq \mathbb{R}^n$: dimension $r$
- **Null space** $\ker(A) \subseteq \mathbb{R}^n$: dimension $n - r$. Solutions to $Ax = 0$
- **Left null space** $\ker(A^\top) \subseteq \mathbb{R}^m$: dimension $m - r$

These subspaces satisfy $\text{col}(A) \perp \ker(A^\top)$ and $\text{row}(A) \perp \ker(A)$. Together they partition $\mathbb{R}^m$ and $\mathbb{R}^n$ into orthogonal complements.
In ML, these subspaces appear naturally:
- The **column space** of $X$ contains all achievable predictions $\hat{y} = X\theta$.
- The **null space** of $X$ contains parameter directions that do not change the prediction: if $\theta_1 - \theta_2 \in \ker(X)$, then $X\theta_1 = X\theta_2$. Overparameterized models ($n > m$) have non-trivial null spaces, leading to infinitely many solutions with the same training loss.
- L2 regularization (in the limit of a vanishing penalty) selects the **minimum norm** solution from this family: $\theta^* = \arg\min_{\theta: X\theta = y} \|\theta\|$.
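A sketch of the overparameterized case on random illustrative data, using the pseudoinverse for the minimum-norm interpolating solution and the SVD to extract a null-space direction:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 6                    # overparameterized: more parameters than points
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Minimum-norm interpolating solution via the pseudoinverse.
theta_star = np.linalg.pinv(X) @ y
assert np.allclose(X @ theta_star, y)

# Null space of X from the SVD: right singular vectors beyond rank(X) = m.
_, _, Vt = np.linalg.svd(X)
z = Vt[m:].T @ rng.normal(size=n - m)   # a random element of ker(X)
assert np.allclose(X @ z, 0.0)

# Shifting by a null-space direction leaves the predictions unchanged...
theta_other = theta_star + z
assert np.allclose(X @ theta_other, X @ theta_star)
# ...and the pseudoinverse solution has the smallest norm in the family.
assert np.linalg.norm(theta_star) <= np.linalg.norm(theta_other)
```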