
# Vectors and Matrices

Linear algebra is the mathematical foundation of machine learning. Every operation in a neural network -- matrix multiplications in linear layers, dot products in attention, projections in dimensionality reduction -- is a linear algebra operation. The objects are vectors (representing data points, embeddings, gradients) and matrices (representing transformations, weight matrices, covariance structures). This chapter introduces these objects, the operations defined on them, and the geometric intuitions that make them powerful tools for ML.

## Vector Spaces

A **vector space** $V$ over $\mathbb{R}$ is a set equipped with two operations -- vector addition and scalar multiplication -- satisfying the following axioms:
1. Closure: $u + v \in V$ and $\alpha v \in V$ for all $u, v \in V$, $\alpha \in \mathbb{R}$
2. Associativity: $(u + v) + w = u + (v + w)$
3. Commutativity: $u + v = v + u$
4. Additive identity: there exists $0 \in V$ such that $v + 0 = v$
5. Additive inverse: for each $v$, there exists $-v$ such that $v + (-v) = 0$
6. Scalar distributivity: $\alpha(u + v) = \alpha u + \alpha v$
7. Vector distributivity: $(\alpha + \beta)v = \alpha v + \beta v$
8. Scalar associativity: $\alpha(\beta v) = (\alpha\beta)v$
9. Scalar identity: $1 \cdot v = v$

The standard example is $\mathbb{R}^n$ -- the space of $n$-tuples of real numbers.

In machine learning, a data point is a vector $x \in \mathbb{R}^n$ where $n$ is the number of features. A dataset of $m$ points is a matrix $X \in \mathbb{R}^{m \times n}$ where each row is a data point.
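As a concrete sketch of this convention (toy numbers, assuming NumPy is available):

```python
import numpy as np

# A toy dataset: m = 3 data points, n = 4 features. Each row is one point in R^4.
X = np.array([[1.0, 2.0, 0.5, -1.0],
              [0.0, 1.5, 2.2, 0.3],
              [3.1, -0.7, 1.0, 0.8]])
m, n = X.shape
x0 = X[0]  # the first data point, a vector in R^n
print(m, n, x0.shape)
```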

A **subspace** $W \subseteq V$ is a subset that is itself a vector space under the same operations. Equivalently, $W$ is a subspace iff it is closed under addition and scalar multiplication: for all $u, w \in W$ and $\alpha \in \mathbb{R}$, $u + w \in W$ and $\alpha w \in W$. Every subspace contains the zero vector.

**Column space as subspace.** The column space of a matrix $A \in \mathbb{R}^{m \times n}$ is $\text{col}(A) = \{Ax : x \in \mathbb{R}^n\}$, the set of all linear combinations of the columns of $A$. This is a subspace of $\mathbb{R}^m$ with dimension equal to $\text{rank}(A)$. In linear regression, the predicted values $\hat{y} = X\hat{\theta}$ always lie in the column space of $X$.

## Dot Product and Inner Products

The **dot product** (Euclidean inner product) of two vectors $x, y \in \mathbb{R}^n$ is:

$$x \cdot y = x^\top y = \sum_{i=1}^{n} x_i y_i = \|x\| \|y\| \cos\theta$$

where $\theta$ is the angle between them. More generally, an **inner product** $\langle \cdot, \cdot \rangle$ is any function $V \times V \to \mathbb{R}$ that is:

- Bilinear: linear in each argument
- Symmetric: $\langle x, y \rangle = \langle y, x \rangle$
- Positive definite: $\langle x, x \rangle > 0$ for $x \neq 0$

The dot product is the most fundamental operation in ML:
- Cosine similarity: $\cos\theta = \frac{x^\top y}{\|x\|\|y\|}$ measures angular similarity between embeddings. Used for retrieval, clustering, and nearest-neighbor search.
- Attention scores: $\text{score}(q, k) = q^\top k / \sqrt{d}$ is a scaled dot product between query and key vectors.
- Kernel methods: replace $x^\top y$ with a kernel function $k(x, y)$ that implicitly computes the dot product in a higher-dimensional feature space.
- Linear layers: $y = Wx + b$ computes $n$ dot products (one per output dimension) between the rows of $W$ and the input.

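The first two bullets are each a one-liner in NumPy; a minimal sketch (the names `q`, `k`, and `d` here are illustrative, not tied to any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy size)
q = rng.standard_normal(d)  # "query" vector
k = rng.standard_normal(d)  # "key" vector

# Cosine similarity: dot product of the two vectors, normalized by their lengths.
cos_sim = (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))

# Scaled dot-product attention score between one query and one key.
score = (q @ k) / np.sqrt(d)

print(cos_sim, score)
```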
For a positive definite matrix $M$, the **Mahalanobis inner product** is $\langle x, y \rangle_M = x^\top M y$, and the induced norm is $\|x\|_M = \sqrt{x^\top M x}$. When $M = \Sigma^{-1}$ (inverse covariance), the Mahalanobis distance $\|x - \mu\|_{\Sigma^{-1}}$ measures how many "standard deviations" $x$ is from $\mu$, accounting for correlations between features.
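A NumPy sketch of the Mahalanobis distance on toy correlated data (all values illustrative; the helper `mahalanobis` is defined here, not a library function):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D data with correlated features.
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """||x - mu||_{Sigma^{-1}} = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# A point displaced from the mean: its Mahalanobis distance accounts for the
# feature correlations, unlike the plain Euclidean distance.
x = mu + np.array([2.0, 1.5])
print(mahalanobis(x, mu, Sigma_inv), np.linalg.norm(x - mu))
```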

## Norms

A **norm** $\|\cdot\|: V \to \mathbb{R}_{\geq 0}$ measures vector magnitude. It must satisfy:
1. Positive definiteness: $\|x\| = 0 \iff x = 0$
2. Homogeneity: $\|\alpha x\| = |\alpha| \, \|x\|$
3. Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$

The $L_p$ norm is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \geq 1$.

| Norm | Formula | Unit Ball Shape | ML Application |
| --- | --- | --- | --- |
| $L_0$ (counting) | $\sum_i \mathbf{1}[x_i \neq 0]$ | Cross-polytope corners | Sparsity (not a true norm) |
| $L_1$ (Manhattan) | $\sum_i \lvert x_i \rvert$ | Diamond (cross-polytope) | Lasso/L1 regularization, sparsity |
| $L_2$ (Euclidean) | $\sqrt{\sum_i x_i^2}$ | Circle (sphere) | Ridge/L2 regularization, distances |
| $L_\infty$ (max) | $\max_i \lvert x_i \rvert$ | Square (hypercube) | Uniform error bounds, adversarial perturbations |
| Frobenius | $\sqrt{\sum_{i,j} a_{ij}^2}$ | -- | Matrix regularization, $\Vert A \Vert_F^2 = \operatorname{tr}(A^\top A)$ |
| Spectral | $\sigma_{\max}(A)$ | -- | Spectral normalization, Lipschitz bounds |
**Norm equivalence.** In finite dimensions, all norms are equivalent up to constant factors: there exist $c, C > 0$ such that $c\|x\|_a \leq \|x\|_b \leq C\|x\|_a$ for all $x$. In $\mathbb{R}^n$: $\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\|x\|_2$ and $\|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\|x\|_\infty$. This means convergence in one norm implies convergence in all norms (in finite dimensions).
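The equivalence bounds and the Frobenius-trace identity from the table can be checked numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
x = rng.standard_normal(n)

l1 = np.linalg.norm(x, 1)
l2 = np.linalg.norm(x, 2)
linf = np.linalg.norm(x, np.inf)

# Norm-equivalence bounds in R^n.
assert l2 <= l1 <= np.sqrt(n) * l2
assert linf <= l2 <= np.sqrt(n) * linf

# ||A||_F^2 = tr(A^T A); the spectral norm is the largest singular value.
A = rng.standard_normal((4, 3))
assert np.isclose(np.linalg.norm(A, "fro") ** 2, np.trace(A.T @ A))
assert np.isclose(np.linalg.norm(A, 2), np.linalg.svd(A, compute_uv=False)[0])
```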

## Orthogonality and Projection

Vectors $x$ and $y$ are **orthogonal** ($x \perp y$) if $x^\top y = 0$. A set of vectors $\{v_1, \dots, v_k\}$ is:

- **Orthogonal** if $v_i^\top v_j = 0$ for all $i \neq j$
- **Orthonormal** if additionally $\|v_i\| = 1$, i.e., $v_i^\top v_j = \delta_{ij}$ (Kronecker delta)

The **orthogonal projection** of $b$ onto a vector $a$ is the closest point to $b$ on the line spanned by $a$:

$$\text{proj}_a(b) = \frac{a^\top b}{a^\top a}\, a = \frac{a^\top b}{\|a\|^2}\, a$$

The residual $b - \text{proj}_a(b)$ is orthogonal to $a$ (this is the defining property of orthogonal projection).

More generally, for the projection onto a subspace spanned by the columns of $A$:

- If $A$ has orthonormal columns ($A^\top A = I$): $\text{proj}(b) = A A^\top b$
- For general $A$ with full column rank (so that $A^\top A$ is invertible): $\text{proj}(b) = A(A^\top A)^{-1}A^\top b$

The matrix $P = A(A^\top A)^{-1}A^\top$ is the **projection matrix**. It satisfies $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric).
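A short NumPy check of these properties on a random matrix (illustrative; a random Gaussian matrix has full column rank with probability one, so $A^\top A$ is invertible here):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

# Projection matrix onto col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ b  # projection of b onto col(A)

assert np.allclose(P @ P, P)          # idempotent
assert np.allclose(P, P.T)            # symmetric
assert np.allclose(A.T @ (b - p), 0)  # residual orthogonal to every column of A
```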

**Linear regression as projection.** The normal equation $\hat{\theta} = (X^\top X)^{-1} X^\top y$ computes the parameters such that $\hat{y} = X\hat{\theta}$ is the orthogonal projection of $y$ onto the column space of $X$. The residual $y - \hat{y}$ is orthogonal to every column of $X$, meaning $X^\top(y - X\hat{\theta}) = 0$. This is the *least-squares* solution -- it minimizes $\|y - X\theta\|^2$.

**Gram--Schmidt process.** Given linearly independent vectors $\{a_1, \ldots, a_k\}$, Gram--Schmidt produces orthonormal vectors $\{q_1, \ldots, q_k\}$ by iteratively subtracting projections:

$$\tilde{q}_j = a_j - \sum_{i=1}^{j-1} (q_i^\top a_j)\, q_i, \qquad q_j = \frac{\tilde{q}_j}{\|\tilde{q}_j\|}$$

This is the foundation of **QR decomposition** ($A = QR$), which is used for solving linear systems, eigenvalue computation, and orthogonalizing weight matrices.
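The update above translates directly into a few lines of NumPy; a sketch of classical Gram--Schmidt (for exposition only -- in practice, Householder-based `np.linalg.qr` is preferred for numerical stability):

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize the columns of A (assumed linearly independent)."""
    m, k = A.shape
    Q = np.zeros((m, k))
    for j in range(k):
        # Subtract the projections onto all previously built q_i at once.
        q = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = q / np.linalg.norm(q)
    return Q

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
Q = gram_schmidt(A)

assert np.allclose(Q.T @ Q, np.eye(3))  # orthonormal columns
# Agrees with np.linalg.qr up to per-column signs:
Q_ref, _ = np.linalg.qr(A)
assert np.allclose(np.abs(Q.T @ Q_ref), np.eye(3), atol=1e-8)
```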

## Linear Independence and Basis

Vectors $\{v_1, \dots, v_k\}$ are **linearly independent** if $\sum_{i=1}^k c_i v_i = 0$ implies $c_1 = c_2 = \cdots = c_k = 0$. Equivalently, no vector in the set can be written as a linear combination of the others.

A **basis** for a vector space $V$ is a linearly independent set that spans $V$ (equivalently, a maximal linearly independent set). Every vector in $V$ can be uniquely written as a linear combination of basis vectors. The number of vectors in any basis is the dimension $\dim(V)$.

The **rank** of a matrix $A \in \mathbb{R}^{m \times n}$ equals the dimension of its column space (equivalently, its row space). Key facts:
- $\text{rank}(A) \leq \min(m, n)$
- $A$ is **full rank** iff $\text{rank}(A) = \min(m, n)$
- For square matrices: $A$ is invertible $\iff$ $A$ is full rank $\iff$ $\det(A) \neq 0$
- Rank-nullity theorem: $\text{rank}(A) + \text{nullity}(A) = n$, where $\text{nullity}(A) = \dim(\ker(A))$

The rank of a weight matrix $W \in \mathbb{R}^{m \times n}$ tells you the effective dimensionality of the transformation. LoRA [@hu2022lora] exploits the empirical observation that fine-tuning weight updates $\Delta W$ have low intrinsic rank: a rank-$r$ factorization $\Delta W = BA$ with $r \ll \min(m,n)$ captures most of the adaptation signal, reducing trainable parameters from $mn$ to $(m+n)r$.
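A small NumPy illustration of the rank and parameter-count arithmetic (the factor shapes mirror the $\Delta W = BA$ factorization; the sizes are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 64, 64, 4

# A rank-r update built from two thin factors, as in a low-rank adaptation.
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
delta_W = B @ A

# The product of an (m x r) and an (r x n) matrix has rank at most r.
assert np.linalg.matrix_rank(delta_W) == r

full_params = m * n        # parameters in a dense update
lora_params = (m + n) * r  # parameters in the two factors
print(full_params, lora_params)
```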

## The Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$ with rank $r$:
1. **Column space** $\text{col}(A) \subseteq \mathbb{R}^m$: dimension $r$. The range of $A$: $\{Ax : x \in \mathbb{R}^n\}$
2. **Row space** $\text{row}(A) = \text{col}(A^\top) \subseteq \mathbb{R}^n$: dimension $r$
3. **Null space** $\ker(A) \subseteq \mathbb{R}^n$: dimension $n - r$. Solutions to $Ax = 0$
4. **Left null space** $\ker(A^\top) \subseteq \mathbb{R}^m$: dimension $m - r$

These subspaces satisfy $\text{col}(A) \perp \ker(A^\top)$ and $\text{row}(A) \perp \ker(A)$. Together they decompose the two ambient spaces into orthogonal complements: $\mathbb{R}^n = \text{row}(A) \oplus \ker(A)$ and $\mathbb{R}^m = \text{col}(A) \oplus \ker(A^\top)$.

In ML, these subspaces appear naturally:

- The **column space** of $X$ contains all achievable predictions $\hat{y} = X\theta$.
- The **null space** of $X$ contains parameter directions that do not change the predictions: if $\theta_1 - \theta_2 \in \ker(X)$, then $X\theta_1 = X\theta_2$. Overparameterized models ($n > m$) have non-trivial null spaces, leading to infinitely many solutions with the same training loss.
- L2 regularization (in the limit of vanishing regularization strength) selects the **minimum-norm** solution from this family: $\theta^* = \arg\min_{\theta : X\theta = y} \|\theta\|$.
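A NumPy sketch of the overparameterized case (toy sizes; `np.linalg.pinv` returns the minimum-norm interpolating solution, and the trailing right-singular vectors of $X$ span its null space):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 8  # overparameterized: more features than data points
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Minimum-norm interpolating solution via the pseudoinverse.
theta_min = np.linalg.pinv(X) @ y
assert np.allclose(X @ theta_min, y)

# The last rows of Vt span ker(X) (rank(X) = m < n almost surely here).
_, _, Vt = np.linalg.svd(X)
z = Vt[-1]
assert np.allclose(X @ z, 0)

# Adding a null-space direction leaves the predictions unchanged
# but strictly increases the parameter norm.
theta_alt = theta_min + 3.0 * z
assert np.allclose(X @ theta_alt, y)
assert np.linalg.norm(theta_alt) > np.linalg.norm(theta_min)
```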

## Notation Summary

| Symbol | Meaning |
| --- | --- |
| $\mathbb{R}^n$ | $n$-dimensional real vector space |
| $x^\top y$ | Dot product (inner product) |
| $\langle x, y \rangle$ | General inner product |
| $\Vert x \Vert_p$ | $L_p$ norm |
| $\Vert A \Vert_F$ | Frobenius norm |
| $\delta_{ij}$ | Kronecker delta: $1$ if $i = j$, $0$ otherwise |
| $\text{proj}_a(b)$ | Projection of $b$ onto $a$ |
| $P$ | Projection matrix (idempotent: $P^2 = P$) |
| $\text{col}(A)$ | Column space of $A$ |
| $\ker(A)$ | Null space (kernel) of $A$ |
| $\text{rank}(A)$ | Rank of $A$ |
| $\dim(V)$ | Dimension of vector space $V$ |