Matrix Operations

Matrices are not just arrays of numbers -- they are representations of linear transformations. Every operation on a matrix (multiplication, inversion, transposition, decomposition) has both an algebraic definition and a geometric interpretation. Understanding both is essential: the algebra tells you how to compute, and the geometry tells you what the computation means. This chapter covers the core operations on matrices that underlie all of computational linear algebra and, by extension, all of deep learning.

Matrix Multiplication

For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ is defined element-wise as:

$$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}$$

This requires $2mkn$ FLOPs ($mkn$ multiplications and $mkn$ additions). Matrix multiplication is not commutative ($AB \neq BA$ in general) but is associative ($(AB)C = A(BC)$) and distributive ($A(B+C) = AB + AC$).

**Computational complexity of matrix multiplication.** The naive algorithm computes $C = AB$ in $O(mkn)$ time. Strassen's algorithm reduces this to $O(n^{2.807})$ for square matrices, though its constant factor and numerical stability make it impractical for most ML workloads. On modern hardware, the practical bottleneck is memory bandwidth, not FLOPs: the matrices must be loaded from HBM to SRAM, and the ratio of computation to data movement (arithmetic intensity) determines whether the operation is compute-bound or memory-bound. For large matrices (the common case in ML), matrix multiplication is compute-bound and achieves near-peak throughput on GPUs with Tensor Cores.
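The element-wise definition can be checked directly against NumPy's `@` operator, along with the naive FLOP count (a minimal sketch; the shapes are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

C = A @ B                      # shape (m, n)
assert C.shape == (m, n)

# Entry-wise definition: C[i, j] = sum_l A[i, l] * B[l, j]
C_manual = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        C_manual[i, j] = sum(A[i, l] * B[l, j] for l in range(k))
assert np.allclose(C, C_manual)

# Naive cost: 2*m*k*n FLOPs (m*k*n multiplies + m*k*n adds)
flops = 2 * m * k * n
```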

Three Views of Matrix Multiplication

Understanding matrix multiplication from multiple perspectives is essential for ML:

| View | Description | Formula |
|---|---|---|
| Entry-wise | $C_{ij}$ is the dot product of row $i$ of $A$ and column $j$ of $B$ | $C_{ij} = a_i^\top b_j$ |
| Column-wise | Column $j$ of $C$ is a linear combination of the columns of $A$ | $C_{:,j} = A b_j = \sum_l B_{lj} A_{:,l}$ |
| Outer product | $C$ is a sum of rank-1 matrices | $C = \sum_{l=1}^{k} A_{:,l} B_{l,:}^\top$ |
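All three views compute the same product, which is easy to verify numerically (a minimal NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

C = A @ B

# Column-wise view: column j of C is A times column j of B,
# i.e. a linear combination of A's columns weighted by B[:, j].
C_cols = np.stack([A @ B[:, j] for j in range(B.shape[1])], axis=1)
assert np.allclose(C, C_cols)

# Outer-product view: C is a sum of k rank-1 matrices.
C_outer = sum(np.outer(A[:, l], B[l, :]) for l in range(A.shape[1]))
assert np.allclose(C, C_outer)
```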
**Attention as matrix multiplication.** Self-attention computes $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d}) \cdot V$.
  • $QK^\top$ (entry-wise view): entry $(i,j)$ is the dot-product similarity between query $i$ and key $j$
  • $\text{softmax}(QK^\top/\sqrt{d}) \cdot V$ (column-wise view): each output row is a weighted average of value vectors, where the weights are the attention probabilities
  • $\sum_j \alpha_{ij} v_j^\top$ (outer-product view): the output at position $i$ is built from rank-1 contributions from each value vector
A neural network layer $y = Wx + b$ is a matrix-vector product (or batched matrix multiplication $Y = XW^\top + \mathbf{1}b^\top$). The weight matrix $W$ defines a linear map: each row $w_i$ of $W$ computes one output feature as the dot product $w_i^\top x$. The transformation rotates, scales, and projects the input space to the output space.
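Scaled dot-product attention is just two matrix multiplications around a row-wise softmax. A minimal NumPy sketch (shapes are illustrative; real implementations batch over heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # weighted average of values

rng = np.random.default_rng(2)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((7, 8))
V = rng.standard_normal((7, 8))
out = attention(Q, K, V)
assert out.shape == (5, 8)
```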

Transpose and Symmetry

The **transpose** $A^\top$ swaps rows and columns: $(A^\top)_{ij} = A_{ji}$.

Key properties:

$$(AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A, \quad (A + B)^\top = A^\top + B^\top, \quad (cA)^\top = cA^\top$$

A square matrix $A$ is **symmetric** if $A = A^\top$, and **skew-symmetric** if $A = -A^\top$. Any square matrix can be uniquely decomposed as $A = \frac{1}{2}(A + A^\top) + \frac{1}{2}(A - A^\top)$ (symmetric + skew-symmetric parts).

Important symmetric matrices in ML:

  • **Covariance matrix:** $\Sigma = \frac{1}{n-1}(X - \bar{X})^\top(X - \bar{X})$ is symmetric and positive semi-definite
  • **Gram matrix:** $G = X X^\top$ where $G_{ij} = x_i^\top x_j$ (similarities between data points)
  • **Hessian:** $H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$ is symmetric (by Schwarz's theorem)
  • **Kernel matrix:** $K_{ij} = k(x_i, x_j)$ is symmetric and PSD for valid kernels
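The symmetric/skew-symmetric decomposition is a one-liner to verify (a minimal NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

S = 0.5 * (A + A.T)   # symmetric part
K = 0.5 * (A - A.T)   # skew-symmetric part

assert np.allclose(S, S.T)     # S is symmetric
assert np.allclose(K, -K.T)    # K is skew-symmetric
assert np.allclose(A, S + K)   # the decomposition recovers A exactly
```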

Inverse and Linear Systems

The **inverse** $A^{-1}$ of a square matrix $A \in \mathbb{R}^{n \times n}$ satisfies $A A^{-1} = A^{-1} A = I$. It exists iff $A$ is **non-singular** (equivalently: $\det(A) \neq 0$, $A$ is full rank, $\ker(A) = \{0\}$, all eigenvalues are nonzero).

Key properties: $(AB)^{-1} = B^{-1} A^{-1}$, $(A^\top)^{-1} = (A^{-1})^\top$, $\det(A^{-1}) = 1/\det(A)$.

**Never compute $A^{-1}$ explicitly.** To solve $Ax = b$:
| Factorization | Cost | When to use |
|---|---|---|
| LU decomposition ($A = LU$) | $\frac{2}{3}n^3$ | General square systems |
| Cholesky ($A = LL^\top$) | $\frac{1}{3}n^3$ | Symmetric positive definite (covariance, kernel) |
| QR decomposition ($A = QR$) | $\frac{4}{3}n^3$ | Least squares, better numerical stability |
| Iterative (CG, GMRES) | $O(kn)$ per iteration | Large sparse systems |

Cholesky is $2\times$ faster than LU and numerically more stable for SPD matrices. Always prefer factorization over inversion.
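The factorize-then-solve advice can be sketched in NumPy. For an SPD system we can solve via `np.linalg.solve` or via an explicit Cholesky factor and two triangular solves; the explicit-inverse route is shown only as the anti-pattern (a minimal sketch with an arbitrary SPD test matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Build an SPD matrix: M M^T is PSD, adding I makes it strictly positive definite
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
b = rng.standard_normal(n)

# Preferred: solve the system directly (LAPACK factorizes internally)
x_solve = np.linalg.solve(A, b)

# Cholesky route for SPD systems: A = L L^T, then two triangular solves
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)        # forward substitution: L y = b
x_chol = np.linalg.solve(L.T, y) # back substitution:    L^T x = y
assert np.allclose(x_solve, x_chol)

# Anti-pattern: explicit inversion (more FLOPs, worse rounding behavior)
x_inv = np.linalg.inv(A) @ b
assert np.allclose(x_solve, x_inv)
```

(`np.linalg.solve` does not exploit the triangular structure of `L`; in SciPy, `scipy.linalg.solve_triangular` or `cho_solve` would be the idiomatic choice.)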

For any matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily square or invertible), the **pseudoinverse** $A^+ \in \mathbb{R}^{n \times m}$ is the unique matrix satisfying:
  1. $AA^+A = A$
  2. $A^+AA^+ = A^+$
  3. $(AA^+)^\top = AA^+$
  4. $(A^+A)^\top = A^+A$

If $A = U\Sigma V^\top$ (SVD), then $A^+ = V\Sigma^+ U^\top$ where $\Sigma^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$.

For overdetermined systems ($m > n$, more equations than unknowns), $A^+ b$ gives the least-squares solution $\arg\min_x \|Ax - b\|^2$. For underdetermined systems ($m < n$), it gives the minimum-norm solution $\arg\min_x \|x\| \text{ s.t. } Ax = b$. This is the solution that L2-regularized gradient descent converges to in overparameterized neural networks.
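Both claims are easy to check numerically: `np.linalg.pinv` agrees with the least-squares solver on an overdetermined system, and the pseudoinverse can be built by hand from the SVD (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

# Overdetermined system: m > n, so A^+ b is the least-squares solution
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_pinv, x_lstsq)

# Pseudoinverse from the SVD: A^+ = V Sigma^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))
```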

Rank

The **rank** of $A \in \mathbb{R}^{m \times n}$ is the dimension of its column space (equivalently, its row space):

$$\text{rank}(A) = \dim(\text{col}(A)) = \dim(\text{row}(A))$$

Properties:

  • $\text{rank}(A) \leq \min(m, n)$
  • $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$
  • $\text{rank}(A + B) \leq \text{rank}(A) + \text{rank}(B)$
  • $\text{rank}(A^\top A) = \text{rank}(A A^\top) = \text{rank}(A)$
**Low-rank structure in ML.** Many matrices encountered in ML are approximately low-rank:
  • Weight updates during fine-tuning have low intrinsic rank (Hu et al., 2022). LoRA exploits this by parameterizing $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ with $r \ll d$, reducing parameters from $d^2$ to $2dr$.
  • Attention matrices $\text{softmax}(QK^\top/\sqrt{d})$ often have rapidly decaying singular values, motivating low-rank attention approximations.
  • Embeddings with $r$ latent factors naturally produce rank-$r$ matrices (e.g., matrix factorization in recommender systems).

Trace

The **trace** of a square matrix $A \in \mathbb{R}^{n \times n}$ is the sum of its diagonal elements:

$$\text{tr}(A) = \sum_{i=1}^{n} A_{ii}$$

Key properties:

  • Cyclic permutation: $\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)$ (but $\neq \text{tr}(BAC)$ in general)
  • Sum of eigenvalues: $\text{tr}(A) = \sum_i \lambda_i$
  • Frobenius inner product: $\text{tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij} = \langle A, B \rangle_F$
  • Frobenius norm: $\|A\|_F^2 = \text{tr}(A^\top A) = \sum_i \sigma_i^2$
  • Linearity: $\text{tr}(\alpha A + \beta B) = \alpha \text{tr}(A) + \beta \text{tr}(B)$
The cyclic property $\text{tr}(AB) = \text{tr}(BA)$ is used constantly in ML derivations. For example, the MSE loss can be written $\|Y - XW\|_F^2 = \text{tr}((Y-XW)^\top(Y-XW))$, and matrix-calculus identities for gradients are most easily derived in trace notation.
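The cyclic and Frobenius identities above hold for any conformable matrices, which NumPy confirms immediately (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# Cyclic permutation: tr(ABC) = tr(CAB) = tr(BCA)
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)
assert np.isclose(t1, t2) and np.isclose(t1, t3)

# Frobenius norm via the trace: ||A||_F^2 = tr(A^T A)
assert np.isclose(np.linalg.norm(A, 'fro')**2, np.trace(A.T @ A))
```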

Determinant

The **determinant** $\det(A)$ of a square matrix $A$ measures the signed volume scaling of the linear transformation. For $2 \times 2$: $\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$.

Key properties:

  • $\det(AB) = \det(A)\det(B)$
  • $\det(A^{-1}) = 1/\det(A)$
  • $\det(A^\top) = \det(A)$
  • $\det(\alpha A) = \alpha^n \det(A)$ for $A \in \mathbb{R}^{n \times n}$
  • $\det(A) = \prod_{i=1}^{n} \lambda_i$ (product of eigenvalues)
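These properties can all be spot-checked on a random matrix (a minimal sketch; note the eigenvalues of a general real matrix may be complex, so we compare against the real part of their product):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
assert np.isclose(np.linalg.det(2.0 * A), 2.0**n * np.linalg.det(A))

# det(A) is the product of the eigenvalues (complex conjugate pairs cancel)
eigs = np.linalg.eigvals(A)
assert np.isclose(np.prod(eigs).real, np.linalg.det(A))
```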
**Log-determinant in ML.** The multivariate Gaussian log-likelihood contains $\log\det(\Sigma)$:

$$\log p(x \mid \mu, \Sigma) = -\frac{1}{2}\left(n\log(2\pi) + \log\det(\Sigma) + (x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

Computing $\log\det(\Sigma)$ directly is numerically unstable (the determinant can overflow or underflow). Instead, use the Cholesky factorization $\Sigma = LL^\top$:

$$\log\det(\Sigma) = \log\det(LL^\top) = 2\log\det(L) = 2\sum_{i=1}^n \log L_{ii}$$

This is both numerically stable and efficient ($O(n^3/3)$ for the Cholesky factorization plus $O(n)$ for the sum).

**Matrix multiplication order matters for efficiency.** Matrix multiplication is associative, so $(AB)C = A(BC)$ and the result does not depend on the parenthesization -- but the computational cost does. For $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{n \times p}$:
  • $(AB)C$ costs $2mkn + 2mnp$ FLOPs
  • $A(BC)$ costs $2knp + 2mkp$ FLOPs

For $m = 1000$, $k = 10$, $n = 1000$, $p = 1$: $(AB)C$ costs $2 \times 10^7 + 2 \times 10^6 = 2.2 \times 10^7$, while $A(BC)$ costs $2 \times 10^4 + 2 \times 10^4 = 4 \times 10^4$ -- a 550x difference. In ML, this arises when computing $XWv$ for a data matrix $X$, weight matrix $W$, and a vector $v$ (e.g., in Hessian-vector products): always compute $(Wv)$ first, then $X(Wv)$.
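The FLOP counts and the cost ratio can be checked directly; NumPy's `multi_dot` will even pick the cheapest parenthesization automatically (a minimal sketch using the shapes from the example above):

```python
import numpy as np

rng = np.random.default_rng(10)
m, k, n, p = 1000, 10, 1000, 1
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((n, p))

flops_left = 2 * m * k * n + 2 * m * n * p    # (AB)C: 22,000,000
flops_right = 2 * k * n * p + 2 * m * k * p   # A(BC): 40,000
assert flops_left // flops_right == 550

# Both orders give the same result (associativity), at very different cost
assert np.allclose((A @ B) @ C, A @ (B @ C))

# multi_dot chooses the optimal evaluation order for a chain of matmuls
out = np.linalg.multi_dot([A, B, C])
assert np.allclose(out, A @ (B @ C))
```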

Special Matrix Types

| Type | Definition | Properties | ML Example |
|---|---|---|---|
| Diagonal | $A_{ij} = 0$ for $i \neq j$ | $A^{-1}$ is diagonal; $\det = \prod_i a_{ii}$ | Scaling; $\Lambda$ in eigendecomposition |
| Orthogonal | $Q^\top Q = QQ^\top = I$ | Preserves norms and angles; $\det = \pm 1$ | $U, V$ in SVD; rotation matrices |
| Symmetric | $A = A^\top$ | Real eigenvalues; orthogonal eigenvectors | Covariance, Hessian, kernel matrices |
| SPD | Symmetric $+$ all $\lambda_i > 0$ | Unique Cholesky; defines an inner product | Covariance, Fisher information |
| Sparse | Most entries are zero | Efficient storage and multiplication | Adjacency matrices, attention masks |
| Toeplitz | Constant along diagonals | Matmul via FFT in $O(n \log n)$ | 1D convolution |
| Block diagonal | Diagonal blocks, zeros elsewhere | Operations decompose per block | Multi-head attention parameters |

Notation Summary

| Symbol | Meaning |
|---|---|
| $A^\top$ | Transpose of $A$ |
| $A^{-1}$ | Inverse of $A$ |
| $A^+$ | Moore--Penrose pseudoinverse |
| $I$ or $I_n$ | $n \times n$ identity matrix |
| $\text{rank}(A)$ | Rank (dimension of column space) |
| $\text{tr}(A)$ | Trace (sum of diagonal elements) |
| $\det(A)$ | Determinant |
| $\sigma_i$ | Singular values |
| $\lambda_i$ | Eigenvalues |
| $\|A\|_F$ | Frobenius norm |
| $\|A\|_2$ | Spectral norm ($= \sigma_{\max}$) |
| $L, U$ | Lower/upper triangular factors (LU decomposition) |
| $Q, R$ | Orthogonal/upper triangular factors (QR decomposition) |
