Matrix Operations

Matrices are not just arrays of numbers -- they are representations of linear transformations. Every operation on a matrix (multiplication, inversion, transposition, decomposition) has both an algebraic definition and a geometric interpretation. Understanding both is essential: the algebra tells you how to compute, and the geometry tells you what the computation means. This chapter covers the core operations on matrices that underlie all of computational linear algebra and, by extension, all of deep learning.

Matrix Multiplication

For $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, the product $C = AB \in \mathbb{R}^{m \times n}$ is defined element-wise as:

$$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}$$

This requires $2mkn$ FLOPs ($mkn$ multiplications and $mkn$ additions). Matrix multiplication is not commutative ($AB \neq BA$ in general) but is associative ($(AB)C = A(BC)$) and distributive ($A(B+C) = AB + AC$).

**Computational complexity of matrix multiplication.** The naive algorithm computes $C = AB$ in $O(mkn)$ time. Strassen's algorithm reduces this to $O(n^{2.807})$ for square matrices, though its constant factor and numerical stability make it impractical for most ML workloads. On modern hardware, the practical bottleneck is memory bandwidth, not FLOPs: the matrices must be loaded from HBM to SRAM, and the ratio of computation to data movement (arithmetic intensity) determines whether the operation is compute-bound or memory-bound. For large matrices (the common case in ML), matrix multiplication is compute-bound and achieves near-peak throughput on GPUs with Tensor Cores.
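The element-wise definition can be checked directly against NumPy's `@` operator, along with the naive FLOP count (a minimal sketch; the shapes are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

C = A @ B                      # shape (m, n)
assert C.shape == (m, n)

# Entry-wise definition: C[i, j] = sum_l A[i, l] * B[l, j]
C_manual = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        C_manual[i, j] = sum(A[i, l] * B[l, j] for l in range(k))
assert np.allclose(C, C_manual)

# Naive cost: 2*m*k*n FLOPs (m*k*n multiplies + m*k*n adds)
flops = 2 * m * k * n
```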

Three Views of Matrix Multiplication

Understanding matrix multiplication from multiple perspectives is essential for ML:

| View | Description | Formula |
|---|---|---|
| Entry-wise | $C_{ij}$ is the dot product of row $i$ of $A$ and column $j$ of $B$ | $C_{ij} = a_i^\top b_j$ |
| Column-wise | Column $j$ of $C$ is a linear combination of the columns of $A$ | $C_{:,j} = A b_j = \sum_l B_{lj} A_{:,l}$ |
| Outer product | $C$ is a sum of rank-1 matrices | $C = \sum_{l=1}^{k} A_{:,l} B_{l,:}^\top$ |
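All three views compute the same product, which is easy to verify numerically (a minimal NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

C = A @ B

# Column-wise view: column j of C is A times column j of B,
# i.e. a linear combination of A's columns weighted by B[:, j].
C_cols = np.stack([A @ B[:, j] for j in range(B.shape[1])], axis=1)
assert np.allclose(C, C_cols)

# Outer-product view: C is a sum of k rank-1 matrices.
C_outer = sum(np.outer(A[:, l], B[l, :]) for l in range(A.shape[1]))
assert np.allclose(C, C_outer)
```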
**Attention as matrix multiplication.** Self-attention computes $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d}) \cdot V$.
  • $QK^\top$ (entry-wise view): entry $(i,j)$ is the dot-product similarity between query $i$ and key $j$
  • $\text{softmax}(QK^\top/\sqrt{d}) \cdot V$ (column-wise view): each output row is a weighted average of value vectors, where the weights are the attention probabilities
  • $\sum_j \alpha_{ij} v_j^\top$ (outer-product view): the output at position $i$ is built from rank-1 contributions from each value vector
A neural network layer $y = Wx + b$ is a matrix-vector product (or batched matrix multiplication $Y = XW^\top + \mathbf{1}b^\top$). The weight matrix $W$ defines a linear map: each row $w_i$ of $W$ computes one output feature as the dot product $w_i^\top x$. The transformation rotates, scales, and projects the input space to the output space.
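Scaled dot-product attention is just two matrix multiplications around a row-wise softmax. A minimal NumPy sketch (shapes are illustrative; real implementations batch over heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # weighted average of values

rng = np.random.default_rng(2)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((7, 8))
V = rng.standard_normal((7, 8))
out = attention(Q, K, V)
assert out.shape == (5, 8)
```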

Transpose and Symmetry

The **transpose** $A^\top$ swaps rows and columns: $(A^\top)_{ij} = A_{ji}$.

Key properties:

$$(AB)^\top = B^\top A^\top, \quad (A^\top)^\top = A, \quad (A + B)^\top = A^\top + B^\top, \quad (cA)^\top = cA^\top$$

A square matrix $A$ is **symmetric** if $A = A^\top$, and **skew-symmetric** if $A = -A^\top$. Any square matrix can be uniquely decomposed as $A = \frac{1}{2}(A + A^\top) + \frac{1}{2}(A - A^\top)$ (symmetric + skew-symmetric parts).

Important symmetric matrices in ML:

  • **Covariance matrix:** $\Sigma = \frac{1}{n-1}(X - \bar{X})^\top(X - \bar{X})$ is symmetric and positive semi-definite
  • **Gram matrix:** $G = X X^\top$ where $G_{ij} = x_i^\top x_j$ (similarities between data points)
  • **Hessian:** $H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$ is symmetric (by Schwarz's theorem)
  • **Kernel matrix:** $K_{ij} = k(x_i, x_j)$ is symmetric and PSD for valid kernels
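The symmetric/skew-symmetric decomposition is a one-liner to verify (a minimal NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

S = 0.5 * (A + A.T)   # symmetric part
K = 0.5 * (A - A.T)   # skew-symmetric part

assert np.allclose(S, S.T)     # S is symmetric
assert np.allclose(K, -K.T)    # K is skew-symmetric
assert np.allclose(A, S + K)   # the decomposition recovers A exactly
```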

Inverse and Linear Systems

The **inverse** $A^{-1}$ of a square matrix $A \in \mathbb{R}^{n \times n}$ satisfies $A A^{-1} = A^{-1} A = I$. It exists iff $A$ is **non-singular** (equivalently: $\det(A) \neq 0$, $A$ is full rank, $\ker(A) = \{0\}$, all eigenvalues are nonzero).

Key properties: $(AB)^{-1} = B^{-1} A^{-1}$, $(A^\top)^{-1} = (A^{-1})^\top$, $\det(A^{-1}) = 1/\det(A)$.

**Never compute $A^{-1}$ explicitly.** To solve $Ax = b$:
| Factorization | Cost | When to use |
|---|---|---|
| LU decomposition ($A = LU$) | $\frac{2}{3}n^3$ | General square systems |
| Cholesky ($A = LL^\top$) | $\frac{1}{3}n^3$ | Symmetric positive definite (covariance, kernel) |
| QR decomposition ($A = QR$) | $\frac{4}{3}n^3$ | Least squares, better numerical stability |
| Iterative (CG, GMRES) | $O(kn)$ per iteration | Large sparse systems |

Cholesky is $2\times$ faster than LU and numerically more stable for SPD matrices. Always prefer factorization over inversion.
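The factorize-then-solve advice can be sketched in NumPy. For an SPD system we can solve via `np.linalg.solve` or via an explicit Cholesky factor and two triangular solves; the explicit-inverse route is shown only as the anti-pattern (a minimal sketch with an arbitrary SPD test matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Build an SPD matrix: M M^T is PSD, adding I makes it strictly positive definite
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
b = rng.standard_normal(n)

# Preferred: solve the system directly (LAPACK factorizes internally)
x_solve = np.linalg.solve(A, b)

# Cholesky route for SPD systems: A = L L^T, then two triangular solves
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)        # forward substitution: L y = b
x_chol = np.linalg.solve(L.T, y) # back substitution:    L^T x = y
assert np.allclose(x_solve, x_chol)

# Anti-pattern: explicit inversion (more FLOPs, worse rounding behavior)
x_inv = np.linalg.inv(A) @ b
assert np.allclose(x_solve, x_inv)
```

(`np.linalg.solve` does not exploit the triangular structure of `L`; in SciPy, `scipy.linalg.solve_triangular` or `cho_solve` would be the idiomatic choice.)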

For any matrix $A \in \mathbb{R}^{m \times n}$ (not necessarily square or invertible), the **pseudoinverse** $A^+ \in \mathbb{R}^{n \times m}$ is the unique matrix satisfying:
  1. $AA^+A = A$
  2. $A^+AA^+ = A^+$
  3. $(AA^+)^\top = AA^+$
  4. $(A^+A)^\top = A^+A$

If $A = U\Sigma V^\top$ (SVD), then $A^+ = V\Sigma^+ U^\top$ where $\Sigma^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$.

For overdetermined systems ($m > n$, more equations than unknowns), $A^+ b$ gives the least-squares solution $\arg\min_x \|Ax - b\|^2$. For underdetermined systems ($m < n$), it gives the minimum-norm solution $\arg\min_x \|x\| \text{ s.t. } Ax = b$. This is the solution that L2-regularized gradient descent converges to in overparameterized neural networks.
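Both claims are easy to check numerically: `np.linalg.pinv` agrees with the least-squares solver on an overdetermined system, and the pseudoinverse can be built by hand from the SVD (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

# Overdetermined system: m > n, so A^+ b is the least-squares solution
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_pinv, x_lstsq)

# Pseudoinverse from the SVD: A^+ = V Sigma^+ U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))
```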

Rank

The **rank** of $A \in \mathbb{R}^{m \times n}$ is the dimension of its column space (equivalently, its row space):

$$\text{rank}(A) = \dim(\text{col}(A)) = \dim(\text{row}(A))$$

Properties:

  • $\text{rank}(A) \leq \min(m, n)$
  • $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$
  • $\text{rank}(A + B) \leq \text{rank}(A) + \text{rank}(B)$
  • $\text{rank}(A^\top A) = \text{rank}(A A^\top) = \text{rank}(A)$
**Low-rank structure in ML.** Many matrices encountered in ML are approximately low-rank:
  • Weight updates during fine-tuning have low intrinsic rank (Hu et al., 2022). LoRA exploits this by parameterizing $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ with $r \ll d$, reducing parameters from $d^2$ to $2dr$.
  • Attention matrices $\text{softmax}(QK^\top/\sqrt{d})$ often have rapidly decaying singular values, motivating low-rank attention approximations.
  • Embeddings with $r$ latent factors naturally produce rank-$r$ matrices (e.g., matrix factorization in recommender systems).

Trace

The **trace** of a square matrix $A \in \mathbb{R}^{n \times n}$ is the sum of its diagonal elements:

$$\text{tr}(A) = \sum_{i=1}^{n} A_{ii}$$

Key properties:

  • Cyclic permutation: $\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)$ (but $\neq \text{tr}(BAC)$ in general)
  • Sum of eigenvalues: $\text{tr}(A) = \sum_i \lambda_i$
  • Frobenius inner product: $\text{tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij} = \langle A, B \rangle_F$
  • Frobenius norm: $\|A\|_F^2 = \text{tr}(A^\top A) = \sum_i \sigma_i^2$
  • Linearity: $\text{tr}(\alpha A + \beta B) = \alpha \text{tr}(A) + \beta \text{tr}(B)$
The cyclic property $\text{tr}(AB) = \text{tr}(BA)$ is used constantly in ML derivations. For example, the MSE loss can be written $\|Y - XW\|_F^2 = \text{tr}((Y-XW)^\top(Y-XW))$, and matrix-calculus identities for gradients are most easily derived in trace notation.
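The cyclic and Frobenius identities above hold for any conformable matrices, which NumPy confirms immediately (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# Cyclic permutation: tr(ABC) = tr(CAB) = tr(BCA)
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)
assert np.isclose(t1, t2) and np.isclose(t1, t3)

# Frobenius norm via the trace: ||A||_F^2 = tr(A^T A)
assert np.isclose(np.linalg.norm(A, 'fro')**2, np.trace(A.T @ A))
```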

Determinant

The **determinant** $\det(A)$ of a square matrix $A$ measures the signed volume scaling of the linear transformation. For $2 \times 2$: $\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$.

Key properties:

  • $\det(AB) = \det(A)\det(B)$
  • $\det(A^{-1}) = 1/\det(A)$
  • $\det(A^\top) = \det(A)$
  • $\det(\alpha A) = \alpha^n \det(A)$ for $A \in \mathbb{R}^{n \times n}$
  • $\det(A) = \prod_{i=1}^{n} \lambda_i$ (product of eigenvalues)
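These properties can all be spot-checked on a random matrix (a minimal sketch; note the eigenvalues of a general real matrix may be complex, so we compare against the real part of their product):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
assert np.isclose(np.linalg.det(2.0 * A), 2.0**n * np.linalg.det(A))

# det(A) is the product of the eigenvalues (complex conjugate pairs cancel)
eigs = np.linalg.eigvals(A)
assert np.isclose(np.prod(eigs).real, np.linalg.det(A))
```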
**Log-determinant in ML.** The multivariate Gaussian log-likelihood contains $\log\det(\Sigma)$:

$$\log p(x \mid \mu, \Sigma) = -\frac{1}{2}\left(n\log(2\pi) + \log\det(\Sigma) + (x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

Computing $\log\det(\Sigma)$ directly is numerically unstable (the determinant can overflow or underflow). Instead, use the Cholesky factorization $\Sigma = LL^\top$:

$$\log\det(\Sigma) = \log\det(LL^\top) = 2\log\det(L) = 2\sum_{i=1}^n \log L_{ii}$$

This is both numerically stable and efficient ($O(n^3/3)$ for the Cholesky factorization plus $O(n)$ for the sum).

**Matrix multiplication order matters for efficiency.** Matrix multiplication is associative, so $(AB)C = A(BC)$ and the result does not depend on the parenthesization -- but the computational cost does. For $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{n \times p}$:
  • $(AB)C$ costs $2mkn + 2mnp$ FLOPs
  • $A(BC)$ costs $2knp + 2mkp$ FLOPs

For $m = 1000$, $k = 10$, $n = 1000$, $p = 1$: $(AB)C$ costs $2 \times 10^7 + 2 \times 10^6 = 2.2 \times 10^7$, while $A(BC)$ costs $2 \times 10^4 + 2 \times 10^4 = 4 \times 10^4$ -- a 550x difference. In ML, this arises when computing $XWv$ for a data matrix $X$, weight matrix $W$, and a vector $v$ (e.g., in Hessian-vector products): always compute $(Wv)$ first, then $X(Wv)$.
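The FLOP counts and the cost ratio can be checked directly; NumPy's `multi_dot` will even pick the cheapest parenthesization automatically (a minimal sketch using the shapes from the example above):

```python
import numpy as np

rng = np.random.default_rng(10)
m, k, n, p = 1000, 10, 1000, 1
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((n, p))

flops_left = 2 * m * k * n + 2 * m * n * p    # (AB)C: 22,000,000
flops_right = 2 * k * n * p + 2 * m * k * p   # A(BC): 40,000
assert flops_left // flops_right == 550

# Both orders give the same result (associativity), at very different cost
assert np.allclose((A @ B) @ C, A @ (B @ C))

# multi_dot chooses the optimal evaluation order for a chain of matmuls
out = np.linalg.multi_dot([A, B, C])
assert np.allclose(out, A @ (B @ C))
```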

Special Matrix Types

| Type | Definition | Properties | ML Example |
|---|---|---|---|
| Diagonal | $A_{ij} = 0$ for $i \neq j$ | $A^{-1}$ is diagonal; $\det = \prod_i a_{ii}$ | Scaling; $\Lambda$ in eigendecomposition |
| Orthogonal | $Q^\top Q = QQ^\top = I$ | Preserves norms and angles; $\det = \pm 1$ | $U, V$ in SVD; rotation matrices |
| Symmetric | $A = A^\top$ | Real eigenvalues; orthogonal eigenvectors | Covariance, Hessian, kernel matrices |
| SPD | Symmetric $+$ all $\lambda_i > 0$ | Unique Cholesky; defines an inner product | Covariance, Fisher information |
| Sparse | Most entries are zero | Efficient storage and multiplication | Adjacency matrices, attention masks |
| Toeplitz | Constant along diagonals | Matmul via FFT in $O(n \log n)$ | 1D convolution |
| Block diagonal | Diagonal blocks, zeros elsewhere | Operations decompose per block | Multi-head attention parameters |

Notation Summary

| Symbol | Meaning |
|---|---|
| $A^\top$ | Transpose of $A$ |
| $A^{-1}$ | Inverse of $A$ |
| $A^+$ | Moore--Penrose pseudoinverse |
| $I$ or $I_n$ | $n \times n$ identity matrix |
| $\text{rank}(A)$ | Rank (dimension of column space) |
| $\text{tr}(A)$ | Trace (sum of diagonal elements) |
| $\det(A)$ | Determinant |
| $\sigma_i$ | Singular values |
| $\lambda_i$ | Eigenvalues |
| $\|A\|_F$ | Frobenius norm |
| $\|A\|_2$ | Spectral norm ($= \sigma_{\max}$) |
| $L, U$ | Lower/upper triangular factors (LU decomposition) |
| $Q, R$ | Orthogonal/upper triangular factors (QR decomposition) |
