Randomized Linear Algebra for AI
Randomized numerical linear algebra (RandNLA) has emerged as a powerful paradigm for computing approximate matrix decompositions at scale (Martinsson & Tropp, 2020; Woodruff, 2014). The core strategy is to use random projections to reduce the problem dimension, compute exact decompositions on the smaller problem, and lift the result back to the original space. This approach provides strong theoretical guarantees (additive or multiplicative error bounds) with substantial computational savings -- often an order of magnitude or more in practice.
Randomized SVD
Computing the exact singular value decomposition (SVD) of an m x n matrix costs O(min(mn^2, m^2n)) -- prohibitive for the massive matrices encountered in AI. Halko et al. (2011) developed the randomized SVD, which computes an approximate rank-k SVD in O(mn * log(k)) time (using a structured random test matrix; O(mnk) with a Gaussian one) by projecting the matrix onto a random subspace of dimension k + p, where p is a small oversampling parameter, typically 5-10. The algorithm proceeds in three steps: (1) form a random matrix Omega of size n x (k+p); (2) compute the sketch Y = A * Omega and find an orthonormal basis Q for range(Y); (3) compute the SVD of the small matrix B = Q^T * A and recover the approximate SVD of A. The oversampling parameter p provides robustness: with p = 5, the expected spectral-norm error is within a factor (1 + 11*sqrt(k+p)*sqrt(min(m,n))) of the optimal error sigma_{k+1} (the (k+1)st singular value), except with small failure probability, and in practice the error is typically within 1% of the best rank-k approximation.
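The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `randomized_svd` is my own, and robust codes add power iterations when the singular values decay slowly.

```python
import numpy as np

def randomized_svd(A, k, p=5, seed=0):
    """Approximate rank-k SVD of A via random projection (Halko et al., 2011).

    p is the oversampling parameter; k + p columns are sketched.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Step 1: random test matrix Omega of size n x (k + p).
    Omega = rng.standard_normal((n, k + p))
    # Step 2: sketch Y = A @ Omega and orthonormalize it to get Q.
    Y = A @ Omega
    Q, _ = np.linalg.qr(Y)
    # Step 3: SVD of the small matrix B = Q^T A, then lift back to m dimensions.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]

# Usage: on an exactly rank-40 matrix, the rank-40 approximation is near-exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))
U, s, Vt = randomized_svd(A, k=40)
err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
```

The expensive step is the sketch Y = A @ Omega; everything downstream operates on matrices with only k + p columns.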
The randomized SVD is now the default algorithm for large-scale matrix factorization in scientific computing (it is the algorithm behind scikit-learn's TruncatedSVD and Facebook's PCA implementation for billion-scale embeddings). In AI, it is widely used for:
- Low-rank adaptation (LoRA initialization): LoRA's low-rank weight updates A*B are conceptually related to the randomized SVD's low-rank factorization. Initializing LoRA with the top-k SVD components of the pretrained weight matrix can provide a warm start.
- Model compression via low-rank factorization: Replacing a weight matrix W with its rank-k approximation U_k * Sigma_k * V_k^T reduces parameters from m*n to (m+n)*k, enabling compression ratios of 2-5x with minimal quality loss when the weight matrix has rapid singular value decay.
- Analyzing learned representations: PCA of hidden activations reveals the effective dimensionality and structure of learned representations. The randomized SVD enables this analysis on activations with millions of dimensions.
- Data preprocessing (PCA on large datasets): Computing PCA on datasets with millions of samples and thousands of features is often infeasible with the exact SVD; the randomized SVD makes it routine at this scale.
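As a concrete illustration of the compression use case above, the following sketch replaces a weight matrix W with two factors whose product approximates it, cutting the parameter count from m*n to (m+n)*k. The helper name `compress_layer` is illustrative, and the exact SVD is used here for simplicity (the randomized SVD would replace it at scale).

```python
import numpy as np

def compress_layer(W, k):
    """Replace W (m x n) with factors A (m x k) and B (k x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Fold the singular values into the left factor, so W ~= A @ B.
    A = U[:, :k] * s[:k]
    B = Vt[:k, :]
    return A, B

rng = np.random.default_rng(0)
m, n, k = 1024, 1024, 128
# A synthetic weight matrix with rapid singular value decay (rank 64 here).
W = rng.standard_normal((m, 64)) @ rng.standard_normal((64, n))
A, B = compress_layer(W, k)
ratio = (m * n) / ((m + n) * k)   # parameter compression ratio: 4x for these sizes
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

At inference time, x @ W is replaced by (x @ A) @ B, which also reduces the multiply cost when k is much smaller than min(m, n).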
Nystrom Approximation
The Nystrom method (Williams & Seeger, 2001) approximates a large kernel matrix by sampling a subset of its columns, computing the exact kernel on this subset, and extrapolating to the full matrix. For an n x n kernel matrix, Nystrom reduces computation from O(n^3) to O(n * k^2) where k is the number of sampled columns. Nystrom approximations are used in:
- Kernel learning with large datasets
- Gaussian process inference
- Spectral clustering
- Approximating attention matrices (Nystromformer (Xiong et al., 2021))
Xiong et al. (2021) proposed Nystromformer, which applies the Nystrom method to approximate the softmax attention matrix. By sampling a subset of "landmark" tokens and computing attention only with respect to these landmarks, Nystromformer achieves linear complexity while maintaining competitive performance.
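A simplified sketch of landmark attention in this spirit is shown below. It uses segment-mean landmarks as in the paper but an exact pseudo-inverse where the paper uses an iterative approximation, and it omits batching, heads, and masking; names are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def nystrom_attention(Q, K, V, m):
    """Approximate softmax(Q K^T / sqrt(d)) @ V with m landmark tokens.

    Cost is O(n * m) in the sequence length n instead of O(n^2).
    """
    n, d = Q.shape
    # Landmarks: segment means of the queries and keys (n must be divisible by m).
    Ql = Q.reshape(m, n // m, d).mean(1)
    Kl = K.reshape(m, n // m, d).mean(1)
    F = softmax(Q @ Kl.T / np.sqrt(d))    # n x m: queries vs. landmark keys
    A = softmax(Ql @ Kl.T / np.sqrt(d))   # m x m: landmark core matrix
    B = softmax(Ql @ K.T / np.sqrt(d))    # m x n: landmark queries vs. all keys
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = nystrom_attention(Q, K, V, m=32)
```

Note that the three small softmax matrices F, A, B stand in for the sampled blocks C, W, R of the kernel Nystrom construction; the full n x n attention matrix is never materialized.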
Matrix Completion
Candes and Recht (2009) showed that low-rank matrices can be exactly recovered from a small number of randomly observed entries, provided the matrix satisfies certain incoherence conditions. This result, known as exact matrix completion, has profound implications for recommendation systems (recovering a user-item rating matrix from sparse observations) and other collaborative filtering problems. Recht (2011) provided a simpler proof and practical algorithms. The connection to randomized algorithms is fundamental: the observed entries must be randomly (not adversarially) selected, and the recovery algorithms use nuclear norm minimization or iterative SVD -- both connected to the randomized linear algebra tools described above.
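An iterative-SVD completion scheme of the kind referenced above can be sketched as follows. This is a hard-impute variant (alternating between filling missing entries and rank truncation), with illustrative names and parameters; nuclear-norm solvers instead soft-threshold the singular values.

```python
import numpy as np

def svd_impute(M_obs, mask, rank, iters=200):
    """Complete a low-rank matrix by iterative SVD (a hard-impute scheme).

    mask is True on observed entries; missing entries are repeatedly filled
    with the current rank-`rank` estimate while observed entries stay fixed.
    """
    X = np.where(mask, M_obs, 0.0)
    Z = X
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Z = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-k fit to current X
        X = np.where(mask, M_obs, Z)               # re-impose the observed entries
    return Z

# Usage: recover a rank-3 matrix from roughly half of its entries, sampled at random.
rng = np.random.default_rng(0)
m, n, r = 60, 50, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-3 ground truth
mask = rng.random((m, n)) < 0.5                                  # random observation pattern
M_hat = svd_impute(M, mask, rank=r)
rel_err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
```

The random observation pattern is essential: if an adversary hid an entire row or column, no algorithm could recover it.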
CUR Decomposition
CUR decomposition factorizes a matrix M as M approximately equals C * U * R, where C consists of actual columns of M, R consists of actual rows of M, and U is a small linking matrix (Mahoney & Drineas, 2009). Unlike SVD, CUR preserves the sparsity and interpretability of the original matrix. Clarkson and Woodruff (2017) showed that CUR decomposition can be computed in input-sparsity time (linear in the number of nonzeros), making it particularly efficient for sparse matrices common in recommendation systems and NLP. In AI, CUR decomposition has been applied to:
- Interpretable dimensionality reduction (selected features are actual data points)
- Model compression (selecting actual neurons/filters to keep)
- Data summarization (selecting representative examples)
References
- Emmanuel J. Candes, Benjamin Recht (2009). Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics.
- Kenneth L. Clarkson, David P. Woodruff (2017). Low-Rank Approximation and Regression in Input Sparsity Time. JACM.
- Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review.
- Michael W. Mahoney, Petros Drineas (2009). CUR Matrix Decompositions for Improved Data Analysis. PNAS.
- Per-Gunnar Martinsson, Joel A. Tropp (2020). Randomized Numerical Linear Algebra: Foundations and Algorithms. Acta Numerica.
- Benjamin Recht (2011). A Simpler Approach to Matrix Completion. JMLR.
- Christopher K.I. Williams, Matthias Seeger (2001). Using the Nystrom Method to Speed Up Kernel Machines. NeurIPS.
- David P. Woodruff (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science.
- Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, et al. (2021). Nystromformer: A Nystrom-Based Algorithm for Approximating Self-Attention. AAAI.