Taxonomy of Approaches
The methods surveyed in this chapter organize along a single spine: what the randomness or transform acts on. This yields two clearly separated levels. The first level groups methods by algorithmic technique, the concrete randomized primitive applied to data, weights, or computation (families 1 through 5, plus family 9). The second level groups methods by analytical lens, the mathematical viewpoint used to study and approximate models rather than a distinct primitive (families 6 through 8). Each family below states its organizing principle, its flagship methods, and the central tradeoff it navigates.
Level one: algorithmic technique families
-
Random projections and dimensionality reduction: The Johnson-Lindenstrauss lemma [@johnson1984extensions, @achlioptas2003database] provides the theoretical foundation: high-dimensional data can be projected to O(epsilon^{-2} log n) dimensions while preserving all pairwise distances to within (1 +/- epsilon). This underlies random Fourier features for kernel approximation (Rahimi & Recht, 2007) (whose Fourier-domain interpretation is revisited as an analytical lens in family 6), compressed sensing [@donoho2006compressed, @candes2006robust], and structured random projections for efficient embedding. Tradeoff: projection dimension trades approximation accuracy against compute and memory, since the distortion bound shrinks only as the inverse square root of the number of dimensions.
-
Randomized regularization and training: Dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), DropBlock, random data augmentation, and stochastic gradient descent (Robbins & Monro, 1951) with its variants (Adam (Kingma & Ba, 2015), AdaGrad). These methods use randomness to prevent overfitting, improve generalization, and enable scalable optimization. The theoretical connections to ensemble learning (dropout as exponential ensembling) and Bayesian inference (dropout as approximate variational inference (Gal & Ghahramani, 2016)) provide principled justifications for these widely used techniques. Tradeoff: injected noise trades reduced overfitting and better generalization against slower convergence and added variance in the training signal.
-
Hashing for machine learning: Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998), SimHash, MinHash, learning-to-hash methods, and the SLIDE algorithm (Chen et al., 2020). Hashing enables sub-linear similarity search and sparse computation, with applications in approximate nearest neighbor search (HNSW (Malkov & Yashunin, 2020), FAISS (Johnson et al., 2021)) and attention sparsification (Reformer (Kitaev et al., 2020)). Tradeoff: the number of hash tables and bits per code trades query speed against recall, since coarser buckets retrieve faster but miss more true neighbors.
-
Sketching and streaming: Count-Min Sketch (Cormode & Muthukrishnan, 2005), CountSketch (Charikar et al., 2004), streaming PCA (Frequent Directions (Liberty, 2013)), and gradient compression [@ivkin2019communication, @spring2019compressing]. These algorithms process data in a single pass with bounded memory, providing the algorithmic foundation for online learning, distributed training, and continual learning. Tradeoff: sketch size trades memory footprint against estimation error, since a smaller summary saves space but widens the confidence interval on each query.
-
Probabilistic data structures: Bloom filters (Bloom, 1970) for approximate set membership (treated in a dedicated deep-dive sub-page within this family), HyperLogLog (Flajolet et al., 2007) for cardinality estimation, reservoir sampling (Vitter, 1985) for maintaining uniform random samples from streams, and consistent hashing (Karger et al., 1997) for distributed data management. These structures provide bounded-error guarantees with dramatically reduced memory. Tradeoff: allocated capacity trades memory against false-positive or estimation-error rate, since shrinking the structure raises the chance of collisions and miscounts.
Level two: analytical lens families
-
Fourier and spectral perspectives: Fourier neural operators (FNO) (Li et al., 2021), AFNO (Guibas et al., 2022), FNet (Lee-Thorp et al., 2022), spectral bias analysis, and the Neural Tangent Kernel's spectral characterization. Random Fourier features (Rahimi & Recht, 2007), whose primary home is the random-projection family above, also belong here as a hybrid: their construction is a random projection, yet they are best understood through the Fourier lens as a Monte Carlo approximation of a shift-invariant kernel. The Fourier transform provides the mathematical bridge between spatial/temporal patterns and frequency-domain analysis of neural network behavior. Tradeoff: the number of retained frequency modes trades model expressiveness against compute and the risk of aliasing high-frequency detail.
-
Wavelet and multi-resolution analysis: Wavelet scattering networks (Bruna & Mallat, 2013), WaveMix, and multi-resolution training strategies. Wavelets provide time-frequency localization that the Fourier transform lacks, enabling analysis of signals with features at multiple scales, directly relevant to neural networks that must capture patterns at different levels of abstraction (Daubechies, 1992). Tradeoff: the Heisenberg uncertainty of the transform forces a choice between time resolution and frequency resolution at each scale, so no single decomposition is sharp in both.
-
Signal processing view of Transformers: Attention as adaptive filtering (Dong et al., 2021), positional encoding as frequency representation (RoPE (Su et al., 2024)), token mixing as convolution, and randomized attention approximations (Performer (Choromanski et al., 2021), Reformer, linear attention (Katharopoulos et al., 2020)). This perspective connects the Transformer architecture to the rich theory of adaptive signal processing. Tradeoff: approximating full attention as a linear or sparse filter trades quadratic cost for linear cost at the price of fidelity, since the approximation discards exact pairwise interactions.
Returning to technique: randomized linear algebra
- Randomized linear algebra: Randomized SVD [@halko2011finding, @martinsson2020randomized], Nystrom approximation, CUR decomposition (Clarkson & Woodruff, 2017), and matrix completion (Candes & Recht, 2009). This is an algorithmic-technique family in the spirit of families 1 through 5; it appears last because random projections (family 1) supply its core primitive. These algorithms compute approximate matrix factorizations in roughly O(mn k) time for target rank k, where classical deterministic SVD costs O(mn min(m,n)), enabling efficient low-rank approximation at scale. Tradeoff: the number of sampled columns or oversampling factor trades runtime against the accuracy of the recovered factorization, since a smaller sketch is faster but captures less of the spectrum.
These families are deeply interconnected. For example, Performer (category 8) uses random features (category 1) to approximate the attention kernel, and Reformer (category 8) uses locality-sensitive hashing (category 3) to identify relevant key-value pairs. The signal processing perspective (categories 6-8) provides the theoretical framework for understanding why these approximations work and when they fail. Random projections (category 1) underlie both sketching algorithms (category 4) and randomized linear algebra (category 9), as the JL lemma guarantees that random projections preserve the geometric structure needed for downstream computation.
Comparing the families, a single axis recurs across every tradeoff above: each method spends a budget (projection dimension, sketch size, hash bits, retained modes, sampled columns) to buy a controlled approximation error, and the families differ mainly in which resource they conserve. The technique families (1 through 5, 9) trade memory or compute for accuracy on a concrete primitive, while the analytical-lens families (6 through 8) trade fidelity of the model approximation for tractability. The practical choice among them therefore reduces to which resource is scarcest in a given setting (memory for sketching and probabilistic structures, latency for hashing, model expressiveness for spectral and Transformer approximations) rather than to any single dominant method.
References
- Burton H. Bloom (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM. ↗
- Joan Bruna, Stephane Mallat (2013). Invariant Scattering Convolution Networks. IEEE TPAMI. ↗
- Emmanuel J. Candes, Benjamin Recht (2009). Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics. ↗
- Moses Charikar, Kevin Chen, Martin Farach-Colton (2004). Finding Frequent Items in Data Streams. Theoretical Computer Science. ↗
- Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Re (2020). SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. MLSys. ↗
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan (2021). Rethinking Attention with Performers. ICLR. ↗
- Kenneth L. Clarkson, David P. Woodruff (2017). Low-Rank Approximation and Regression in Input Sparsity Time. JACM. ↗
- Graham Cormode, S. Muthukrishnan (2005). An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms. ↗
- Ingrid Daubechies (1992). Ten Lectures on Wavelets. SIAM. ↗
- Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas (2021). Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth. ICML. ↗
- Philippe Flajolet, Eric Fusy, Olivier Gandouet, Frederic Meunier (2007). HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm. DMTCS Proceedings. ↗
- Yarin Gal, Zoubin Ghahramani (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML. ↗
- John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro (2022). Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers. ICLR. ↗
- Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger (2016). Deep Networks with Stochastic Depth. ECCV. ↗
- Piotr Indyk, Rajeev Motwani (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC. ↗
- Jeff Johnson, Matthijs Douze, Herve Jegou (2021). Billion-scale Similarity Search with GPUs. IEEE TBD. ↗
- David Karger, Eric Lehman, Tom Leighton (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. STOC. ↗
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Francois Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML. ↗
- Diederik P. Kingma, Jimmy Ba (2015). Adam: A Method for Stochastic Optimization. ICLR. ↗
- Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya (2020). Reformer: The Efficient Transformer. ICLR. ↗
- James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL. ↗
- Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar (2021). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR. ↗
- Edo Liberty (2013). Simple and Deterministic Matrix Sketching. KDD. ↗
- Yu A. Malkov, Dmitry A. Yashunin (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI. ↗
- Ali Rahimi, Benjamin Recht (2007). Random Features for Large-Scale Kernel Machines. NeurIPS. ↗
- Herbert Robbins, Sutton Monro (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics. ↗
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR. ↗
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing. ↗
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software. ↗