Taxonomy of Approaches
The methods surveyed in this chapter can be organized into nine interconnected families, spanning classical randomized algorithms, signal processing theory, and their applications in modern AI:
1. Random projections and dimensionality reduction: The Johnson-Lindenstrauss lemma [@johnson1984extensions; @achlioptas2003database] provides the theoretical foundation: a set of $n$ high-dimensional points can be projected to $O(\varepsilon^{-2} \log n)$ dimensions while preserving all pairwise distances to within a factor of $(1 \pm \varepsilon)$. This underlies random Fourier features for kernel approximation (Rahimi & Recht, 2007), compressed sensing [@donoho2006compressed; @candes2006robust], and structured random projections for efficient embedding. A minimal Gaussian-projection sketch follows the list.
2. Randomized regularization and training: Dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), DropBlock, random data augmentation, and stochastic gradient descent (Robbins & Monro, 1951) together with adaptive variants such as Adam (Kingma & Ba, 2015) and AdaGrad. These methods use randomness to prevent overfitting, improve generalization, and enable scalable optimization. The theoretical connections to ensemble learning (dropout as an implicit ensemble of exponentially many subnetworks) and Bayesian inference (dropout as approximate variational inference; Gal & Ghahramani, 2016) provide principled justifications for these widely used techniques. An inverted-dropout sketch follows the list.
3. Hashing for machine learning: Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998), SimHash, MinHash, learning-to-hash methods, and the SLIDE algorithm (Chen et al., 2020). Hashing enables sub-linear similarity search and sparse computation, with applications in approximate nearest-neighbor search (HNSW (Malkov & Yashunin, 2020), FAISS (Johnson et al., 2021)) and attention sparsification (Reformer; Kitaev et al., 2020). A SimHash sketch follows the list.
4. Sketching and streaming: Count-Min Sketch (Cormode & Muthukrishnan, 2005), CountSketch (Charikar et al., 2004), streaming PCA via Frequent Directions (Liberty, 2013), and gradient compression [@ivkin2019communication; @spring2019compressing]. These algorithms process data in a single pass with bounded memory, providing the algorithmic foundation for online learning, distributed training, and continual learning. A Count-Min Sketch example follows the list.
5. Probabilistic data structures: Bloom filters (Bloom, 1970) for approximate set membership, HyperLogLog (Flajolet et al., 2007) for cardinality estimation, reservoir sampling (Vitter, 1985) for maintaining uniform random samples from streams, and consistent hashing (Karger et al., 1997) for distributed data management. These structures provide bounded-error guarantees with dramatically reduced memory. A reservoir-sampling sketch follows the list.
6. Fourier and spectral perspectives: Random Fourier features (Rahimi & Recht, 2007), Fourier neural operators (FNO) (Li et al., 2021), AFNO (Guibas et al., 2022), FNet (Lee-Thorp et al., 2022), spectral bias analysis, and the Neural Tangent Kernel's spectral characterization. The Fourier transform provides the mathematical bridge between spatial/temporal patterns and frequency-domain analysis of neural network behavior. A random-Fourier-features sketch follows the list.
7. Wavelet and multi-resolution analysis: Wavelet scattering networks (Bruna & Mallat, 2013), WaveMix, and multi-resolution training strategies. Wavelets provide the time-frequency localization that the Fourier transform lacks, enabling analysis of signals with features at multiple scales; this is directly relevant to neural networks, which must capture patterns at different levels of abstraction (Daubechies, 1992). A Haar-wavelet sketch follows the list.
8. Signal processing view of Transformers: Attention as adaptive filtering (Dong et al., 2021), positional encoding as frequency representation (RoPE; Su et al., 2024), token mixing as convolution, and randomized attention approximations (Performer (Choromanski et al., 2021), Reformer, and linear attention (Katharopoulos et al., 2020)). This perspective connects the Transformer architecture to the rich theory of adaptive signal processing. A linear-attention sketch follows the list.
9. Randomized linear algebra: Randomized SVD [@halko2011finding; @martinsson2020randomized], Nystrom approximation, CUR decomposition, input-sparsity-time low-rank approximation (Clarkson & Woodruff, 2017), and matrix completion (Candes & Recht, 2009). These algorithms compute approximate rank-$k$ factorizations in time roughly proportional to the number of matrix entries times $k$, rather than the cubic cost of a full decomposition, enabling efficient low-rank approximation at scale. A randomized-SVD sketch follows the list.
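To make the taxonomy concrete, the short sketches below illustrate one representative technique from each family. They are minimal Python/NumPy illustrations under stated assumptions, not the implementations behind the cited systems, and every function name, constant, and default parameter in them is a choice made here for exposition.

For random projections (family 1), a dense Gaussian Johnson-Lindenstrauss map; the target dimension $4 \ln(n) / \varepsilon^2$ is one common, conservative choice, and `jl_project` is a hypothetical helper.

```python
import numpy as np

def jl_project(X, eps=0.1, rng=None):
    """Gaussian JL map: project the rows of X down to k = O(eps^-2 log n) dimensions."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    k = int(np.ceil(4 * np.log(n) / eps ** 2))            # one common (conservative) target dimension
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))    # N(0, 1/k) entries preserve norms in expectation
    return X @ R

# Pairwise distances survive to within roughly (1 +/- eps) with high probability.
X = np.random.default_rng(0).normal(size=(100, 10_000))
Y = jl_project(X, eps=0.25)
print(np.linalg.norm(X[3] - X[17]), np.linalg.norm(Y[3] - Y[17]))
```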
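For randomized regularization and training (family 2), inverted dropout, assuming the usual convention of rescaling surviving activations by $1/(1-p)$ at training time so that no rescaling is needed at test time; this is a standalone sketch, not any particular framework's API.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training and rescale
    the survivors, so the expected activation matches the test-time forward pass."""
    if not train or p == 0.0:
        return x
    rng = np.random.default_rng(rng)
    keep = rng.random(x.shape) >= p              # keep each unit independently with probability 1 - p
    return np.where(keep, x / (1.0 - p), 0.0)

h = np.ones(8)
print(dropout(h, p=0.5, rng=0))                  # a random subset of units survives, scaled by 2
```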
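For hashing (family 3), SimHash: the sign pattern of a handful of random projections yields a binary code whose Hamming distance tracks the angle between the original vectors. The dimensionality and bit width below are arbitrary choices for the example.

```python
import numpy as np

def simhash(x, hyperplanes):
    """SimHash code: one bit per random hyperplane, set by the sign of the projection."""
    return (hyperplanes @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, n_bits = 128, 16
H = rng.normal(size=(n_bits, d))                 # one random hyperplane per output bit

a = rng.normal(size=d)
b = a + 0.1 * rng.normal(size=d)                 # near-duplicate of a
c = rng.normal(size=d)                           # unrelated vector

print(np.sum(simhash(a, H) != simhash(b, H)))    # few differing bits for the near-duplicate
print(np.sum(simhash(a, H) != simhash(c, H)))    # about n_bits / 2 differing bits for the unrelated vector
```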
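For sketching and streaming (family 4), a toy Count-Min Sketch. Python's built-in `hash`, keyed per row, stands in for the pairwise-independent hash families assumed by the formal analysis, so the error bounds here are heuristic rather than exact.

```python
import numpy as np

class CountMinSketch:
    """Toy Count-Min Sketch: depth x width counters giving frequency estimates with one-sided error."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _col(self, item, row):
        # One hash function per row; a stand-in for a pairwise-independent family.
        return hash((row, item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row, self._col(item, row)] += count

    def query(self, item):
        # Collisions only ever add, so every row overestimates; the minimum is the tightest bound.
        return int(min(self.table[row, self._col(item, row)] for row in range(self.depth)))

cms = CountMinSketch()
for token in ["the"] * 100 + ["cat"] * 3:
    cms.add(token)
print(cms.query("the"), cms.query("cat"))        # >= the true counts (100 and 3); typically equal here
```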
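For probabilistic data structures (family 5), reservoir sampling in the style of Vitter's Algorithm R: it maintains a uniform random sample of $k$ items from a stream whose length is not known in advance.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: keep a uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)               # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)                # item i replaces a slot with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=0))
```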
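For the Fourier perspective (family 6), random Fourier features for the RBF kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$: the inner product of the feature maps approximates the kernel, and the approximation tightens as the number of features grows. Parameter names such as `n_features` and `gamma` are local to this sketch.

```python
import numpy as np

def rff_features(X, n_features=1024, gamma=1.0, rng=None):
    """Random Fourier features: z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))  # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)               # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff_features(X, n_features=4096, gamma=0.5, rng=1)
exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.max(np.abs(Z @ Z.T - exact)))           # approximation error shrinks as n_features grows
```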
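For wavelets (family 7), one level of the orthonormal Haar transform, the simplest multi-resolution decomposition: local averages give a coarse view of the signal and local differences capture the detail lost at that scale; repeating the step on the coarse part builds the multi-resolution hierarchy.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal Haar transform: (coarse approximation, detail)."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)         # local averages: half-resolution view
    detail = (even - odd) / np.sqrt(2.0)         # local differences: detail lost at this scale
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a1, d1 = haar_step(x)                            # half-resolution approximation
a2, d2 = haar_step(a1)                           # quarter-resolution approximation, and so on
print(a1, d1)
print(a2, d2)
```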
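For the signal processing view of Transformers (family 8), unmasked single-head linear attention in the spirit of Katharopoulos et al. (2020): replacing the softmax with a positive feature map (here $\mathrm{elu}(x) + 1$) lets all key-value pairs be summarized once, reducing the quadratic cost in sequence length to linear.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Unmasked linear attention: softmax(Q K^T) V replaced by a positive feature map phi."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, always positive
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                # (d_k, d_v) summary of all key-value pairs
    Z = Qp @ Kp.sum(axis=0)                      # per-query normalizer, replacing the softmax denominator
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 6, 4
out = linear_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(out.shape)                                 # (6, 4): one mixed value vector per query
```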
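For randomized linear algebra (family 9), a basic randomized SVD following the sample-then-project recipe of Halko, Martinsson and Tropp; the oversampling and power-iteration counts below are typical defaults, not tuned values.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, n_iter=2, rng=None):
    """Randomized SVD: sample the range of A randomly, then solve a small exact SVD."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.normal(size=(n, rank + n_oversample))    # random test matrix
    Y = A @ Omega                                        # random sample of the column space of A
    for _ in range(n_iter):                              # optional power iterations sharpen accuracy
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                               # orthonormal basis for the sampled range
    B = Q.T @ A                                          # small (rank + oversample) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 400))   # an exactly rank-20 matrix
U, s, Vt = randomized_svd(A, rank=20, rng=1)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # tiny relative error for this low-rank example
```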
These families are deeply interconnected. For example, Performer (category 8) uses random features (category 1) to approximate the attention kernel, and Reformer (category 8) uses locality-sensitive hashing (category 3) to identify relevant key-value pairs. The signal processing perspective (categories 6-8) provides the theoretical framework for understanding why these approximations work and when they fail. Random projections (category 1) underlie both sketching algorithms (category 4) and randomized linear algebra (category 9), as the JL lemma guarantees that random projections preserve the geometric structure needed for downstream computation.
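As a concrete instance of this interconnection, the sketch below estimates the unnormalized softmax-attention kernel $\exp(q \cdot k)$ with Performer-style positive random features. It illustrates only the kernel estimate, not the full FAVOR+ attention mechanism, and the feature count `m` is an arbitrary choice here.

```python
import numpy as np

def softmax_kernel_features(X, W):
    """Positive random features whose inner products estimate exp(q . k),
    the unnormalized softmax-attention kernel (m random Gaussian directions in W)."""
    m = W.shape[1]
    return np.exp(X @ W - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 4096
W = rng.normal(size=(d, m))                      # random directions shared by queries and keys
q = 0.3 * rng.normal(size=(1, d))
k = 0.3 * rng.normal(size=(1, d))
approx = (softmax_kernel_features(q, W) @ softmax_kernel_features(k, W).T).item()
print(approx, np.exp(q @ k.T).item())            # random-feature estimate vs. exact kernel value
```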
References
- Burton H. Bloom (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM.
- Joan Bruna, Stephane Mallat (2013). Invariant Scattering Convolution Networks. IEEE TPAMI.
- Emmanuel J. Candes, Benjamin Recht (2009). Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics.
- Moses Charikar, Kevin Chen, Martin Farach-Colton (2004). Finding Frequent Items in Data Streams. Theoretical Computer Science.
- Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Re (2020). SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. MLSys.
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, et al. (2021). Rethinking Attention with Performers. ICLR.
- Kenneth L. Clarkson, David P. Woodruff (2017). Low-Rank Approximation and Regression in Input Sparsity Time. JACM.
- Graham Cormode, S. Muthukrishnan (2005). An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms.
- Ingrid Daubechies (1992). Ten Lectures on Wavelets. SIAM.
- Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas (2021). Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth. ICML.
- Philippe Flajolet, Eric Fusy, Olivier Gandouet, Frederic Meunier (2007). HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm. DMTCS Proceedings.
- Yarin Gal, Zoubin Ghahramani (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML.
- John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro (2022). Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers. ICLR.
- Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger (2016). Deep Networks with Stochastic Depth. ECCV.
- Piotr Indyk, Rajeev Motwani (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC.
- Jeff Johnson, Matthijs Douze, Herve Jegou (2021). Billion-scale Similarity Search with GPUs. IEEE TBD.
- David Karger, Eric Lehman, Tom Leighton, et al. (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. STOC.
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Francois Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML.
- Diederik P. Kingma, Jimmy Ba (2015). Adam: A Method for Stochastic Optimization. ICLR.
- Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya (2020). Reformer: The Efficient Transformer. ICLR.
- James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL.
- Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar (2021). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR.
- Edo Liberty (2013). Simple and Deterministic Matrix Sketching. KDD.
- Yu A. Malkov, Dmitry A. Yashunin (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
- Ali Rahimi, Benjamin Recht (2007). Random Features for Large-Scale Kernel Machines. NeurIPS.
- Herbert Robbins, Sutton Monro (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing.
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software.