Taxonomy of Approaches

The methods surveyed in this chapter can be organized into nine interconnected families, spanning classical randomized algorithms, signal processing theory, and their applications in modern AI:

Random projections and dimensionality reduction: The Johnson-Lindenstrauss lemma [@johnson1984extensions, @achlioptas2003database] provides the theoretical foundation: high-dimensional data can be projected to O(epsilon^{-2} log n) dimensions while preserving all pairwise distances to within (1 +/- epsilon). This underlies random Fourier features for kernel approximation (Rahimi & Recht, 2007), compressed sensing [@donoho2006compressed, @candes2006robust], and structured random projections for efficient embedding.
Randomized regularization and training: Dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), DropBlock, random data augmentation, and stochastic gradient descent (Robbins & Monro, 1951) with its variants (Adam (Kingma & Ba, 2015), AdaGrad). These methods use randomness to prevent overfitting, improve generalization, and enable scalable optimization. The theoretical connections to ensemble learning (dropout as exponential ensembling) and Bayesian inference (dropout as approximate variational inference (Gal & Ghahramani, 2016)) provide principled justifications for these widely used techniques.
Hashing for machine learning: Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998), SimHash, MinHash, learning-to-hash methods, and the SLIDE algorithm (Chen et al., 2020). Hashing enables sub-linear similarity search and sparse computation, with applications in approximate nearest neighbor search (HNSW (Malkov & Yashunin, 2020), FAISS (Johnson et al., 2021)) and attention sparsification (Reformer (Kitaev et al., 2020)).
Sketching and streaming: Count-Min Sketch (Cormode & Muthukrishnan, 2005), CountSketch (Charikar et al., 2004), streaming PCA (Frequent Directions (Liberty, 2013)), and gradient compression [@ivkin2019communication, @spring2019compressing]. These algorithms process data in a single pass with bounded memory, providing the algorithmic foundation for online learning, distributed training, and continual learning.
Probabilistic data structures: Bloom filters (Bloom, 1970) for approximate set membership, HyperLogLog (Flajolet et al., 2007) for cardinality estimation, reservoir sampling (Vitter, 1985) for maintaining uniform random samples from streams, and consistent hashing (Karger et al., 1997) for distributed data management. These structures provide bounded-error guarantees with dramatically reduced memory.
Fourier and spectral perspectives: Random Fourier features (Rahimi & Recht, 2007), Fourier neural operators (FNO) (Li et al., 2021), AFNO (Guibas et al., 2022), FNet (Lee-Thorp et al., 2022), spectral bias analysis, and the Neural Tangent Kernel's spectral characterization. The Fourier transform provides the mathematical bridge between spatial/temporal patterns and frequency-domain analysis of neural network behavior.
Wavelet and multi-resolution analysis: Wavelet scattering networks (Bruna & Mallat, 2013), WaveMix, and multi-resolution training strategies. Wavelets provide time-frequency localization that the Fourier transform lacks, enabling analysis of signals with features at multiple scales -- directly relevant to neural networks that must capture patterns at different levels of abstraction (Daubechies, 1992).
Signal processing view of Transformers: Attention as adaptive filtering (Dong et al., 2021), positional encoding as frequency representation (RoPE (Su et al., 2024)), token mixing as convolution, and randomized attention approximations (Performer (Choromanski et al., 2021), Reformer, linear attention (Katharopoulos et al., 2020)). This perspective connects the Transformer architecture to the rich theory of adaptive signal processing.
Randomized linear algebra: Randomized SVD [@halko2011finding, @martinsson2020randomized], Nystrom approximation, CUR decomposition (Clarkson & Woodruff, 2017), and matrix completion (Candes & Recht, 2009). These algorithms compute approximate matrix factorizations in time proportional to the matrix size (rather than the matrix rank cubed), enabling efficient low-rank approximation at scale.

These families are deeply interconnected. For example, Performer (category 8) uses random features (category 1) to approximate the attention kernel, and Reformer (category 8) uses locality-sensitive hashing (category 3) to identify relevant key-value pairs. The signal processing perspective (categories 6-8) provides the theoretical framework for understanding why these approximations work and when they fail. Random projections (category 1) underlie both sketching algorithms (category 4) and randomized linear algebra (category 9), as the JL lemma guarantees that random projections preserve the geometric structure needed for downstream computation.

References