Skip to main content

Taxonomy of Approaches

The methods surveyed in this chapter organize along a single spine: what the randomness or transform acts on. This yields two clearly separated levels. The first level groups methods by algorithmic technique, the concrete randomized primitive applied to data, weights, or computation (families 1 through 5, plus family 9). The second level groups methods by analytical lens, the mathematical viewpoint used to study and approximate models rather than a distinct primitive (families 6 through 8). Each family below states its organizing principle, its flagship methods, and the central tradeoff it navigates.

Level one: algorithmic technique families

  1. Random projections and dimensionality reduction: The Johnson-Lindenstrauss lemma [@johnson1984extensions, @achlioptas2003database] provides the theoretical foundation: high-dimensional data can be projected to O(epsilon^{-2} log n) dimensions while preserving all pairwise distances to within (1 +/- epsilon). This underlies random Fourier features for kernel approximation (Rahimi & Recht, 2007) (whose Fourier-domain interpretation is revisited as an analytical lens in family 6), compressed sensing [@donoho2006compressed, @candes2006robust], and structured random projections for efficient embedding. Tradeoff: projection dimension trades approximation accuracy against compute and memory, since the distortion bound shrinks only as the inverse square root of the number of dimensions.

  2. Randomized regularization and training: Dropout (Srivastava et al., 2014), stochastic depth (Huang et al., 2016), DropBlock, random data augmentation, and stochastic gradient descent (Robbins & Monro, 1951) with its variants (Adam (Kingma & Ba, 2015), AdaGrad). These methods use randomness to prevent overfitting, improve generalization, and enable scalable optimization. The theoretical connections to ensemble learning (dropout as exponential ensembling) and Bayesian inference (dropout as approximate variational inference (Gal & Ghahramani, 2016)) provide principled justifications for these widely used techniques. Tradeoff: injected noise trades reduced overfitting and better generalization against slower convergence and added variance in the training signal.

  3. Hashing for machine learning: Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998), SimHash, MinHash, learning-to-hash methods, and the SLIDE algorithm (Chen et al., 2020). Hashing enables sub-linear similarity search and sparse computation, with applications in approximate nearest neighbor search (HNSW (Malkov & Yashunin, 2020), FAISS (Johnson et al., 2021)) and attention sparsification (Reformer (Kitaev et al., 2020)). Tradeoff: the number of hash tables and bits per code trades query speed against recall, since coarser buckets retrieve faster but miss more true neighbors.

  4. Sketching and streaming: Count-Min Sketch (Cormode & Muthukrishnan, 2005), CountSketch (Charikar et al., 2004), streaming PCA (Frequent Directions (Liberty, 2013)), and gradient compression [@ivkin2019communication, @spring2019compressing]. These algorithms process data in a single pass with bounded memory, providing the algorithmic foundation for online learning, distributed training, and continual learning. Tradeoff: sketch size trades memory footprint against estimation error, since a smaller summary saves space but widens the confidence interval on each query.

  5. Probabilistic data structures: Bloom filters (Bloom, 1970) for approximate set membership (treated in a dedicated deep-dive sub-page within this family), HyperLogLog (Flajolet et al., 2007) for cardinality estimation, reservoir sampling (Vitter, 1985) for maintaining uniform random samples from streams, and consistent hashing (Karger et al., 1997) for distributed data management. These structures provide bounded-error guarantees with dramatically reduced memory. Tradeoff: allocated capacity trades memory against false-positive or estimation-error rate, since shrinking the structure raises the chance of collisions and miscounts.

Level two: analytical lens families

  1. Fourier and spectral perspectives: Fourier neural operators (FNO) (Li et al., 2021), AFNO (Guibas et al., 2022), FNet (Lee-Thorp et al., 2022), spectral bias analysis, and the Neural Tangent Kernel's spectral characterization. Random Fourier features (Rahimi & Recht, 2007), whose primary home is the random-projection family above, also belong here as a hybrid: their construction is a random projection, yet they are best understood through the Fourier lens as a Monte Carlo approximation of a shift-invariant kernel. The Fourier transform provides the mathematical bridge between spatial/temporal patterns and frequency-domain analysis of neural network behavior. Tradeoff: the number of retained frequency modes trades model expressiveness against compute and the risk of aliasing high-frequency detail.

  2. Wavelet and multi-resolution analysis: Wavelet scattering networks (Bruna & Mallat, 2013), WaveMix, and multi-resolution training strategies. Wavelets provide time-frequency localization that the Fourier transform lacks, enabling analysis of signals with features at multiple scales, directly relevant to neural networks that must capture patterns at different levels of abstraction (Daubechies, 1992). Tradeoff: the Heisenberg uncertainty of the transform forces a choice between time resolution and frequency resolution at each scale, so no single decomposition is sharp in both.

  3. Signal processing view of Transformers: Attention as adaptive filtering (Dong et al., 2021), positional encoding as frequency representation (RoPE (Su et al., 2024)), token mixing as convolution, and randomized attention approximations (Performer (Choromanski et al., 2021), Reformer, linear attention (Katharopoulos et al., 2020)). This perspective connects the Transformer architecture to the rich theory of adaptive signal processing. Tradeoff: approximating full attention as a linear or sparse filter trades quadratic cost for linear cost at the price of fidelity, since the approximation discards exact pairwise interactions.

Returning to technique: randomized linear algebra

  1. Randomized linear algebra: Randomized SVD [@halko2011finding, @martinsson2020randomized], Nystrom approximation, CUR decomposition (Clarkson & Woodruff, 2017), and matrix completion (Candes & Recht, 2009). This is an algorithmic-technique family in the spirit of families 1 through 5; it appears last because random projections (family 1) supply its core primitive. These algorithms compute approximate matrix factorizations in roughly O(mn k) time for target rank k, where classical deterministic SVD costs O(mn min(m,n)), enabling efficient low-rank approximation at scale. Tradeoff: the number of sampled columns or oversampling factor trades runtime against the accuracy of the recovered factorization, since a smaller sketch is faster but captures less of the spectrum.

These families are deeply interconnected. For example, Performer (category 8) uses random features (category 1) to approximate the attention kernel, and Reformer (category 8) uses locality-sensitive hashing (category 3) to identify relevant key-value pairs. The signal processing perspective (categories 6-8) provides the theoretical framework for understanding why these approximations work and when they fail. Random projections (category 1) underlie both sketching algorithms (category 4) and randomized linear algebra (category 9), as the JL lemma guarantees that random projections preserve the geometric structure needed for downstream computation.

Comparing the families, a single axis recurs across every tradeoff above: each method spends a budget (projection dimension, sketch size, hash bits, retained modes, sampled columns) to buy a controlled approximation error, and the families differ mainly in which resource they conserve. The technique families (1 through 5, 9) trade memory or compute for accuracy on a concrete primitive, while the analytical-lens families (6 through 8) trade fidelity of the model approximation for tractability. The practical choice among them therefore reduces to which resource is scarcest in a given setting (memory for sketching and probabilistic structures, latency for hashing, model expressiveness for spectral and Transformer approximations) rather than to any single dominant method.


References