
Introduction & Motivation

Deep learning is often presented as a purely data-driven enterprise -- feed enough data into a sufficiently large neural network and intelligence emerges. But beneath the surface, many of the field's most effective techniques are rooted in classical randomized algorithms, probabilistic data structures, and signal processing theory developed decades before the deep learning revolution. Understanding these connections is not merely an academic exercise -- it reveals principled ways to design more efficient, theoretically grounded, and interpretable AI systems [@mitzenmacher2017probability, @vershynin2018high, @woodruff2014sketching].

The intellectual roots of this intersection run deep. Randomized algorithms emerged as a major paradigm in theoretical computer science in the 1970s with Rabin's randomized primality test and the Schwartz-Zippel polynomial identity test, establishing the surprising power of random coin flips for solving computational problems. The probabilistic method, developed by Erdős and refined by Alon and Spencer (Alon & Spencer, 2016), showed that random constructions often exhibit properties that are hard to achieve deterministically. These ideas migrated into machine learning through multiple channels: random features for kernel approximation (Rahimi & Recht, 2007) bridged the gap between theoretical kernel methods and practical large-scale learning; randomized matrix decompositions (Halko et al., 2011) enabled PCA and SVD on datasets too large for classical algorithms; and locality-sensitive hashing (Indyk & Motwani, 1998) solved the high-dimensional similarity search problem that underpins modern retrieval systems.
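To make the random-features idea concrete, here is a minimal sketch (not from the original text) of the Rahimi-Recht construction for approximating an RBF kernel with NumPy; the dimensions, the bandwidth `gamma`, and the feature count `D` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 2048        # input dimension, number of random features
gamma = 0.05           # RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)

# Sample frequencies from the kernel's spectral density (a Gaussian for RBF)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(X):
    """Random Fourier feature map: phi(x) . phi(y) approximates k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = (phi(x[None]) @ phi(y[None]).T).item()
```

The inner product of the two random feature vectors concentrates around the exact kernel value, with error shrinking as O(1/sqrt(D)), which is why a few thousand features often suffice in practice.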

Randomized algorithms provide approximation guarantees that often match the practical performance of exact methods while being orders of magnitude cheaper. The Johnson-Lindenstrauss lemma tells us that high-dimensional data can be projected to much lower dimensions while preserving pairwise distances; locality-sensitive hashing enables sub-linear similarity search in billion-scale databases; sketching algorithms summarize streaming data in bounded memory with provable error bounds. These classical tools, developed in the algorithms and theoretical computer science communities, are now finding new life as critical components of modern AI systems -- from the random feature approximations that underlie efficient attention (Performer), to the hashing-based sparse computation that replaces dense matrix multiplications (SLIDE), to the reservoir sampling that manages replay buffers in continual learning [@achlioptas2003database, @liu2022random].
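The Johnson-Lindenstrauss guarantee mentioned above can be demonstrated in a few lines. This is an illustrative sketch, assuming NumPy; the point count, dimensions, and the Gaussian projection matrix are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 10_000, 1_000   # points, original dimension, projected dimension

X = rng.normal(size=(n, d))
# Gaussian random projection, scaled so squared norms are preserved in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Compare one pairwise distance before and after projection
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
distortion = proj / orig      # close to 1 with high probability
```

Note that the projection is data-oblivious: `R` is drawn without ever looking at `X`, yet it preserves all pairwise distances up to a small multiplicative factor, with the required `k` depending only logarithmically on `n`.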

Signal processing theory offers a complementary lens for understanding neural networks. The field originated with Shannon's information theory (1948) and the development of the Fast Fourier Transform by Cooley and Tukey (1965), which reduced the DFT from O(n^2) to O(n log n) and arguably remains the most impactful algorithm in scientific computing. Wavelets, developed by Daubechies, Mallat, and Meyer in the 1980s-90s [@daubechies1992ten, @mallat1989theory], extended Fourier analysis to provide simultaneous time-frequency localization -- a property that turns out to be deeply relevant to how neural networks process information at multiple scales. Transformers can be analyzed as learned adaptive filter banks; positional encodings are frequency representations closely related to Fourier bases; attention is a form of data-dependent filtering analogous to Wiener filtering in classical signal processing. The spectral theory of neural networks -- analyzing what frequencies networks learn, in what order, and why -- provides insights into generalization (why networks generalize despite overparameterization), optimization (why certain architectures train better), and architecture design (how to inject appropriate inductive biases) [@mallat2016understanding, @dong2021attention, @tancik2020fourier].
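The claim that positional encodings are frequency representations can be checked directly. A minimal sketch, assuming NumPy, of the sinusoidal encodings from Vaswani et al. (2017); the sequence length and model dimension are arbitrary choices:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Transformer-style positional encodings: a fixed bank of sinusoids
    whose frequencies form a geometric progression."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)   # even columns: sines
    pe[:, 1::2] = np.cos(pos * freqs)   # odd columns: cosines
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)

# Each column is a pure sinusoid, so its FFT magnitude concentrates at a
# single frequency bin -- the Fourier-basis view made literal.
spectrum = np.abs(np.fft.rfft(pe[:, 0]))
```

The geometric spacing of the frequencies gives the model access to position information at many scales simultaneously, much like a filter bank in classical signal processing.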

The importance of these connections is growing for four reasons:

  1. Scalability demands. As models grow to trillions of parameters and training data to trillions of tokens, exact computation becomes increasingly impractical. Randomized algorithms provide the mathematical toolkit for principled approximation -- replacing exact operations with approximate ones that have provable error bounds and dramatically lower computational cost.

  2. Efficiency through structure. Understanding the mathematical structure of neural network computations -- that attention matrices are low-rank, that gradients are approximately sparse, that embeddings have Gaussian-like distributions -- enables the design of specialized algorithms that exploit this structure. Randomized linear algebra, sketching, and hashing are the tools for exploiting these structural properties.

  3. Theoretical understanding. Random matrix theory, high-dimensional probability, and spectral analysis provide the mathematical language for understanding why neural networks work, when they fail, and how to improve them. These theoretical tools, developed for analyzing random structures, apply to neural networks because networks at initialization are random objects, and their behavior during training can be understood through the lens of random perturbation theory.

  4. Systems design. Building practical AI systems -- training pipelines, serving infrastructure, data processing -- requires the same algorithmic toolkit that underpins classical systems: hashing for indexing, sketching for streaming, sampling for data selection, and probabilistic data structures for membership testing and cardinality estimation. The performance of modern AI systems depends critically on these classical algorithmic components.
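As one concrete example of the sampling tools named above (and of the reservoir sampling used for replay buffers earlier in this chapter), here is a sketch of Vitter's Algorithm R using only the standard library; the stream and sample size are illustrative choices:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: maintain a uniform random sample of k items from a
    stream of unknown length, using O(k) memory and one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # item i survives with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10)
```

Every item in the stream ends up in the sample with probability exactly k/n, even though n is never known in advance -- the property that makes this the standard tool for bounded-memory replay buffers and streaming data selection.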

This chapter surveys these connections, organized into three broad themes: randomized algorithms applied to AI (random projections, hashing, sketching, randomized linear algebra), signal processing perspectives on neural networks (Fourier and spectral analysis, wavelets, attention as filtering), and randomized techniques in neural network training (dropout, stochastic depth, random initialization, data augmentation). Our goal is to bridge the conceptual gap between these classical fields and modern deep learning, providing practitioners with principled tools and researchers with theoretical perspectives that illuminate the foundations of AI efficiency.


References