Sketching & Streaming
Sketching algorithms maintain compact, lossy summaries of data that support approximate query answering. In the streaming model, data arrives one element at a time and the algorithm must process it in a single pass with memory much smaller than the data size. These algorithms are directly relevant to AI systems that must process massive datasets (pre-training data), operate in online settings (continual learning), or compress communication (distributed training).
Count-Min Sketch and CountSketch
The Count-Min Sketch (Cormode and Muthukrishnan, 2005) (Cormode & Muthukrishnan, 2005) and CountSketch (Charikar et al., 2004) (Charikar et al., 2004) are streaming data structures that maintain approximate frequency counts in sub-linear space. A Count-Min Sketch uses d hash functions and a d x w array of counters; to insert an element, it increments d counters (one per hash function). To query, it returns the minimum of the d counter values, which overestimates the true frequency by at most epsilon * ||f||_1 with probability at least 1 - delta, using O((1/epsilon) * log(1/delta)) space. CountSketch uses signed hash functions to achieve unbiased estimates with tighter concentration.
These have been adapted for machine learning in several ways:
- Gradient compression: Ivkin et al. (2019) (Ivkin et al., 2019) used CountSketch for compressing gradients in distributed training, transmitting only a sketch instead of the full gradient vector. CountSketch is the natural choice here because it is linear (and thus mergeable across workers) and supports L2-norm estimation, properties that Count-Min Sketch does not provide. This reduces communication cost from O(d) to O(k * log(d/k)) per worker per iteration, enabling efficient distributed training of models with billions of parameters. We discuss this and other gradient-sketching methods in detail in "Applications to Distributed Training" below.
- Feature hashing: Weinberger et al. (2009) (Weinberger et al., 2009) used hashing to reduce feature dimensionality while preserving inner products, a technique widely used in large-scale linear models and recommendation systems. Feature hashing maps high-dimensional sparse features to a lower-dimensional space using a hash function, avoiding the need to maintain an explicit feature dictionary.
CountSketch for Optimizer-State Compression
A complementary use of CountSketch targets memory rather than communication. Spring and Shrivastava (2019) (Spring & Shrivastava, 2019) applied Count-Sketches to compress the auxiliary state that adaptive optimizers maintain, such as the per-parameter moment estimates in Adam. Instead of storing a full-size moment vector for every parameter, they store the moments in a Count-Sketch and recover approximate values via the median estimator when applying updates. This reduces the optimizer's memory footprint (the authors report roughly 25% savings on large models) without communicating sketches between workers and with minimal impact on convergence. The key insight is that the moment vectors are approximately sparse: a small fraction of entries carry most of the magnitude, making them amenable to sketching. For a vector of dimension d, a Count-Sketch of size O(k log d) captures the top-k entries with high probability.
Streaming PCA and Frequent Directions
Streaming PCA algorithms maintain a low-rank approximation of a data matrix seen one row at a time, without storing the full matrix. Oja's algorithm (Oja, 1982) provides the simplest approach: maintain a single vector that is updated toward the top eigenvector using a stochastic power method. Extending to the top-k eigenvectors, block power iteration processes data in a single pass, maintaining a k-dimensional subspace approximation.
Liberty (2013) (Liberty, 2013) proposed Frequent Directions, a deterministic streaming algorithm for matrix sketching that maintains a small l x d matrix B such that for any unit vector x, ||Ax||^2 - ||Bx||^2 >= 0 and ||Ax||^2 - ||Bx||^2 <= 2||A||_F^2 / l. Frequent Directions processes each row in O(ld) time and uses O(ld) space, independent of the number of rows. This makes it suitable for online learning settings where data arrives continuously and batch PCA is infeasible. In the context of continual learning, Frequent Directions can maintain an evolving low-rank representation of the data seen so far, enabling methods like GPM (Saha et al., 2021) to track the representation subspace without storing all previous data. A limitation of these streaming estimators is that they assume the leading subspace is reasonably stable: under abrupt distribution shift the maintained sketch lags the current data, and the rank budget l must be set conservatively to bound the worst-case error.
Reservoir Sampling
Reservoir sampling (Vitter, 1985) maintains a uniform random sample of fixed size k from a stream of unknown length n. Vitter's algorithm processes each element in O(1) amortized time: for the i-th element (i > k), it replaces a random element in the reservoir with probability k/i. After processing n elements, each element has exactly probability k/n of being in the sample, regardless of when it arrived. The proof of correctness is elegant. By induction, after processing i elements, each element is in the reservoir with probability k/i. When the (i+1)-th element arrives, it enters the reservoir with probability k/(i+1), and each existing element stays with probability 1 - (k/(i+1)) * (1/k) = i/(i+1), giving a final probability of (k/i) * (i/(i+1)) = k/(i+1), maintaining the invariant.
In AI systems, reservoir sampling serves several critical functions:
- Replay buffers in continual learning: Maintaining a representative sample of all data seen so far for experience replay, ensuring that the replay buffer does not become biased toward recent tasks (Chapter 1). ER (Experience Replay) (Chaudhry et al., 2019) uses reservoir sampling as its buffer management strategy, and the theoretical guarantee of uniform sampling is what ensures that the replay distribution approximates the true data distribution.
- Data selection during pre-training: Sampling representative subsets of massive pre-training corpora for monitoring training progress and constructing evaluation sets. When training on trillion-token corpora, reservoir sampling enables maintaining a fixed-size evaluation set that is representative of all data seen so far, without knowing the total corpus size in advance.
- Online evaluation: Maintaining a running evaluation set from streaming data without storing the full data stream.
- Fair sampling: Ensuring that evaluation and training samples are representative of the data seen so far (note that uniform reservoir sampling assumes a fixed underlying distribution and does not automatically correct for distribution shift).
- Weighted reservoir sampling: Efraimidis and Spirakis (2006) (Efraimidis & Spirakis, 2006) extended reservoir sampling to weighted streams, where each element has a priority weight and the reservoir maintains a weighted random sample. This is useful in AI for importance-weighted replay, where some examples are more informative than others and should be replayed more frequently.
Applications to Distributed Training
The sketching and streaming algorithms described above find their most impactful application in distributed deep learning training, where communication between workers is often the primary bottleneck. For a model with d parameters and N workers, naive synchronous SGD requires communicating N * d floats per iteration. For a model like GPT-3 with 175 billion parameters and hundreds of workers, this amounts to terabytes of communication per iteration.
Gradient sketching addresses this by compressing each gradient to a sketch of size O(k * log(d/k)), where k is the number of significant gradient entries. The key insight is that gradients are often approximately sparse, with a small fraction of entries carrying most of the magnitude. The canonical instantiation of this idea is Sketched-SGD (Ivkin et al., 2019) (Ivkin et al., 2019): sketch the gradient with a CountSketch, identify the heavy entries from the sketch, and communicate only those entries (with their positions). The authors show that this preserves the convergence rate of SGD while substantially reducing per-iteration communication.
This idea connects to a broader family of communication-efficient training methods. Sketching is one such mechanism; a distinct line of work uses error-feedback quantization, as in DeepSpeed's 1-bit Adam (Tang et al., 2021) (Tang et al., 2021) and 0/1 Adam (Lu et al., 2023) (Lu et al., 2023), which compress communicated gradients via low-bit quantization rather than via sketching while reporting reduced communication volume at comparable convergence speed. The theoretical foundation shared across these approaches, namely that compressed stochastic gradients can converge at the same rate as exact stochastic gradients when the compression error is controlled, connects directly to the concentration inequalities and sketching guarantees from Section 5.2.
References
- Moses Charikar, Kevin Chen, Martin Farach-Colton (2004). Finding Frequent Items in Data Streams. Theoretical Computer Science. ↗
- Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Robert Ajemian, Nicholas Piesco (2019). On Tiny Episodic Memories in Continual Learning. arXiv. ↗
- Graham Cormode, S. Muthukrishnan (2005). An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms. ↗
- Pavlos S. Efraimidis, Paul G. Spirakis (2006). Weighted Random Sampling with a Reservoir. Information Processing Letters. ↗
- Nikita Ivkin, Daniel Rothchild, Enayat Ullah (2019). Communication-efficient Distributed SGD with Sketching. NeurIPS. ↗
- Edo Liberty (2013). Simple and Deterministic Matrix Sketching. KDD. ↗
- Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He (2023). Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. International Conference on Learning Representations (ICLR). ↗
- Erkki Oja (1982). Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology. ↗
- Gobinda Saha, Isha Garg, Kaushik Roy (2021). Gradient Projection Memory for Continual Learning. ICLR. ↗
- Ryan Spring, Anshumali Shrivastava (2019). Compressing Gradient Optimizers via Count-Sketches. ICML. ↗
- Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He (2021). 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. International Conference on Machine Learning (ICML). ↗
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software. ↗
- Kilian Weinberger, Anirban Dasgupta, John Langford, Alexander Smola, Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. ICML. ↗