Probabilistic Data Structures in Systems for AI
HyperLogLog for Deduplication
HyperLogLog (Flajolet et al., 2007) estimates the cardinality (number of distinct elements) of a multiset using O(log log n) bits per register. Each element is hashed; the first few bits of the hash select a register, and the register records the maximum number of leading zeros observed in the remaining bits -- longer runs of leading zeros indicate more distinct elements, and the estimator combines registers with a harmonic mean for accuracy. In AI data pipelines, HyperLogLog is used to estimate dataset diversity, count distinct documents or n-grams in pre-training corpora, and monitor vocabulary coverage during tokenizer training. The standard error is approximately 1.04 / sqrt(m), where m is the number of registers, so 2^14 registers (16 KB at one byte per register) achieve roughly 0.8% relative error. The related MinHash sketch (Broder, 1997) enables efficient estimation of Jaccard similarity between document sets, powering near-duplicate detection and deduplication pipelines that process trillions of tokens (Lee et al., 2022). These probabilistic data structures are essential building blocks for the data curation pipelines that underpin modern LLM training (Mitzenmacher & Upfal, 2017).
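The register-update and estimation steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation; the class and method names are invented for this example, and the small-range (linear counting) correction is included because it dominates at low fill rates.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not a library API)."""

    def __init__(self, p=14):
        self.p = p             # first p hash bits select a register
        self.m = 1 << p        # m = 2^p registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # 64-bit hash; any well-mixed hash function works here
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        idx = h >> (64 - self.p)                 # register index
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # rank = position of the leftmost 1-bit (leading zeros + 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # harmonic mean of 2^(-register), scaled by alpha_m * m^2
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction via linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=14 this matches the 16 KB / ~0.8% operating point quoted above, and adding the same element twice never changes the registers, which is what makes the sketch useful for counting distinct items.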
Reservoir Sampling for Data Selection
Vitter's reservoir sampling (Vitter, 1985) maintains a uniform random sample of size k from a stream of unknown length n, processing each element in O(1) time. The algorithm is beautifully simple: keep the first k elements; for the i-th element (i > k), include it in the reservoir with probability k/i, replacing a random existing element. The proof of correctness follows from the invariant that after processing i elements, each has probability k/i of being in the reservoir. In AI systems, reservoir sampling is used for:
- Maintaining replay buffers in continual learning (see Chapter 1), where balanced Experience Replay (ER; Chaudhry et al., 2019) uses reservoir sampling to maintain representative examples from all previous tasks
- Online data selection during pre-training, enabling uniform sampling from data streams of unknown length
- Constructing evaluation sets from streaming data without storing the full stream
- Fair sampling for bias evaluation, ensuring all subgroups are proportionally represented
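The keep-with-probability-k/i rule described above fits in a few lines. A sketch (the function name is illustrative):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R (Vitter, 1985): uniform sample of size k from a stream
    of unknown length, O(1) work per element."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # keep the first k elements
        else:
            j = rng.randrange(i)         # uniform in [0, i)
            if j < k:                    # i.e. with probability k/i
                reservoir[j] = item      # evict a uniformly random slot
    return reservoir
```

Note that the stream is consumed exactly once and never stored, which is what makes the technique suitable for the streaming use cases listed above.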
Weighted reservoir sampling (Efraimidis & Spirakis, 2006) extends the basic algorithm to non-uniform sampling, enabling prioritized experience replay where important examples are sampled with higher probability.
Consistent Hashing for Distributed Training
Consistent hashing (Karger et al., 1997) enables balanced distribution of data and computation across workers with minimal redistribution when workers are added or removed. When one of n workers joins or leaves, only about a 1/n fraction of the keys must be remapped, whereas standard modular hashing (hash(key) mod n) remaps nearly all keys. In distributed AI training, consistent hashing is used for data parallelism (assigning data shards to workers), expert assignment in MoE models (routing tokens to experts with balanced load), and distributed embedding tables (partitioning large embedding lookups across memory-constrained devices). Virtual nodes improve load balance further: with a single ring point per worker, some worker among k can end up owning an Ω(log k / k) fraction of the keyspace, while placing Θ(log k) virtual nodes per worker concentrates each worker's share near 1/k.
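A hash ring with virtual nodes can be sketched as follows; the class name, the `#`-suffixed virtual-node labels, and the choice of 64 virtual nodes are illustrative, not from any particular library. A key is owned by the first ring point clockwise from its hash.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, workers, vnodes=64):
        # ring: sorted (point, worker) pairs; each worker owns vnodes points
        self.ring = sorted(
            (self._hash64(f"{w}#{v}"), w)
            for w in workers
            for v in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash64(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        # first ring point clockwise from the key's hash, wrapping around
        i = bisect.bisect(self.points, self._hash64(key)) % len(self.points)
        return self.ring[i][1]
```

Adding a worker only captures keys that now fall behind the new worker's points; every other key keeps its previous owner, which is the O(1/n)-remapping property described above.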
References
- Andrei Z. Broder (1997). On the Resemblance and Containment of Documents. SEQUENCES.
- Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, Marc'Aurelio Ranzato (2019). On Tiny Episodic Memories in Continual Learning. arXiv.
- Pavlos S. Efraimidis, Paul G. Spirakis (2006). Weighted Random Sampling with a Reservoir. Information Processing Letters.
- Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier (2007). HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm. DMTCS Proceedings.
- David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, Daniel Lewin (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. STOC.
- Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini (2022). Deduplicating Training Data Makes Language Models Better. ACL.
- Michael Mitzenmacher, Eli Upfal (2017). Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press.
- Jeffrey S. Vitter (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software.