Skip to main content

Bloom Filters & Learned Indices

The Bloom filter (Bloom, 1970) is one of the most elegant probabilistic data structures: it uses kk hash functions to map elements into a bit array of mm bits, supporting set membership queries with zero false negatives and a controllable false positive rate. Under the standard assumption that the hash functions are independent and map elements uniformly at random over the bit array, the false positive rate is approximately (1ekn/m)k(1 - e^{-kn/m})^k, where nn is the number of inserted elements. For optimal k=(m/n)ln2k = (m/n) \ln 2, this approximation reduces to (1/2)k(1/2)^k, which decreases exponentially with bits per element. This remarkable space-accuracy tradeoff (answering "is x in S?" using only about 10 bits per element for a 1% false positive rate, compared to storing full elements which may require hundreds of bytes) makes Bloom filters foundational in systems for AI, used in deduplication pipelines for LLM pre-training data, caching layers in retrieval systems, and distributed training coordination.

Counting and Deletable Bloom Filters

Standard Bloom filters do not support element deletion (clearing a bit might affect other elements that hash to the same position). Counting Bloom filters replace each bit with a counter, supporting both insertion and deletion at the cost of increased memory. Cuckoo filters (Fan et al., 2014) provide an alternative that supports deletion while using less memory than counting Bloom filters and achieving lower false positive rates. In AI systems, deletable Bloom filters are useful for maintaining dynamic data indices, for example tracking which documents have been processed in a continual learning pipeline where documents may be retracted or updated.

Learned Bloom Filters

Kraska et al. (2018) (Kraska et al., 2018) proposed replacing traditional Bloom filters with learned models that predict set membership. A learned Bloom filter trains a small neural network to predict whether an element is in the set, backed by a smaller traditional Bloom filter to handle false negatives. The learned model can exploit distributional properties of the data (e.g., URLs that look like spam, email addresses from known domains) that traditional Bloom filters ignore; on a phishing-URL membership task, the learned design reportedly achieved roughly a 36% reduction in memory at a 1% false positive rate. The key insight is that when the data has learnable structure, a model can use far fewer bits per element than a data-oblivious hash function. Mitzenmacher (Mitzenmacher, 2018) later formalized this design, giving a model for when learned Bloom filters help and showing that a "sandwiching" arrangement (a classical filter before and after the learned model) is provably optimal under the analysis.

Learned Index Structures

Kraska et al. (2018) (Kraska et al., 2018) introduced the idea of replacing traditional index structures (B-trees, hash maps, Bloom filters) with learned models. The key insight is that an index is fundamentally a function from keys to positions, and neural networks are universal function approximators. A learned B-tree replacement (using a hierarchy of neural networks) can outperform traditional B-trees when the data distribution has learnable structure; for example, if keys are approximately uniformly distributed, a simple linear model can serve as an index with O(1)O(1) lookup time. However, learned indices degrade under adversarial or irregular key distributions and typically require retraining when the underlying data changes. This work inspired a rich line of research on learned data structures, including the PGM-index (Ferragina & Vinciguerra, 2020) (which uses piecewise linear models for optimal space-time tradeoffs), learned hash functions, learned sorting, and learned query optimization. Subsequent designs such as ALEX (Ding et al., 2020) directly target the update problem, combining a learned layout with gapped storage and local model retraining so that the index stays accurate under insertions rather than requiring a full rebuild.

Applications in AI Systems

Bloom filters and learned data structures are increasingly used within AI systems:

  • Deduplication in pre-training: Approximate methods such as MinHash are used to detect near-duplicate documents in large pre-training corpora (Lee, 2022), and membership-style probabilistic structures like Bloom filters are a natural fit for tracking already-seen content at scale (Mitzenmacher & Upfal, 2017). Given the scale of modern pre-training data (trillions of tokens from web crawls), exact deduplication is infeasible, and probabilistic methods provide a practical approach.
  • Embedding table lookups: In recommendation systems with billions of items, hashing maps item IDs to a bounded embedding table; learned or compositional hashing schemes have been proposed as an alternative to random hashing for managing collisions in this regime.
  • Approximate nearest neighbor indices: ANN indices like FAISS (Johnson et al., 2021) and HNSW (Malkov & Yashunin, 2020) use probabilistic data structures internally for efficient candidate generation and filtering during similarity search.
  • Data selection in training: Bloom filters can track which training examples have been seen in online learning settings, in principle enabling curriculum learning and sampling strategies without storing the full dataset.
  • Routing in MoE models: Expert routing decisions in Mixture-of-Experts models can be viewed conceptually as learned hash functions that map inputs to experts, suggesting a connection between the learned data structures literature and efficient architecture design.

Across these applications, the choice between probabilistic and learned structures tracks a single distinction. When the workload is effectively data-oblivious, with keys or queries that carry no exploitable distributional regularity (adversarial inputs, near-uniform identifiers, or settings that demand worst-case guarantees), classical Bloom filters and their variants are preferred: their space-accuracy tradeoff is analyzable, distribution-free, and cheap to maintain. When the data instead has learnable structure that a model can capture, learned indices and learned Bloom filters can spend fewer bits per element by amortizing that structure, at the cost of training overhead and weaker guarantees when the distribution shifts. In practice the two are often combined, as in the learned Bloom filter that retains a small classical backup filter, so the probabilistic structure underwrites the guarantees while the learned component captures the savings.


References