Bloom filters
Sat, 04/28/2007 - 01:14 by Damien Leroy • Categories:
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Our work on Bloom filter is motivated by two observations. First, some false positives might be more troublesome than others, and these can be identified after the Bloom filter has been constructed, but before it is used. For instance, when IP addresses arise in measurements, it is not uncommon for some addresses to be encountered with much greater frequency than others. If such an address triggers a false positive, the performance detriment is greater than if a rarely encountered address does the same. If there were a way to remove them from the filter before use, the application would benefit. Second, the application can tolerate a low level of false negatives. It would benefit from being able to trade off the most troublesome false positives for some randomly introduced false negatives.
We proposed the retouched Bloom filters (RBF), an extension that makes the Bloom filter more flexible by permitting the removal of selected false positives at the expense of generating random false negatives.
In future work, we hope to demonstrate techniques to apply the RBF concept earlier in the construction of the filter. At present, we allow the Bloom filter to be built, and then remove the most troublesome false positives. It should be possible to avoid recording some of these false positives in the filter to begin with.
Short Bibliography
- H. B. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. in Communication of the ACM. 1970.
- L. Fan, P. Cao, J. Almeida and A. Z. Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. in Proc. ACM Sigcomm. 1998.
A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. in Proc. Allerton Conference. 2002. - M. Mitzenmacher. Compressed Bloom Filters. in Proc. 20th Annual ACM Symposium on Principles of Distributed Computing. 2001.
- B. Chazelle, J. Killian, R. Rubinfeld and A. Tal. The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Table in Proc. 15th ACM/SIAM Symposium on Discrete Algorithms (SODA). 2004.
- S.C. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. in Proc. IEEE Infocom. 2002.
- S. Cohen and Y. Matias. Spectral Bloom Filters. in Proc. ACM SIGMOD International Conferance on Management of Data. 2003.