“Bloom Filter” to optimize expensive data lookups
Bloom Filter is very useful when you want to optimize the functionality with following constraints
- Data lookup is very expensive
- False positive is acceptable and
- False negative is un-acceptable
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". – wikipedia.org
It stores only value hash in it’s data structure, so it take very less memory against the very huge data.
For example if you have set of raw data with
- 107 elements
- 105 distinct value
- and size ~40MB
And because of the same reason it is quite fast.
Bloom Filter implementation
Guava library has bloom filter implementation
Bloom Filter In Use
- Chrome uses Bloom filters to make a preliminary decision whether a particular web site is malicious or safe.
- Google Big table uses it to reduce disk lookups
- Squid Web Proxy Cache uses Bloom filters for cache digests
- Cassandra uses to save IO when performing a key lookup
Bloom filter by example
Space-Eﬃcient Bloom Filters
"More Optimal Bloom Filters," Ely Porat (Nov/2007) Google TechTalk video on YouTube