Skip to main content

“Bloom Filter” to optimize expensive data lookups


Bloom Filter is very useful when you want to optimize the functionality with following constraints
  1. Data lookup is very expensive
  2. False positive is acceptable and
  3. False negative is un-acceptable

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set".wikipedia.org
 360px-Bloom_filter_speed.svg

It stores only value hash in it’s data structure, so it take very less memory against the very huge data.
For example if you have set of raw data with
  • 107 elements
  • 105 distinct value
  • and size ~40MB
Bloom filter will store same data as hash in just ~0.6MB. (ref.)
And because of the same reason it is quite fast.

Bloom Filter implementation
Guava library has bloom filter implementation
.Net implementation

Bloom Filter In Use
  1. Chrome uses Bloom filters to make a preliminary decision whether a particular web site is malicious or safe.
  2. Google Big table uses it to reduce disk lookups
  3. Squid Web Proxy Cache uses Bloom filters for cache digests
  4. Cassandra uses to save IO when performing a key lookup

Useful links
Bloom filter by example
Space-Efficient Bloom Filters
"More Optimal Bloom Filters," Ely Porat (Nov/2007) Google TechTalk video on YouTube



Comments

Popular posts from this blog

ERROR: Ignored call to 'alert()'. The document is sandboxed, and the 'allow-modals' keyword is not set.

Recently I found this issue while writing code snippet in "JSFiddle". And after searching, found this was happening because of new feature added in "Chrome 46+". But at the same time Chrome doesn't have support for "allow-modals" property in "sandbox" attribute. Chromium issue for above behavior: https://codereview.chromium.org/1126253007 To make it work you have to add "allow-scripts allow-modals" in "sandbox" attribute, and use "window.alert" instead of "alert". <!-- Sandbox frame will execute javascript and show modal dialogs --> <iframe sandbox="allow-scripts allow-modals" src="iframe.html"> </iframe> Feature added: Block modal dialog inside a sandboxed iframe. Link: https://www.chromestatus.com/feature/4747009953103872 Feature working Demo page: https://googlechrome.github.io/samples/block-modal-dialogs-sandboxed-iframe/index.html

Application Design Notes

Don’t be afraid to write your own code, but be absolutely sure you need to Don't reinvent the wheel Learn more about your libraries and take full advantage  Date time calculation is hard ( leap second ,  leap year ), use trusted library  js-joda ,  momentJs ,  joda (java) Simple is better than perfect (nearly) every time If you can deliver a sub-optimal solution (that solves the problem but has known limitation) in a week instead of a full featured one in a month DO, IT Simple system are Easy to reason about  Easy to debug Easy to refactor Easy to learn Simple doesn't mean you skip good engineering, but you can use duct tape. Build things the right way from the start, refactoring is hard and expensive Security Manage and store passwords securely Telemetry Common retrofitting "grunt work" Internationalization + localization Web Content Accessibility Factoring and styling HTML UI Adding unit test to an existing codebase LOG LOG LOG Log, but do it right We spend lot of t

How to store user password at server!!!

Trick is, you should never store user password… never ever. Now the real question is, then how to authenticate and authorize the user with password. And answer is when user enter the password, we should encrypt the password and store the hints. So next time when user enter the password we follow the same process and compare hints, if both hints are same then password is matched, else it is wrong password. Next question will be, what kind of hints, and how to generate these hints. In simple term hints are the obfuscated and fragmented form of user password. And very important part is hints generation process, which have to be collision resistant , means there will be very less possibility to find the data which generate same hints (like Cryptographic hashing functions ). Below is the simple checklist of password hashing and storing, which you should always keep in mind. PS You're Probably Storing Passwords Incorrectly Storing Passwords - done rig