Skip to main content


Showing posts from April, 2013

“Bloom Filter” to optimize expensive data lookups

Bloom Filter is very useful when you want to optimize the functionality with following constraints Data lookup is very expensive False positive is acceptable and False negative is un-acceptable A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". –   It stores only value hash in it’s data structure, so it take very less memory against the very huge data. For example if you have set of raw data with 10 7 elements 10 5 distinct value and size ~40MB Bloom filter will store same data as hash in just ~0.6MB. ( ref. ) And because of the same reason it is quite fast. Bloom Filter implementation Guava library has bloom filter implementation .Net implementation

"Stop Words" in full text search

Stop words are the words without significant information. You can consider these words as a filler words. If these words are used in search statement, then result could potentially contain all the record from data store.  It is always a good idea to filter out these words before constructing the search query. It will make the search query smaller, faster and result will be more relevant. “ One of our major performance optimizations for the “related questions” query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster. ” - English stop word list : Some interesting