Show HN: PDD – Probabilistic De-Duplication of Streams with Bloom Filters

snissn · on Dec 6, 2016

I've been running similar algorithms in production for around 4 years. One issue we've never solved wonderfully are unit tests on our bloom filter algorithms, we've used end to end tests to our satisfaction but not really tests of the bloom filter algorithms itself. Do you have some test / benchmark code that you could recommend for testing different bloom filter false positive rates?

mgunlogson · on Dec 6, 2016

Google's Guava Bloom implementation is very well tested. I suggest looking at their guava test GitHub repo. They even use pre-seeded and manually verified filters as controls as some responses suggested.

Or you could just use Guava's Bloom ;)

As for probabilistic testing of fp rate... The problem is that every once in a while a test will fail.

Disclaimer: I wrote a cuckoo filter library.

If you want to test fp rate check my cuckoo filter test at sanityOverFillFilter() in github.com/MGunlogson/CuckooFilter4J/blob/master/src/test/java/com/github/mgunlogson/cuckoofilter4j/TestCuckooFilter.java

The unit test basically does fuzzing, filling the filter repeatedly in the hopes that one day any errors will surface. The error bounds are pretty large but small enough to detect any egregious failures. Importantly, my filter defaults to a random seed. Guava DOES NOT, so any tests using the same items will be deterministic. The guava filters use this property to verify some filters that have been manually determined to be correct

jparkie · on Dec 6, 2016

I'm thinking of building a test suite which takes a distribution of distinct elements and run a test for FPP and FNP. Then it asserts whether the the actual FPP and FNP is within an epsilon of the calculated probability.

I want to add statistics for mine to hopefully be able to monitor them at least by some estimate.

orlp · on Dec 6, 2016

I would like to add to this that it's incredibly important you use fixed seed, or at least logged seed tests for this.

If you fail to do this you get unreproducable ghost tests that you can never investigate if they happen to fail rarely.

fatdog · on Dec 5, 2016

Fascinating. Just reading about probabilistic data structures in general, and would like to know if there is an efficient general method for generating statistics about the FN rates. Are they related to ROC curves?

jparkie · on Dec 6, 2016

I haven't looked into the relationship with ROC curves.

I'm not sure of an efficient general method for generating statistics. I know empirically testing the structures is the easy way. For Bloom Filters which evict old data, you can calculate the probability of FN by calculating the probability that a given element is a duplicate but reported as distinct.

throwawygybj · on Dec 6, 2016

Bloom and cuckoo filters are designed to have zero FN rate, at least classic ones. Caches and these filters are basically inversely related. One has no false negatives, the other no false positives.

In the vast majority of situations where false negatives are okay you're much better off just caching a hash of each object traditionally

contravariant · on Dec 5, 2016

This looks like a neat technique, but it's bothering me slightly that 'peekDistinct' returns false if the element you're testing is distinct.

jparkie · on Dec 5, 2016

My bad. README/Documentation error. The tests are fine though.

muzakthings · on Dec 6, 2016

Is this a library that does entity recognition and matching on individual elements in streams, or are the streams themselves being featurized/matched?

jparkie · on Dec 6, 2016

This library provide probabilistic set membership test on an individual element basis as the stream progresses with a fixed bound on memory. Every call to classifyDistinct() progresses the internal history of a ProbabilisticDeDuplicator.

Traditionally, probabilistic set membership tests were accomplished by using Bloom Filters. However, Bloom Filters only work well on finite sets because as Bloom Filters age (fill up), the false positive rate approaches 1. This issue was addressed by Stable Bloom Filters which frees up space for inserts; however, this introduces false negatives.

This library contributes three implementations of de-duplication algorithms centered around Bloom Filter variants whose design and replacement strategy allows it to reach stability faster than a Stable Bloom Filter while reducing the false FNR by even several orders of magnitude.

u42p37yYYRjvhD · on Dec 6, 2016

[flagged]

u42p37yYYRjvhD · on Dec 6, 2016

[flagged]

Kdparker · on Dec 6, 2016

Green name means relatively new user