Cliche Detection with the Rabin-Karp Algorithm

One of my favorite algorithms is the Rabin-Karp algorithm.  There are other string searching algorithm's that are more efficient, but this one builds naturally from understanding the hash datatype and is simple enough to quickly grasp and recall.

txtHash = (d*(txtHash-ord(txt[left])*h) + ord(txt[left+patternLen]))%q

The genius of the Rabin-Karp is how it builds hashes of strings from prior hashes, removing the component that contributes to the portion of the window that's fallen off and adding the portion that's come into the window.  This reduces Big O complexity substantially.  Personally, I find parsimonious yet effective code like this aesthetically pleasing.

As I understand, the algorithm is often used for plagiarism detection, which makes it a natural fit into the AuthTools library I've been building.  Rabin-Karp really shines when doing a simultaneous multi-pattern search.  One limitation however, is that it's a closed-end search.  If you're looking for stolen prose for instance, it'd only detect an exact copy.  To extend this, I've built a wildcard search on top of the vanilla implementation to allow us to search not only for "...it was a dark and stormy night," but also for "it was a*and* night."  I also added a cliche detector to serve as a limited robo-editor.

The multi-pattern search version of the algorithm is typically implemented with a bloom filter.  I've heard the bloom filter used as a punchline in jokes about engineers being pedantic.  At risk of coming off like that myself I'll briefly talk about it with the goal of building intuition.

A bloom filter is a bit vector for detecting set membership.  The vector starts as all zeroes.  At insertion, several layers of hashing are applied to an entry which then in aggregate sets "one" bits in the vector.  Bits may never return to zero once set and entries cannot be removed.  Testing for membership involves applying the same hashing to input and comparing to the bit vector.  If you find any zeroes you know conclusively that your candidate is absent.  However, due to overlapping in hashes, "one" values don't necessarily prove membership.  The use of multiple hashes helps reduce the probability of false positives but not to 0%.  For this reason the bloom filter is called probabilistic.  It's also space-efficient because a bit vector is cheaper storage-wise than a hashmap of the same values.




Perhaps named for rosy disposition, as the bloom filter gets full it goes from being probabilistic to optimistic.  A bloom filter’s like a happy drunk.  As it gets full it gets increasingly jovial but less competent.  As more bits move to one, the bloom filter says no less and eventually declares any object present.
We can size the bit vector appropriately to counter this tendency.  A sumo wrestler can drink more than a ballerina.  It's the same with a bloom filter.  You should know how much booze you have ahead of time so you need to pick the right sized drinker to avoid getting knackered.  Here's some formulae to help us with that decision:

Optimal Bit Vector Size = [-n * ln(P)] / [(ln(2))^2]
  • n is the number of items
  • P is the rate of false positives
Optimal Rate of False Positives = 1/n
  • n is the number of items
Optimum number of hash functions = ln(2) * A/n
  • A is the bit vector size
  • n is the number of items
It may be helpful to imagine a bloom filter as the logical analogue of a key and lock.  The pins of the lock are like the bit vector, the springs' tension and length of each being the hash functions.  When inserted, the lock can tell you for sure when you have the wrong key but can't tell if yours is the original.

In Rabin-Karp, we hash all patterns we're looking for and add them to the bloom filter.  We then run a rolling hash over the text we're searching, check the the bloom filter, on hits check the hash values, and finally check the full string on matching hashes.

Comments

Popular posts from this blog

Engineering Truisms

The Telescoping Constructor (Anti-Pattern)

A Strategic Approach to Precision Process Automation