Posts

Showing posts from July, 2018

tl;dr TF-IDF

Image
Words that are highly specific and therefore descriptive of the larger context tend to be longer and less frequently occurring.  Alternatively, words that appear with significant frequency in a body of text surely bear some relevance to the article's topic. Following from this reasoning I've added a " hashtagSuggestions " method to my Authtools repo.  It can be tuned to the text by altering the desired percentile to use.  For frequently occurring text it finds words occurring in the upper percentile of the distribution.  For high information words, it finds words in the lower percentile.  The problem is that each text is different.  Size difference alone contributes to variability.  A haiku may not even have a large enough sample size to fit any distribution to. Enter tf-idf.  tf denotes Term Frequency,  the frequency of a word the text. It's higher for our frequently occurring words. idf  reduces this score for words tha...

Cache, Query, or Calculate

Image
Memoization, tabulation, even dynamic programming could be considered a kind of local memory caching.  A nice front end JS pattern is the interceptor in Angular for caching .  I use caching generously on the server side in most APIs I write.  Most ORMs have it built in and even offer second level caching .  Out on the web  CDNs and mirror sites are caches.  In fact caching has even more uses than it has names. Caches are useful in many places.  Real world application bottlenecks are almost always around I/O.  You can optimize your algorithm till the cows come home but if you haven't first looked at your I/O, you're improvements will likely be a drop in the ocean.  Often a quick and easy way to alleviate I/O bottlenecks is to slap on an  appropriate  cache. A particular problem where we see caching is in the classic Fibonacci problem.  The Fibonacci sequence is a pattern frequently occurring in nature, but in Computer Sc...

Cliche Detection with the Rabin-Karp Algorithm

Image
One of my favorite algorithms is the Rabin-Karp algorithm .  There are other string searching algorithm's that are more efficient, but this one builds naturally from understanding the hash datatype and is simple enough to quickly grasp and recall. txtHash = (d * (txtHash - ord (txt[left]) * h) + ord (txt[left + patternLen])) % q The genius of the Rabin-Karp is how it builds hashes of strings from prior hashes, removing the component that contributes to the portion of the window that's fallen off and adding the portion that's come into the window.  This reduces Big O complexity substantially.  Personally, I find parsimonious yet effective code like this aesthetically pleasing. As I understand, the algorithm is often used for plagiarism detection, which makes it a natural fit into the AuthTools library I've been building .  Rabin-Karp really shines when doing a simultaneous multi-pattern search.  One limitation however, is that it's a closed-end search...