tl;dr TF-IDF
Words that are highly specific, and therefore descriptive of the larger context, tend to be longer and to occur less frequently. Alternatively, words that appear with significant frequency in a body of text likely bear some relevance to the article's topic. Following from this reasoning I've added a "hashtagSuggestions" method to my Authtools repo. It can be tuned to the text by altering the percentile it uses. For frequently occurring words, it finds words in the upper percentile of the frequency distribution. For high-information words, it finds words in the lower percentile.

The problem is that each text is different. Size difference alone contributes to variability. A haiku may not even offer a large enough sample to fit any distribution.

Enter tf-idf. tf denotes Term Frequency, the frequency of a word in the text. It's higher for our frequently occurring words. idf reduces this score for words tha...
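
To make the scoring concrete, here is a minimal sketch of tf-idf in plain Python. It is not the code from the Authtools repo. It assumes the common textbook definitions, tf as a word's count divided by the document's length and idf as log(N / number of documents containing the word), which may differ from the exact variant this post has in mind. The names `tokenize`, `tf_idf`, and `top_terms` are hypothetical helpers made up for the illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tf_idf(documents):
    """Return one {word: score} dict per document.

    Assumed definitions (a common variant, not necessarily the author's):
      tf  = count of the word in the document / total words in the document
      idf = log(N / number of documents that contain the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Pick the k highest-scoring words, breaking ties alphabetically."""
    return sorted(score, key=lambda w: (-score[w], w))[:k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = tf_idf(docs)
print(top_terms(scores[0]))
print(scores[0]["the"])  # 0.0, since "the" appears in every document
```

Note that "the" is the most frequent word in the first document, yet it scores zero because it appears in all three. That is the correction the percentile approach lacks: the score depends on the rest of the collection, not only on the text at hand.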