Licks for Syllables - Entropy & NLP


When I was in high school we had a trend where anytime someone used a word of more than three syllables they'd get punched in the arm once for each syllable.  It was a physical manifestation of the general social pressure in high school not to stick out.  I learned to dumb myself down a little to get by.


At my first job I started to notice that I was frequently interrupted and talked over.  The problem continued until I was thrust into a sink-or-swim environment in the financial services industry, where my livelihood depended on effective communication.

I had the opportunity to ask my company's chief economist how to communicate effectively in an international but technical setting.  At times I'd continued to dumb down my English for the sake of non-native speakers but found myself getting interrupted again.  He offered readability scores as a way to practice finding the right balance; a rough sketch of one such score follows.
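
Here's a minimal sketch of the kind of score he suggested, using the well-known Flesch Reading Ease formula with a naive vowel-group syllable counter; the heuristic and the sample sentence are just stand-ins for illustration.

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of consecutive vowels, with a small
    # adjustment for a trailing silent 'e'.
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text):
    # Flesch Reading Ease = 206.835 - 1.015 * (words / sentences)
    #                               - 84.6  * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The quick brown fox jumps over the lazy dog."))
```

Higher scores mean easier reading; dedicated libraries such as textstat compute this and several related scores far more carefully.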

I fiddled with them a bit and noticed that I generally did better when I used more specific vocabulary.  I consciously worked to improve my vocabulary so that I could speak with greater precision.  In doing so I later realized that I'd overcome my problem.  People tended to interrupt less now than before.

Using more specific language allowed me not only to be more precise but also more concise.  I was saying more in less time and consequently getting interrupted less.  It occurred to me that this related to information entropy, the fundamental concept behind data compression.

For a closed system, in this case a sentence or text document, there are a finite number of possible states, determined by the number of characters and the number of values each character can take (ASCII, Unicode, etc.).  The longer the text or the larger the character set, the greater the number of possible states.

Building from this, consider a compression scheme whereby we reduce the size of a document in a manner that's reversible without any loss of data.  Such a compressed document can be said to contain the same amount of overall information in a less verbose format.  Thus each of its characters can be considered to carry more informational content than those of the original.

The pigeonhole principle tells us that to compress such a message, some content must shrink while other content must expand.  In fact compression succeeds by finding frequently occurring content and replacing it with something smaller, while replacing infrequent content with something larger.  If overall information stays fixed, frequently occurring content in the original must carry less information than infrequent content.  The sketch below makes this concrete.
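
Here's a small illustration of my own (not part of the original argument) comparing how zlib compresses a highly repetitive string against a varied one of the same length:

```python
import random
import string
import zlib

random.seed(0)
length = 2000

# Few distinct words, repeated many times.
repetitive = ("the cat sat on the mat " * 100)[:length]
# Roughly uniform noise over lowercase letters and spaces.
varied = "".join(random.choice(string.ascii_lowercase + " ") for _ in range(length))

for label, text in [("repetitive", repetitive), ("varied", varied)]:
    compressed = zlib.compress(text.encode("utf-8"))
    print(f"{label:>10}: {len(text)} chars -> {len(compressed)} bytes compressed")
```

The repetitive text, dominated by frequently recurring content, shrinks to a small fraction of its size; the varied text shrinks far less.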

We can observe the same phenomenon in English.  Conjunctives and what are called "stop words" occur perhaps most frequently but carry little meaning, while rare words, legalese for example, tend to be more specific.  Moreover, English for the most part naturally mimics a compression algorithm: conjunctives tend to be short and specific words long.  Perhaps the need to be concise built compression into the language over the years.  Supercalifragilisticexpialidocious, right?
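
A quick way to see the pattern is to tally word frequencies in a passage and print each word's count next to its length.  The legalese passage below is just a stand-in of my own; the effect is much clearer on a full document.

```python
from collections import Counter
import re

# A short stand-in passage heavy on both stop words and legalese.
text = """The defendant shall indemnify the plaintiff, and the plaintiff shall
accept the indemnification, notwithstanding any prior agreement between them."""

words = re.findall(r"[a-z']+", text.lower())
freq = Counter(words)

# Most frequent words first, each with its count and character length.
for word, count in freq.most_common():
    print(f"{count}x  {len(word):2d} letters  {word}")
```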

Information entropy measures the overall diversity of content in order to place a limit on the degree of compression that can be achieved.  In English class we learn that good prose is lexically diverse: overuse of the same vocabulary can cause ambiguity or a droning effect that leads to the reader losing interest.  Lexically diverse text is readable text, and there are already various established ways of measuring that diversity.  I've applied the concept of information entropy at the word level, calling it "word entropy," and it turns out to be a pretty good measure of lexical diversity.
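
In its simplest form, word entropy is just Shannon entropy computed over a text's word-frequency distribution.  The sketch below is a bare-bones version of the idea, not the exact implementation in my library.

```python
from collections import Counter
import math
import re

def word_entropy(text):
    # Shannon entropy, in bits, of the word distribution in the text.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(word_entropy("the cat sat on the mat"))        # repeats "the" -> lower entropy
print(word_entropy("a quick brown fox jumps over"))  # all distinct  -> higher entropy
```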

Word entropy is also an inverse measure of succinctness.  Returning to entropy in the context of thermodynamics, the third law states that the entropy of a perfect crystal, a frozen, organizationally compact state, approaches zero as temperature approaches absolute zero.  As heat is applied the system expands and entropy increases.  As an analogue, an optimally succinct statement of a concept would use the most specific language available and have a low word entropy score, while high-entropy text would express the same thing in a circumlocutory fashion.  It's like the difference between the crystallized carbon atoms of a diamond and the dispersed carbon atoms of a gas.  Again, English captures some vague notion of this with slang expressions like "cool" describing someone of few words and "full of hot air" describing someone who's loquacious.  Compare the styles of Hemingway and Melville and you can see this manifest in the difference in thickness of their books.

I'm working on a library that offers readability scores, lexical diversity scores, and even suggests synonyms for words the author may be overusing.  Within it I'm experimenting with combining per-word syllable count and word entropy as a means of describing a body of text through its most precise vocabulary; a rough sketch of that idea follows.
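
As a purely hypothetical illustration of that direction (not the library's actual scoring), one could weight each word's surprisal by its syllable count and keep the highest-scoring words as a text's characteristic vocabulary:

```python
from collections import Counter
import math
import re

def count_syllables(word):
    # Same naive vowel-group heuristic as in the readability sketch above.
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def characteristic_vocabulary(text, top_n=5):
    # Score each word by its surprisal (-log2 of relative frequency)
    # weighted by syllable count, then keep the highest-scoring words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    scores = {w: -math.log2(c / total) * count_syllables(w)
              for w, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sample = ("Entropy measures the diversity of vocabulary, and readability "
          "measures how easily that vocabulary is understood.")
print(characteristic_vocabulary(sample))
```

Weighting by syllables is just one stand-in for "precision"; word length or a reference frequency list could serve equally well.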

There are several mature tools out there that offer general functionality similar to my library's:



