My colleagues and I published a paper a few months ago on a strange topic, which is: Why are some nonwords funny? (I blogged about this previously when I wrote about Schopenhauer's expectation violation theory of humor.) The study got a lot of press, in part because the paper included the graph above, which shows that Dr. Seuss's silly nonwords (like wumbus, skritz and yuzz-a-ma-tuzz) were predicted to be funny by our mathematical analysis. You can read about the study in many places (for example, in The Guardian and The Walrus) or, if you have access, get the original paper here.
A lot of people got confused about the way we measured how funny a nonword was, in part because we were loose about how we used the term 'entropy' in the paper (though very clear about what we had measured). Journalists understood that we had shown that nonwords were funnier to the extent that they were improbable, and that we had used a measure we called 'entropy', but most of them did not report that measure correctly. Most thought we had said that strings with higher entropy are funnier, which is incorrect. Here I explain what we actually measured and how it relates to entropy.
Shannon entropy was defined (by Shannon in this famous paper, by analogy to the meaning in physics of the term 'entropy') over a given signal or message. It is presented in that paper as a function of the probabilities of *all symbols across the entire signal*, i.e. across a set of symbols whose probabilities sum to 1. I italicize this because it emphasizes that entropy is properly defined over both rare and common symbols, by definition, because it is defined over all symbols in the signal.
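Written out, the quantity Shannon defined for a signal is H = -Σ p(x) log2 p(x), where the sum runs over every symbol x that can appear in the signal and the probabilities p(x) sum to 1.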
Under Claude Shannon’s definition, a signal like ‘AAAAAA’ (or, identically, ‘XXXXXX’) has the lowest possible entropy, while a signal like ‘ABCDEF’ (or, identically, ‘AABBCCDDEEFF’, which has identical symbol probabilities) has the highest possible entropy for a signal built from those six symbols. The idea, essentially, was to quantify information (a synonym for entropy in Shannon's terminology) in terms of unpredictability. A perfectly predictable message like ‘AAAAAA’ carries the least information, for the same reason you would hit me if I walked into your office and said “Hi! Hi! Hi! Hi! Hi! Hi!”. After I have said it once, you have my point (I am saying hello), and repeating it adds nothing more: it is uninformative.
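To make the ‘AAAAAA’ versus ‘ABCDEF’ contrast concrete, here is a quick Python sketch (purely illustrative, not anything from the paper) that computes a string's entropy from the relative frequencies of its own symbols:

```python
from collections import Counter
from math import log2

def shannon_entropy(signal: str) -> float:
    """Shannon entropy of a signal, in bits, computed from the
    relative frequencies of the symbols it actually contains."""
    counts = Counter(signal)
    total = len(signal)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

print(shannon_entropy("AAAAAA"))        # 0.0 bits: perfectly predictable
print(shannon_entropy("ABCDEF"))        # ~2.58 bits: six equiprobable symbols
print(shannon_entropy("AABBCCDDEEFF"))  # ~2.58 bits: same symbol probabilities
```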
So, Shannon entropy is defined across the signal that is the English language as a function of the probabilities of the 26 possible symbols, the letters A-Z (we can ignore punctuation and case; we could include them easily enough but they don’t change the general idea and played no role in our nonwords).
If we do the math (by summing -p(x)log2 p(x) over every letter in the alphabet, which is how Shannon entropy is defined), the entropy of English comes out to about 4.2 bits. What this means is that I could send any message in English using a binary string of length 5 for each letter: a fixed-length code has to use a whole number of bits per letter, so 4.2 rounds up to 5. This makes perfect sense if you know binary code: 2^5 = 32, which gives us more codes than we need for just 26 symbols. So concretely, A = 00000, B = 00001, C = 00010, and so on until we get to Z = 11001.
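Here is a rough Python sketch of that calculation (again, just illustrative), using one standard table of approximate English letter frequencies; the exact value depends on the corpus you count over, but it lands in the neighbourhood of 4.2 bits:

```python
from math import log2

# Approximate English letter frequencies (percent) from a standard
# frequency table; exact values vary a bit from corpus to corpus.
LETTER_FREQ = {
    'e': 12.70, 't': 9.06, 'a': 8.17, 'o': 7.51, 'i': 6.97, 'n': 6.75,
    's': 6.33, 'h': 6.09, 'r': 5.99, 'd': 4.25, 'l': 4.03, 'c': 2.78,
    'u': 2.76, 'm': 2.41, 'w': 2.36, 'f': 2.23, 'g': 2.02, 'y': 1.97,
    'p': 1.93, 'b': 1.49, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

# Normalize to probabilities that sum to 1, then sum -p(x) * log2 p(x).
total = sum(LETTER_FREQ.values())
letter_probs = {c: f / total for c, f in LETTER_FREQ.items()}
entropy = sum(-p * log2(p) for p in letter_probs.values())

print(f"Entropy of English letters: {entropy:.2f} bits")  # comes out near 4.2
```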
What we computed in our paper can be conceived of as the contribution of each nonword to this total entropy of the English language: that string's own -p(x)log2 p(x). In essence, we treated each nonword as one part of a very long signal that is the English language. This is indeed a measure of how unlikely a particular string is, but it is not entropy, because entropy is a measure of summed global unpredictability, not of local probability.
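As a toy illustration of what such a per-string term looks like (this is a deliberately simplified sketch, not the actual model from the paper: it reuses the approximate letter probabilities from above and treats letters as independent, ignoring their order), you can compute a string's probability and then its own -p log2 p term:

```python
from math import log2

# Same approximate letter frequencies as in the sketch above,
# normalized so the probabilities sum to 1.
FREQ = {'e': 12.70, 't': 9.06, 'a': 8.17, 'o': 7.51, 'i': 6.97, 'n': 6.75,
        's': 6.33, 'h': 6.09, 'r': 5.99, 'd': 4.25, 'l': 4.03, 'c': 2.78,
        'u': 2.76, 'm': 2.41, 'w': 2.36, 'f': 2.23, 'g': 2.02, 'y': 1.97,
        'p': 1.93, 'b': 1.49, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15,
        'q': 0.10, 'z': 0.07}
total = sum(FREQ.values())
letter_probs = {c: f / total for c, f in FREQ.items()}

def string_probability(s: str) -> float:
    """Probability of the string under a unigram letter model: the
    product of its letters' probabilities (treats letters as independent)."""
    p = 1.0
    for ch in s.lower():
        p *= letter_probs[ch]
    return p

def entropy_term(s: str) -> float:
    """The string's own -p * log2(p) term, read as its (tiny) contribution
    to an entropy-style sum over the whole language."""
    p = string_probability(s)
    return -p * log2(p)

for nonword in ["snunkoople", "wumbus", "skritz"]:
    p = string_probability(nonword)
    print(f"{nonword}: p = {p:.2e}, "
          f"-p*log2(p) = {entropy_term(nonword):.2e}, "
          f"surprisal -log2(p) = {-log2(p):.1f} bits")
```

The exact numbers do not matter here; the point is simply that the rarer a string's letters, the smaller its probability and the larger its surprisal, and that this is a property of one local stretch of the signal rather than of the whole signal.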
Think of it this way: If I am speaking and I say 'I love the cat, I love the dog, and I love snunkoople’, you will be struck by snunkoople because it is surprising, which is a synonym for unexpected. We quantified how unexpected each nonword was (the local probability of that part of the signal), in the context of a signal that is English as she is spoken (or written).
Our main finding was that the less likely a nonword is to be a word of English (basically, the lower the total probability of the letters it contains), the funnier it is. This is not just showing that 'weird strings are funny', but something more interesting than that: that strings are funny to the extent that they are weird.
There is an interesting implicit corollary (not discussed in the paper), which is that we are the kind of creatures that use emotion to make probability judgments. Our feelings about how funny a nonword string is are correlated with the probability of that string. If you think about that, it may seem deeply weird, but I think it is not so weird. One of the main functions of emotion is to alert us embodied creatures to unusual, dangerous, or unpredictable aspects of the world that might harm us. Unusualness and unpredictability are statistical concepts, since they are defined by exceptions to the norm. So it makes good sense that emotion and probability estimation would be linked for embodied creatures.