Wednesday 4 January 2017

On How Many Words We Know


How many words does the average person know? This sounds like it should be an easy question, but it is actually very difficult to answer with any confidence. There are a lot of complications.

One complication is that it is not easy to say what it means 'to know' a word. Language users can often recognize many real words whose meaning they cannot explain. Does merely recognizing a word count as 'knowing' it? If not, if a word only counts when we know what it means, how do we decide what it means to 'know the meaning of a word'? As any university professor who has marked term papers will attest, many of us occasionally use words in ways that are not quite consistent with their actual meanings. In such cases, we think we know a word, but we don't really know what it means.

A second complication is that it is not totally obvious what we should count when we reckon vocabulary size. Although the word cats is a different word from the word cat, it might seem unreasonable to count both when we are counting vocabulary, since doing so essentially means that every noun will be counted twice, once in its singular and once in its plural form. The same problem arises for many other words. What about verbs? Should we count meow, meows, meowing, and meowed as four different words? What about catlike, catfight, catwoman, and cat-lover? Since English is so productive (it allows us to easily make up new words by glomming together existing words and morphemes, subparts of words like the pluralizing 's'), things get even more confusing when we start considering words that might not yet really exist but easily could if we wanted them to: catfest, catboy, catless, cattishness, catliest, catfree, and so on. A native English speaker will have no trouble understanding the following (totally untrue) sentence: I used to be so cattish that I held my own catfest, but now I am catfree.

The third complication is a little more subtle, and hangs on the meaning of the term 'the average person'. In an insightful paper published a couple of years ago (Ramscar, Hendrix, Shaoul, Milin, & Baayen, 2014), researchers from the University of Tübingen in Germany argued (among other things) that it was very difficult to measure an adult's vocabulary with any reliability. Assume, reasonably, that there are a number of words that more or less everyone of the same educational level and age knows. If we test people only on those words, those people will (by the assumption) all show the same vocabulary size. The problem arises when we go beyond that common vocabulary to see who has the largest vocabulary outside of that core set of words. Ramscar et al. argued (and demonstrated, with a computational simulation) that the additional (by definition, infrequent) words people would know on top of the common vocabulary are likely to be idiosyncratic, varying according to the particular interests and experiences of each individual. A non-physician musician might know many words that a non-musician physician does not, and vice versa. They wrote: "Because the way words are distributed in language means that most of an adult's vocabulary comprises low-frequency types ([...] highly familiar to some people; rare to unknown to others), [...] the assumption that one can infer an accurate estimate of the size of the vocabulary of an adult native speaker of a language from a small sample of the words that the person knows is mistaken". Essentially, the only fair way to assess the true vocabulary size of adults (i.e., of those who have mastered the common core vocabulary) would be to give a test that covered all of the possible idiosyncratic vocabularies, which is impossible since it would require vocabulary tests composed of tens of thousands of words, most of which would be unknown to any particular person.
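To make their point concrete, here is a toy simulation of my own (much cruder than the one in the Ramscar et al. paper, and with entirely made-up numbers). Every simulated person knows the same 20,000-word core plus 20,000 personal, randomly chosen rare words, so everyone's true vocabulary is exactly 40,000 words; a single 100-item test nevertheless hands back noticeably different estimates for people whose true vocabularies are identical in size.

    # Toy illustration of the Ramscar et al. point (not their actual simulation;
    # all sizes are invented): identical true vocabularies, scattered estimates.
    import random

    random.seed(1)

    CORE = set(range(20_000))            # words everyone knows
    RARE_POOL = range(20_000, 120_000)   # low-frequency words, known idiosyncratically

    def make_person(n_idiosyncratic=20_000):
        """Shared core plus a personal, random selection of rare words."""
        return CORE | set(random.sample(RARE_POOL, n_idiosyncratic))

    def estimate_vocab(person, test_items, universe=120_000):
        """Scale the proportion of test items the person knows up to the whole universe."""
        known = sum(item in person for item in test_items)
        return known / len(test_items) * universe

    test = random.sample(range(120_000), 100)   # one small test given to everyone
    for i in range(5):
        person = make_person()
        print(f"person {i}: true size = {len(person):,}, estimate = {estimate_vocab(person, test):,.0f}")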

So, is it just impossible to say how many words the average person knows? No. It is possible, as long as you define your terms and gather a lot of data. A recent paper (Brysbaert, Stevens, Mandera, & Keuleers, 2016) made a very careful assessment of vocabulary size. To address the first complication (What does it mean to know a word?), they used the ability to recognize a word as their criterion, asking many people (221,268 people, to be exact) to decide whether strings were words or nonwords. To address the second issue (What counts as a word?), they focused on lemmas, which are words in their 'citation form', essentially those that appear in a dictionary as headwords. A dictionary will list cat, but not cats; run, but not running; and so on. If this seems problematic to you, you are right. Brysbaert et al. mention (among other attempts to identify all English lemmas) Goulden et al.'s (1990) analysis of the 208,000 entries in the (1961) Webster's Third New International Dictionary. That analysis identified 54,000 lemmas as base words, 64,000 as derived words (variants of a base word that had their own entry), and 67,000 as compound words, but also found that 22,000 of the dictionary headwords were unclassifiable. Nevertheless, Brysbaert et al. settled on a lemma list of 61,800 entries. To address the third issue (What is an average person?), they presented results by age and education, which they were able to do because they had a huge sample.
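For the curious, the basic logic of scoring such a yes/no test can be sketched in a few lines. This is not necessarily the exact scoring Brysbaert et al. used; it just shows one common approach, in which the 'yes' rate on nonwords is used to correct for guessing before scaling up to the full lemma list, and the participant's numbers below are made up.

    # Sketch of one common way to score a yes/no vocabulary test (the exact scoring
    # in Brysbaert et al., 2016 may differ): correct for guessing by subtracting the
    # false-alarm rate on nonwords, then scale up to the 61,800-lemma list.
    def vocab_estimate(yes_to_words, n_words, yes_to_nonwords, n_nonwords,
                       lemma_list_size=61_800):
        hit_rate = yes_to_words / n_words
        false_alarm_rate = yes_to_nonwords / n_nonwords
        known = max(hit_rate - false_alarm_rate, 0.0)   # guessing-corrected proportion
        return known * lemma_list_size

    # Hypothetical participant: says "yes" to 54 of 70 words and to 3 of 30 nonwords.
    print(round(vocab_estimate(54, 70, 3, 30)))         # about 41,500 lemmas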

And so they were able to come up with what is almost certainly the best estimate to date of vocabulary size (drumroll please): "The median score of 20-year-olds is [...] 42,000 lemmas; that of 60-year-olds [...] 48,200 lemmas." They also note that this age discrepancy suggests that we learn on average one new lemma every 2 days between the ages of 20 and 60 years.
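The arithmetic behind that last figure is easy to check: 6,200 extra lemmas spread over the roughly 14,600 days between age 20 and age 60 comes out to one new lemma every couple of days.

    # Back-of-the-envelope check of the "one new lemma every 2 days" figure.
    lemmas_20, lemmas_60 = 42_000, 48_200    # median scores reported by Brysbaert et al.
    days = 40 * 365.25                       # days between age 20 and age 60
    print(days / (lemmas_60 - lemmas_20))    # about 2.4 days per new lemma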

As I hope the discussion above makes clear, 48,200 lemmas is not the same as 48,200 words, as the term is normally understood... Because they focused on lemmas specifically to address the problem of saying what a word is, Brysbaert et al. didn't speculate on how many words a person knows [1], where we define words as something like 'strings in attested use that are spelled differently'. My own informal and very rough guesstimate is that about 40% of words are lemmas, so we could multiply these lemma counts by about 2.5, and say that an average 20-year-old English speaker knows about 105,000 words and an average 60-year-old English speaker knows about 120,500 words...but now I am just muddying much clearer and more careful work.
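For what it's worth, the arithmetic behind my guesstimate is just this: if only about 40% of distinct word forms are lemmas, then the number of word forms is the number of lemmas divided by 0.4, i.e. multiplied by 2.5.

    # The rough lemma-to-word conversion used above: 40% lemmas implies a 2.5x multiplier.
    multiplier = 1 / 0.4                        # = 2.5
    for age, lemmas in [(20, 42_000), (60, 48_200)]:
        print(age, round(lemmas * multiplier))  # 20 -> 105000, 60 -> 120500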

[1] Update: After this was published to the blog, Marc Brysbaert properly chastised me for failing to note that their paper includes the sentence "Multiplying the number of lemmas by 1.7 gives a rough estimate of the total number of word types people understand in American English when inflections are included", with a reference to the Goulden, Nation, and Read (1990) paper. He also noted that this figure does not include proper nouns. Without boring you with the details of how I came to my estimate of the multiplier, I will note that my estimate was made on a corpus-based dictionary that included many proper nouns, so our estimates of how to go from lemmas to words are perhaps fairly close. My apologies to the authors for misrepresenting them on this point.

Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age. Frontiers in Psychology, 7.

Goulden, R., Nation, I. S. P., & Read, J. (1990). How large can a receptive vocabulary be? Applied Linguistics, 11, 341–363.

Ramscar, M., Hendrix, P., Shaoul, C., Milin, P., & Baayen, H. (2014). The myth of cognitive decline: Non‐linear dynamics of lifelong learning. Topics in Cognitive Science, 6(1), 5-42.
