A new kid in town: Are the new Google Ngram frequencies better than the SUBTLEX word frequencies?

Several colleagues alerted us to the new Google Ngram Viewer. Given that 1-grams are simply word frequencies and given that these Google Ngrams are calculated on the basis of millions of books, wouldn’t these frequencies be much better than our SUBTLEX word frequencies, which are based on only some 50 million words?

The answer to this question largely depends on the type of texts used by Google Ngram Viewer. We previously found that, above 30 million words, corpus register is much more important than corpus size for word frequencies. So the key question is what type of language the corpus is built from.

There is only one way to test word frequencies for psycholinguistic research: by correlating them with lexical decision times. As a first analysis we correlated the Google Ngrams and the other estimates of word frequency with the lexical decision times for the 28.7K words from the British Lexicon Project (standardized RTs, which are the most reliable variable). For this analysis, we excluded the few hundred words that were not in Google Ngram (mostly loan words with non-ASCII letters). In total we could use a matrix of 28,370 words (words that were not observed in the smaller corpora were given a frequency of 0).

This was the outcome (all word frequencies were log transformed):

Correlation with SUBTLEX-US (51M words): r = -.635

Correlation with Google Ngram 1 English One Million (93.7B words): r = -.546
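
Correlations of this kind could be computed along the following lines. This is only a minimal sketch, not the script we used: the function names and the toy data are made up for illustration, and adding 1 before taking the logarithm is an assumption about how zero frequencies can be kept in a log-transformed analysis.

```python
import math


def log_frequency(count):
    # Assumption: add 1 before log10 so that a count of 0 stays defined.
    return math.log10(count + 1)


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def frequency_rt_correlation(counts, zrts):
    """Correlate log frequency with standardized RT for every word in zrts.

    counts: dict mapping word -> corpus count (hypothetical input format)
    zrts:   dict mapping word -> standardized lexical decision time
    Words missing from counts are given a frequency of 0.
    """
    words = sorted(zrts)
    log_freqs = [log_frequency(counts.get(word, 0)) for word in words]
    rts = [zrts[word] for word in words]
    return pearson_r(log_freqs, rts)


# Toy numbers, invented for illustration only.
toy_counts = {"the": 1_000_000, "house": 50_000, "zebra": 900}
toy_zrts = {"the": -1.2, "house": -0.6, "zebra": 0.3, "quokka": 0.9}
r = frequency_rt_correlation(toy_counts, toy_zrts)
print(f"r = {r:.3f}")
```

The correlation is negative because words with higher frequencies are responded to faster.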

Further analyses indicated that the poor performance of the Google Ngram measure was due to (1) the use of old books, and (2) the fact that non-fiction books were included in the corpus. As our findings below show, the best Google Ngram frequency measure is based on the English Fiction corpus for the period 2000-2008:

Correlation with Google Ngram 1 English One Million restricted to the years 2000 and later (6.10B words): r = -.607

Correlation with Google Ngram 1 English Fiction all years (75.2B words): r = -.594

Correlation with Google Ngram 1 English Fiction years 2000 and later (24.2B words): r = -.635
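
The restriction to recent years amounts to summing the yearly counts of each word from 2000 onwards. The sketch below assumes tab-separated 1-gram lines in which the first three fields are the word, the year and the number of occurrences in that year; any further fields are ignored. The function name, the merging of upper and lower case, and the example lines are assumptions made for illustration, not a description of the actual files or of our processing.

```python
from collections import defaultdict


def counts_since(lines, first_year=2000):
    """Sum per-year counts for each word, keeping only years >= first_year.

    Assumed line format (tab-separated): word, year, count, [other fields].
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not have word, year and count
        word, year, count = fields[0], int(fields[1]), int(fields[2])
        if year >= first_year:
            # Assumption: case variants ("House", "house") are merged.
            totals[word.lower()] += count
    return dict(totals)


# Invented example lines, for illustration only.
toy_lines = [
    "house\t1999\t10\t9\t5",
    "house\t2000\t20\t18\t7",
    "House\t2005\t3\t3\t2",
    "zebra\t2008\t4\t4\t1",
]
recent = counts_since(toy_lines)
print(recent)
```

Calling `counts_since(toy_lines, first_year=0)` instead would give the counts over all years.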

All in all, three interesting findings:

  1. Word frequencies become outdated after some time (hence the better performance of frequencies based on recent books than of frequencies based on all years).

  2. The fiction corpus is better than the One Million corpus. This is presumably because the fiction corpus better approximates the type of language participants in psychology experiments have been exposed to.

  3. Despite the much larger corpus size, the Google Ngram estimates are not better than the SUBTLEX measures of word frequency (actually, SUBTLEX-US explains 1% more of the variance). This agrees with our previous observation that size does not matter much above 30M words.
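
The percentage of variance explained is the squared correlation. As a small worked example, the sketch below converts the correlations listed above into percentages; the dictionary labels and the function name are chosen here for illustration. The correlations are rounded to three decimals, so small differences such as the 1% mentioned in point 3 cannot be recovered from them.

```python
# Correlations with the British Lexicon Project zRTs, as reported above.
reported_r = {
    "SUBTLEX-US": -0.635,
    "Google One Million, all years": -0.546,
    "Google One Million, 2000+": -0.607,
    "Google Fiction, all years": -0.594,
    "Google Fiction, 2000+": -0.635,
}


def variance_explained(r):
    """Percentage of variance explained by a correlation r (100 * r squared)."""
    return 100.0 * r * r


for name, r_value in reported_r.items():
    print(f"{name}: {variance_explained(r_value):.1f}%")
```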

Article in which the Google Ngrams are announced:

Jean-Baptiste Michel, Yuan K. Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, Erez L. Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, published online 16 December 2010.

2 Comments to “A new kid in town: Are the new Google Ngram frequencies better than the SUBTLEX word frequencies?”

  1. marc 19 December 2010 at 19:41

    I’ve also finished the analysis with the word processing times from the English Lexicon Project (Elexicon, http://elexicon.wustl.edu/). Same story there, although SUBTLEX-US for this dataset explains 4-5% more of the variance than the best Google Ngram measure (English Fiction 2000+). Here are some correlations:

    1. SUBTLEX – naming (zRT) : r = -.519
    2. SUBTLEX – lexical decision (zRT) : r = -.632
    3. SUBTLEX – lexical decision (acc) : r = .402

    4. Google One Million all – naming : r = -.370
    5. Google One Million all – LDT (zRT) : r = -.519
    6. Google One Million all – LDT (acc) : r = .376

    7. Google Fiction 2000+ – naming : r = -.448
    8. Google Fiction 2000+ – LDT (zRT) : r = -.597
    9. Google Fiction 2000+ – LDT (acc) : r = .427

    So, researchers can safely continue to use the SUBTLEX-US word frequencies. They are still the best in town. Alternatively, some gain in variance explained (up to 2%) seems to be possible by taking the mean of SUBTLEX word frequency and Google Ngram Fiction 2000+ (or another good measure of written frequency such as BNC for British English and HAL for American English).

  2. marc 9 February 2011 at 10:08

    Our article on the usefulness of the Google frequencies for psycholinguistic research has now been published in Frontiers in Psychology. You can find it at:

    http://www.frontiersin.org/language_sciences/10.3389/fpsyg.2011.00027/abstract