A new kid in town: Are the new Google Ngram frequencies better than the SUBTLEX word frequencies?
Several colleagues alerted us to the new Google Ngram Viewer. Given that 1-grams (n = 1) are simply word frequencies and given that these Google Ngrams are calculated on the basis of millions of books, wouldn’t these frequencies be much better than our SUBTLEX word frequencies, which are based on only some 50 million words?
The answer to this question largely depends on the type of texts used by Google Ngram Viewer. We found that above 30 million words, corpus register is much more important than corpus size for word frequencies. So the key question is what type of language was used to build the corpus.
There is only one way to test word frequencies for psycholinguistic research: by correlating them with lexical decision times for words. As a first analysis we correlated the Google Ngrams and the other estimates of word frequency with the lexical decision data for the 28.7K words from the British Lexicon Project (standardized RTs, which are the most reliable variable). For this analysis, we excluded the few hundred words that were not in Google Ngram (mostly because they were loan words with non-ASCII letters). In total we could use a matrix of 28,370 words (frequencies of 0 were given to the words that were not observed in the smaller corpora).
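A minimal sketch of how such a word matrix could be assembled. The function name, the dictionary layout and the toy counts are hypothetical and only serve as illustration; this is not necessarily how the original analysis was scripted.

```python
def build_frequency_matrix(words, ngram_counts, other_corpora):
    """Return {word: {corpus_name: count}} for the words usable in the analysis.

    Words missing from the Google Ngram counts are dropped; words missing
    from any of the other (smaller) corpora get a frequency of 0 there.
    """
    matrix = {}
    for word in words:
        if word not in ngram_counts:
            continue  # excluded: not observed in Google Ngram
        row = {"ngram": ngram_counts[word]}
        for corpus_name, counts in other_corpora.items():
            row[corpus_name] = counts.get(word, 0)  # 0 if unobserved
        matrix[word] = row
    return matrix


# Toy data, made up for illustration only.
toy_words = ["cat", "naïve", "zyx"]
toy_ngram = {"cat": 900, "zyx": 3}
toy_other = {"subtlex": {"cat": 40}}

matrix = build_frequency_matrix(toy_words, toy_ngram, toy_other)
print(matrix)
```

In this toy example the word with a non-ASCII letter is dropped because it has no Google Ngram count, while the word that is absent from the smaller corpus is kept with a count of 0.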
This was the outcome (all word frequencies were log transformed):
Correlation with SUBTLEX-US (51M words): r = -.635
Correlation with Google Ngram 1 English One Million (93.7B words): r = -.546
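For readers who want to run this type of analysis on their own data, here is a small sketch of the computation: log-transform the frequencies and compute a Pearson correlation with the standardized RTs. The helper names and the toy numbers are hypothetical, and the use of log10(count + 1) is an assumption made here so that words with a frequency of 0 remain defined; the exact transformation behind the values reported above may differ.

```python
import math


def log_frequency(count):
    # Assumption: add 1 before taking the log so that a count of 0 maps to 0.
    return math.log10(count + 1)


def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Toy data, made up for illustration: raw corpus counts and standardized RTs.
toy_counts = [0, 12, 150, 2300, 41000]
toy_zrt = [1.1, 0.6, 0.1, -0.5, -1.2]

r = pearson_r([log_frequency(c) for c in toy_counts], toy_zrt)
print(f"r = {r:.3f}")
```

The correlation is negative because high-frequency words are responded to faster, which is why all RT correlations in this post carry a minus sign.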
Further analyses indicated that the poor performance of the Google Ngram measure was due to (1) the use of old books, and (2) the fact that non-fiction books were included in the corpus. As our findings below show, the best Google Ngram frequency measure is based on the English Fiction corpus for the period 2000-2008:
Correlation with Google Ngram 1 English One Million restricted to the years 2000 and later (6.10B words): r = -.607
Correlation with Google Ngram 1 English Fiction all years (75.2B words): r = -.594
Correlation with Google Ngram 1 English Fiction years 2000 and later (24.2B words): r = -.635
All in all, three interesting findings:
Word frequencies become outdated after some time (hence the better performance of frequencies based on recent books than of frequencies based on all years).
The fiction corpus is better than the One Million corpus. This presumably has to do with the fact that the fiction corpus better approximates the type of language participants in psychology experiments have been exposed to.
Despite the much larger corpus size, the Google Ngram estimates are not better than the SUBTLEX measures of word frequency (actually, SUBTLEX-US explains 1% more of the variance). This agrees with our previous observation that size does not matter much above 30M words.
Article in which the Google Ngrams are announced:
Quantitative Analysis of Culture Using Millions of Digitized Books, by Jean-Baptiste Michel, Yuan K. Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez L. Aiden. Science, published online 16 December 2010.
I’ve also finished the analysis with the Elexicon word processing times (http://elexicon.wustl.edu/). Same story there, although for this dataset SUBTLEX-US explains 4-5% more of the variance than the best Google Ngram measure (English Fiction 2000+). Here are some correlations:
SUBTLEX – lexical decision (acc) : r = .402
Google One Million all – naming : r = -.370
Google One Million all – LDT (acc) : r = .376
Google Fiction 2000+ – naming : r = -.448
So, researchers can safely continue to use the SUBTLEX-US word frequencies. They are still the best in town. Alternatively, some gain in variance explained (up to 2%) seems to be possible by taking the mean of SUBTLEX word frequency and Google Ngram Fiction 2000+ (or another good measure of written frequency such as BNC for British English and HAL for American English).
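One possible way to compute such a combined measure is sketched below. Because the two corpora differ enormously in size, the sketch first converts raw counts to frequencies per million words, log-transforms them and then averages the two log values. This particular recipe (per-million scaling, log10 with +1, averaging after the log transform) is an assumption for illustration, as are the function names and the toy counts; only the corpus sizes are taken from the figures above.

```python
import math

SUBTLEX_US_SIZE = 51_000_000          # 51M words
NGRAM_FICTION_2000_SIZE = 24_200_000_000  # 24.2B words


def log_per_million(count, corpus_size):
    # Assumption: log10 of (frequency per million words + 1),
    # so that unobserved words get a value of 0.
    return math.log10(count * 1_000_000 / corpus_size + 1)


def combined_frequency(subtlex_count, ngram_count):
    """Mean of the two log frequencies (hypothetical combination rule)."""
    return (
        log_per_million(subtlex_count, SUBTLEX_US_SIZE)
        + log_per_million(ngram_count, NGRAM_FICTION_2000_SIZE)
    ) / 2


# Toy counts, made up: (SUBTLEX-US count, Google Fiction 2000+ count).
toy_counts = {"house": (26_000, 9_500_000), "gazebo": (60, 41_000)}

for word, (subtlex_count, ngram_count) in toy_counts.items():
    print(word, round(combined_frequency(subtlex_count, ngram_count), 3))
```

The combined value can then be correlated with the RTs in exactly the same way as the individual frequency measures.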
Our article on the usefulness of the Google frequencies for psycholinguistic research has now been published in Frontiers in Psychology. You can find it at the following address:
http://www.frontiersin.org/language_sciences/10.3389/fpsyg.2011.00027/abstract