How to determine whether one frequency measure is better than the other?

In our research comparing various frequency measures, we usually look at the correlations between the frequency measures and word processing times (e.g., lexical decision times) and we go for the frequency measure with the highest correlation. However, increasingly reviewers (and editors) request to see a p-value when we recommend one frequency measure over another.

As long as we are dealing with megastudy data comprising tens of thousands of observations, there is little point in testing whether the difference between measures is statistically significant, because differences as small as .02 are likely to be statistically “significant” (p < .05!). However, when we only have small-scale studies at our disposal, things are different, and reviewers are right to ask for statistical confirmation.

    Hotelling-Williams test for dependent correlations

The test recommended for differences in correlations that are themselves intercorrelated (as is the case for various frequency measures) is the Hotelling-Williams test (Steiger, 1980). You can find the test in several R packages, but it is reasonably simple to implement yourself. The figure shows the equation you need. For instance, suppose the SUBTLEX log frequency correlates .75 with 240 lexical decision times, the Celex log frequency correlates .69 with them, and the two log frequency measures correlate .84 with each other. Then r12 = .75, r13 = .69, r23 = .84, N = 240, and the test gives t = 2.4934, df = 237, p = .0133. You can find an Excel file here that does the calculations for you.
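For readers who prefer code to a spreadsheet, here is a minimal Python sketch of the test. It assumes the equation is Williams's t as given in Steiger (1980), with df = N - 3. The function names are ours, and the p-value comes from a simple numerical integration of the t density so that nothing beyond the standard library is needed.

```python
import math


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def hotelling_williams(r12, r13, r23, n):
    """Williams's t for H0: rho12 == rho13, where variable 1 is shared.

    r12, r13: correlations of the two predictors with the criterion
    r23:      correlation between the two predictors
    n:        number of observations
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # |R|
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    df = n - 3
    return t, df, t_two_sided_p(t, df)


# The example from the text: SUBTLEX vs. Celex on 240 lexical decision times.
t, df, p = hotelling_williams(0.75, 0.69, 0.84, 240)
print(f"t = {t:.4f}, df = {df}, p = {p:.4f}")
```

Run on the example above, this should reproduce the values reported in the text (t of about 2.49 with 237 degrees of freedom, p of about .013).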

    The Vuong-test and Clarke-test for non-nested models

The Hotelling-Williams test is fine as long as you are dealing with simple correlations. This is a limitation in frequency research because the relationship between word processing times and log frequency is not linear, but levels off at high word frequencies. We capture this aspect by running nonlinear regression analyses (either with polynomials or restricted cubic splines). Then, we have R²-values rather than r-values. For instance, for the above data we would have something like R² = .59 for the SUBTLEX log frequencies, and R² = .51 for the Celex log frequencies (i.e., a few percent above the squared values of the linear correlations). Is the difference between them still significant?

The test usually recommended here is the Vuong test (Vuong, 1989). It is based on a comparison of the log-likelihoods of the two models. The calculations are rather complicated, but the test is available in several R packages, such as games, pscl, spatcounts, or ZIGP (be careful: some require the models to be estimated with the glm function, others with the lm function). Clarke (2007) reviewed the Vuong test and found it to be conservative for small N. That is, the test is less likely to yield statistical significance than is warranted. Clarke (2007) proposed an alternative nonparametric test that is claimed not to be conservative.
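To make the logic of the two tests concrete, here is a rough Python sketch for the simplest case: two linear regressions with one predictor each, fitted to the same outcome. It follows the textbook definitions (Vuong: a z-test on the mean of the per-observation log-likelihood differences; Clarke: a sign test on those differences) and is not a reimplementation of any of the R packages mentioned above. Because both models have the same number of parameters, no correction for model complexity is applied. The function names and the toy data are ours.

```python
import math
import random


def pointwise_loglik(y, x):
    """Per-observation Gaussian log-likelihoods of the OLS fit y ~ a + b*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / n  # maximum-likelihood variance
    return [-0.5 * math.log(2 * math.pi * sigma2) - e * e / (2 * sigma2)
            for e in resid]


def vuong_test(ll1, ll2):
    """Vuong z statistic and two-sided normal p-value (equal parameter counts)."""
    d = [u - v for u, v in zip(ll1, ll2)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / n)
    z = math.sqrt(n) * mean_d / sd_d
    return z, math.erfc(abs(z) / math.sqrt(2))


def clarke_test(ll1, ll2):
    """Clarke's sign test: observations favoring model 1, exact binomial p-value."""
    d = [u - v for u, v in zip(ll1, ll2)]
    n = len(d)
    b = sum(di > 0 for di in d)

    def upper_tail(k):  # P(X >= k) for X ~ Binomial(n, 0.5)
        return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

    return b, min(1.0, 2 * min(upper_tail(b), upper_tail(n - b)))


# Toy data (hypothetical): y depends on x1 only, so model 1 should win clearly.
rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]
y = [v + 0.5 * rng.gauss(0, 1) for v in x1]
ll1, ll2 = pointwise_loglik(y, x1), pointwise_loglik(y, x2)
print("Vuong (z, p):", vuong_test(ll1, ll2))
print("Clarke (B, p):", clarke_test(ll1, ll2))
```

A positive z (or a count B above N/2) favors the first model, a negative z (or a count below N/2) the second.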

To test the usefulness of the Vuong and Clarke tests for word frequency research, we ran Monte Carlo simulations of likely scenarios. Each simulation was based on 10K datasets. Per dataset we generated normally distributed variables X, Y, and Z that had the following theoretical intercorrelations (these were the same between all three variables): .0, .2, .4, or .6. We additionally varied the number of data triplets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, or 5120. For each set, we calculated the obtained intercorrelations between the variables and tested whether the correlation between X and Y was significantly different from the correlation between X and Z according to the Hotelling-Williams test, the Vuong test, and the Clarke test. For the sake of simplicity, we only present the percentage of tests for which p < .05 and p < .10.
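The design of such a simulation can be sketched in a few lines of Python. The sketch below covers only the Hotelling-Williams test and makes several assumptions of its own: the equal intercorrelations are induced with a shared random factor (how the variables were generated is not described above), the p-value is obtained by numerical integration, and far fewer than 10K datasets are used to keep the run short. All function names are ours.

```python
import math
import random


def t_two_sided_p(t, df, steps=200):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    pdf = lambda x: math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))
    upper = abs(t)
    h = upper / steps
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3)


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def williams_p(r12, r13, r23, n):
    """Two-sided p-value of Williams's t for H0: rho12 == rho13."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return t_two_sided_p(t, n - 3)


def rejection_rate(rho, n, n_datasets, alpha=0.05, seed=0):
    """Share of datasets in which r(X,Y) and r(X,Z) differ at level alpha."""
    rng = random.Random(seed)
    load, uniq = math.sqrt(rho), math.sqrt(1 - rho)
    hits = 0
    for _ in range(n_datasets):
        rows = []
        for _ in range(n):
            c = rng.gauss(0, 1)  # shared factor: every pair correlates rho
            rows.append([load * c + uniq * rng.gauss(0, 1) for _ in range(3)])
        x, y, z = zip(*rows)
        p = williams_p(pearson(x, y), pearson(x, z), pearson(y, z), n)
        hits += p < alpha
    return hits / n_datasets


# One cell of the design: intercorrelation .4, 40 triplets, 2000 datasets.
print(rejection_rate(rho=0.4, n=40, n_datasets=2000))
```

Because the two population correlations are equal by construction, every significant result is a false alarm, so the printed proportion estimates the Type I error rate of the test for that cell.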

If the test works well, we expect 5% of the tests to be significant at the .05 level and 10% of the tests to be significant at the .10 level (given that both correlations were generated with the same algorithm and, hence, were assumed to be equivalent at the population level). This was exactly what we obtained with the Hotelling-Williams test, as you can see here. In line with Clarke’s observations, the Vuong test was conservative. Surprisingly, this was not the case for the smallest sample sizes (N = 10), nor when the variables were intercorrelated (as is the case for frequency measures). The Vuong test was particularly conservative when the theoretical correlations between X, Y, and Z were 0. Certainly for correlations of .4 and .6, the Vuong test was no longer conservative.

In contrast, Clarke’s test was way too liberal, in particular for large sample sizes and intercorrelated variables. In the worst cases, it returned more than 50% significance for a situation in which no differences in correlations were expected. Hence, there is not much you can conclude from a significant Clarke test for the question we are addressing (unless you want to impress reviewers and editors without statistical sophistication who insist on seeing “reassuring” p-values).

Thus far we have only used the Vuong and Clarke tests in situations in which the better Hotelling-Williams test applies as well. As indicated above, we need the Vuong or Clarke test more for situations in which more complicated models are compared to each other. Therefore, we checked how well these tests would perform when instead of linear regression we used restricted cubic splines with 3 knots (which allows you to capture the floor effect at high word frequencies). For comparison purposes we also calculated the Hotelling-Williams test on the correlations. The results were reassuring: The introduction of nonlinear regression did not lead to an unwarranted increase in significant tests, as you can see here.
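For readers unfamiliar with restricted cubic splines: with 3 knots, the regression has just two predictors, the original variable and one nonlinear term that is cubic between the knots but forced to be linear beyond the last knot. The Python sketch below builds that term using the standard truncated-power formula, divided by the squared distance between the outer knots (a commonly used scaling). The knot positions are hypothetical and the function name is ours.

```python
def rcs_basis_3(x, knots):
    """Restricted cubic spline terms for 3 knots: (linear term, nonlinear term)."""
    t1, t2, t3 = knots

    def cube(u):  # truncated power function (u)_+^3
        return max(u, 0.0) ** 3

    nonlinear = (cube(x - t1)
                 - cube(x - t2) * (t3 - t1) / (t3 - t2)
                 + cube(x - t3) * (t2 - t1) / (t3 - t2)) / (t3 - t1) ** 2
    return x, nonlinear


# Hypothetical knots on a log10 frequency scale.
knots = (1.0, 2.5, 4.0)
for log_freq in (0.5, 1.5, 2.5, 3.5, 5.0, 6.0, 7.0):
    print(log_freq, rcs_basis_3(log_freq, knots))
```

The nonlinear term is zero below the first knot and increases in equal steps beyond the last knot, so the fitted curve can bend in the middle of the frequency range while remaining linear in the tails.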

All in all, the Hotelling-Williams test is the best test for comparing dependent correlations. The Vuong test is a good alternative, unless there is very little correlation between the variables. The Clarke test is less useful for our purposes, because it will often return significance when this is not warranted.

Clarke, K.A. (2007). A simple distribution-free test for nonnested model selection. Political Analysis, 15, 347-363.

Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245-251.

Vuong, Q.H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.

German SUBTLEX-DE word frequencies available now

Together with colleagues from Münster and Berlin we have collected and validated subtitle-based word frequencies for German. As in other languages, the SUBTLEX-DE word frequencies explain more variance in lexical decision times than the other available word frequency measures, including CELEX, Leipzig, dlexDB, and Google Ngram=1. You can find our manuscript on the SUBTLEX-DE word frequencies here (Brysbaert et al., 2011) and easy-to-use files with the frequencies here.

Here you find a demo on how to easily enter SUBTLEX-DE values into your stimulus Excel file.

In collaboration with colleagues from Münster and Berlin, we have collected and validated word frequencies for German. The data were drawn from film subtitles. As in other languages, the SUBTLEX-DE word frequencies explain more variance in lexical decision times than other word frequency measures such as CELEX, Leipzig, dlexDB, and Google Ngram=1. The article on the SUBTLEX-DE word frequencies can be downloaded here (Brysbaert et al., 2011); easy-to-use files containing the word frequencies are available here.


  • Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Test your vocabulary

Stephane Dufau of the University of Marseille has developed an app that lets people with an iPod, iPhone, or iPad test their vocabulary. The test consists of a lexical decision task in which words and nonwords are presented in blocks of 50 (each block takes a few minutes). Afterwards you get feedback on your performance.

UPDATE: We took part in this study from 4 February 2011 to 16 March 2012. After that, together with Dutch broadcasters, we set up a much more ambitious and informative megastudy on knowledge of Dutch words. You can find more information here.

Have fun!

A new kid in town: Are the new Google Ngram frequencies better than the SUBTLEX word frequencies?

We got alerted by several colleagues to the new Google Ngram Viewer. Given that Ngram=1 equals word frequency and given that these Google Ngrams are calculated on the basis of millions of books, wouldn’t these frequencies be much better than our SUBTLEX word frequencies, based on some 50 million words only?

The answer to this question largely depends on the type of texts used by Google Ngram Viewer. We found that above 30 million words, corpus register is much more important than corpus size for word frequencies. What type of language is used to build the corpus?

There is only one way to test word frequencies for psycholinguistic research: by correlating them with lexical decision times. As a first analysis we correlated the Google Ngrams and the other estimates of word frequency with the lexical decision data for the 28.7K words from the British Lexicon Project (standardized RTs, which are the most reliable variable). For this analysis, we excluded the few hundred words that were not in Google Ngram (mostly because they were loan words with non-ASCII letters). In total we could use a matrix of 28,370 words (frequencies of 0 were given to the words that were not observed in the smaller corpora).
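The analysis described above boils down to log-transforming the frequency counts and correlating them with the standardized RTs. Here is a minimal Python sketch with made-up numbers. The +1 offset before taking the logarithm is an assumption on our part (it is one common way to handle words with a frequency of 0), and the toy counts and RTs are purely illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def log_frequency(count):
    """log10(count + 1); the +1 keeps unobserved words (count 0) defined."""
    return math.log10(count + 1)


# Hypothetical corpus counts and standardized lexical decision times for six words.
counts = [0, 3, 12, 150, 2400, 51000]
z_rts = [0.9, 0.6, 0.3, -0.1, -0.5, -0.8]

r = pearson([log_frequency(c) for c in counts], z_rts)
print(f"r = {r:.3f}, variance explained = {r * r:.1%}")
```

The correlation is negative because more frequent words are responded to faster; the larger its absolute value, the better the frequency measure.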

This was the outcome (all word frequencies were log transformed):

Correlation with SUBTLEX-US (51M words): r = -.635

Correlation with Google Ngram 1 English One Million (93.7B words): r = -.546

Further analyses indicated that the poor performance of the Google Ngram measure was due to (1) the use of old books, and (2) the fact that non-fiction books were included in the corpus. As our findings below show, the best Google Ngram frequency measure is based on the English Fiction corpus for the period 2000-2008:

Correlation with Google Ngram 1 English One Million restricted to the years 2000 and later (6.10B words): r = -.607

Correlation with Google Ngram 1 English Fiction all years (75.2B words): r = -.594

Correlation with Google Ngram 1 English Fiction years 2000 and later (24.2B words): r = -.635

All in all, three interesting findings:

  1. Word frequencies become outdated after some time (hence the better performance of recent word frequencies than of word frequencies based on all years).

  2. The fiction corpus is better than the One Million corpus. This presumably has to do with the fact that the fiction corpus better approximates the type of language participants in psychology experiments have been exposed to.

  3. Despite the much larger corpus size, the Google Ngram estimates are not better than the SUBTLEX measures of word frequency (actually, SUBTLEX-US explains 1% more of the variance). This agrees with our previous observation that size does not matter much above 30M words.

Article in which the Google Ngrams are announced:

Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., & Aiden, E.L. (2010). Quantitative analysis of culture using millions of digitized books. Science, published online 16 December 2010.

Wuggy best article of BRM 2010

Emmanuel Keuleers won the award for the best article in Behavior Research Methods 2010. His Wuggy pseudoword generator allows researchers to select the best possible nonwords for lexical decision experiments. This can be done in different languages and the program is so user-friendly that everyone can use it. The award will be presented at the next Annual Meeting of the Psychonomic Society in St. Louis, Missouri (November 18-21, 2010). Emmanuel will be there.

We contributed to a video about dyslexia

Ghent University (UGent) has made a video about students with dyslexia who completed their university studies and are now building interesting careers. They talk about the obstacles they faced as students and how they overcame them. A must for everyone!

Watch the video here.

And here you can find a brochure that accompanies the video.