Subtitle word frequencies for Spanish: SUBTLEX-ESP

Word frequency norms based on film subtitles have been shown to be better than word frequencies based on books and newspapers, because they are more representative of everyday language use. In all the languages we have tested, word frequencies based on a corpus of 40 million words from film subtitles predict more variance in word recognition times than word frequencies based on much larger written corpora.

Here you can find the word frequencies for Spanish. Full information about how the database was compiled can be found in our article (Cuetos et al., 2011).

You can find an Excel file with SUBTLEX-ESP here.

Here you can find a demo of how to easily enter the SUBTLEX frequencies into your stimulus Excel file.

Shortly after the publication of the list, it was brought to our attention that there were some copy errors in the original list of SUBTLEX-ESP, mainly involving non-ASCII characters. In addition, some words had two entries.

These problems became apparent in an article by Alonso, Fernandez, and Diez (2011) on oral frequency norms for Spanish words. Although SUBTLEX-ESP did reasonably well, its performance was below what we had expected.

We believe we have now corrected all errors. The corrected version has 44,374 words in common with Alonso et al. (instead of 42,609). The correlation with the oral frequencies is now .72 (was .67). R² for the naming times of Cuetos & Barbon (2006) is now .308 (was .290); R² for the picture naming times from Cuetos, Ellis & Alvarez (1999) is .118 (was .033). There are no changes to the analyses reported by Cuetos et al. (2011).

To make sure you are using the correct version of SUBTLEX-ESP, check the following words:

  • cenar [dine]: should have a frequency count of 3721
  • verdad [truth]: should have a frequency count of 54203
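
If you want to check this in R rather than in Excel, here is a minimal sketch (the file name and the column names Word and Freq.count are assumptions; adjust them to the headers of the file you downloaded):

    esp <- read.delim("SUBTLEX-ESP.txt")    # or import the Excel sheet first
    esp$Freq.count[esp$Word == "cenar"]     # should print 3721
    esp$Freq.count[esp$Word == "verdad"]    # should print 54203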

We thank Manolo Perea and Maria Angeles Alonso for their feedback. If you find other problems in our databases, please let us know. Although we check our data as much as possible, it is impossible to completely avoid programming errors in such vast databases.

References:

Alonso, M.A., Fernandez, A., & Diez, E. (2011). Oral frequency norms for 67,979 Spanish words. Behavior Research Methods, 43, 449-458.

Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicologica, 32, 133-143.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

The data of the British Lexicon Project are available now

Now that our paper on the British Lexicon Project has been published in Behavior Research Methods (Keuleers et al., 2012), we are delighted to make the data of the British Lexicon Project available to other users. For the time being, you have to download them as databases (various formats are available). Once we have chewed on them some more, we will make a search engine for them similar to the one for the Dutch Lexicon Project.

Here you can find a demo of how to easily enter BLP information into your stimulus Excel file.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

How to determine whether one frequency measure is better than another?

In our research comparing various frequency measures, we usually look at the correlations between the frequency measures and word processing times (e.g., lexical decision times), and we go for the frequency measure with the highest correlation. However, reviewers (and editors) increasingly ask to see a p-value when we recommend one frequency measure over another.

As long as we are dealing with megastudy data comprising tens of thousands of observations, there is little point in testing the statistical significance of differences between measures, as differences as small as .02 are likely to be statistically “significant” (p < .05!). However, when we only have small-scale studies at our disposal, things are different, and reviewers are right to ask for statistical confirmation.

    Hotelling-Williams test for dependent correlations

The test recommended for differences between correlations that are themselves intercorrelated (as is the case for various frequency measures) is the Hotelling-Williams test (Steiger, 1980). You can find the test in several R packages, but it is reasonably simple to implement yourself (the figure shows the equation you need; an R sketch follows below). For instance, when the SUBTLEX log frequency correlates .75 with 240 lexical decision times and the CELEX log frequency .69, while the two log frequency measures correlate .84 with each other, then r12 = .75, r13 = .69, r23 = .84, N = 240, t = 2.4934, df = 237, p = .0133. You can find an Excel file here that does the calculations for you.
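
For readers who prefer code to the equation in the figure, here is a minimal R implementation (the function name is ours, not from any package); it reproduces the example above.

    # Hotelling-Williams test for two dependent correlations sharing a
    # variable: r12 = cor(RT, freq1), r13 = cor(RT, freq2),
    # r23 = cor(freq1, freq2), n = number of observations
    hotelling.williams <- function(r12, r13, r23, n) {
      # determinant of the 3 x 3 correlation matrix
      detR <- 1 - r12^2 - r13^2 - r23^2 + 2 * r12 * r13 * r23
      rbar <- (r12 + r13) / 2
      tval <- (r12 - r13) * sqrt(((n - 1) * (1 + r23)) /
                (2 * ((n - 1) / (n - 3)) * detR + rbar^2 * (1 - r23)^3))
      c(t = tval, df = n - 3, p = 2 * pt(-abs(tval), df = n - 3))
    }

    hotelling.williams(.75, .69, .84, 240)
    #      t       df        p
    # 2.4934 237.0000   0.0133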

    The Vuong-test and Clarke-test for non-nested models

The Hotelling-Williams test is fine as long as you are dealing with simple correlations. This is a limitation in frequency research, because the relationship between word processing times and log frequency is not linear but levels off at high word frequencies. We capture this aspect by running nonlinear regression analyses (either with polynomials or with restricted cubic splines). We then have R²-values rather than r-values. For instance, for the above data we would have something like R² = .59 for the SUBTLEX log frequencies and R² = .51 for the CELEX log frequencies (i.e., a few percent above the squared values of the linear correlations). Is the difference between these values still significant?
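
As an illustration, such a nonlinear regression can be run with the splines package that ships with R; ns() fits a natural (i.e., restricted) cubic spline, and with df = 2 it has three knots in total. The data frame dat and its column names are assumptions for this sketch.

    library(splines)
    # RTs regressed on log frequency with a restricted cubic spline,
    # which lets the frequency effect level off at the high end
    m.subtlex <- lm(rt ~ ns(log.subtlex, df = 2), data = dat)
    m.celex   <- lm(rt ~ ns(log.celex,   df = 2), data = dat)
    summary(m.subtlex)$r.squared    # something like .59
    summary(m.celex)$r.squared      # something like .51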

The test usually recommended here is the Vuong test (Vuong, 1989). It is based on a comparison of the log-likelihoods of the two models. The calculations are rather complicated, but the test is available in several R packages, such as games, pscl, spatcounts, and ZIGP (be careful: some require the models to be estimated with the glm function, others with the lm function). Clarke (2007) reviewed the Vuong test and found it to be conservative for small N; that is, the test is less likely to yield statistical significance than is warranted. Clarke (2007) proposed an alternative nonparametric test that is claimed not to be conservative.
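
For Gaussian regression models like the ones above, the core of both tests can be written down in a few lines. The sketch below is ours and deliberately bare-bones: it compares the pointwise log-likelihoods of the two models directly and omits the correction terms for models with different numbers of parameters, so treat it as an illustration of the logic rather than a substitute for the packaged versions.

    # Vuong (1989): z-test on the mean difference in pointwise
    # log-likelihoods of two non-nested models fitted to the same data
    vuong.lm <- function(m1, m2) {
      e1 <- residuals(m1); e2 <- residuals(m2); n <- length(e1)
      ll1 <- dnorm(e1, sd = sqrt(mean(e1^2)), log = TRUE)  # ML sigma
      ll2 <- dnorm(e2, sd = sqrt(mean(e2^2)), log = TRUE)
      d <- ll1 - ll2
      z <- sqrt(n) * mean(d) / sd(d)
      c(z = z, p = 2 * pnorm(-abs(z)))
    }

    # Clarke (2007): sign test on the same pointwise differences;
    # under H0 each model "wins" for half of the observations
    clarke.lm <- function(m1, m2) {
      e1 <- residuals(m1); e2 <- residuals(m2); n <- length(e1)
      d <- dnorm(e1, sd = sqrt(mean(e1^2)), log = TRUE) -
           dnorm(e2, sd = sqrt(mean(e2^2)), log = TRUE)
      binom.test(sum(d > 0), n)$p.value
    }

    vuong.lm(m.subtlex, m.celex)    # models from the sketch above
    clarke.lm(m.subtlex, m.celex)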

To test the usefulness of the Vuong and Clarke tests for word frequency research, we ran Monte Carlo simulations of likely scenarios. Each simulation was based on 10,000 datasets. Per dataset we generated normally distributed variables X, Y, and Z with the following theoretical intercorrelations (the same between all three variables): .0, .2, .4, or .6. We additionally varied the number of data triplets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, or 5120. For each dataset, we calculated the observed intercorrelations between the variables and tested whether the correlation between X and Y differed significantly from the correlation between X and Z according to the Hotelling-Williams test, the Vuong test, and the Clarke test. For the sake of simplicity, we only report the percentage of tests with p < .05 and p < .10.
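
A stripped-down version of one cell of this design looks as follows in R (MASS::mvrnorm generates the correlated triplets; for brevity only the Hotelling-Williams test from the sketch above is evaluated, with fewer datasets than in the real simulations):

    library(MASS)                       # for mvrnorm
    set.seed(1)
    rho <- .4; n <- 160; nsim <- 1000   # one cell of the design
    Sigma <- matrix(rho, 3, 3); diag(Sigma) <- 1
    p <- replicate(nsim, {
      d <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)  # columns: X, Y, Z
      r <- cor(d)
      hotelling.williams(r[1, 2], r[1, 3], r[2, 3], n)["p"]
    })
    mean(p < .05)   # should be close to .05 for a well-calibrated test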

If a test works well, we expect 5% of the tests to be significant at the .05 level and 10% at the .10 level (given that both correlations were generated with the same algorithm and, hence, were equivalent at the population level). This is exactly what we obtained with the Hotelling-Williams test, as you can see here. In line with Clarke’s observations, the Vuong test was conservative. Surprisingly, this was not the case for the smallest sample size (N = 10), nor when the variables were intercorrelated with each other (as is the case for frequency measures). The Vuong test was particularly conservative when the theoretical correlations between X, Y, and Z were 0; for correlations of .4 and .6, it was no longer conservative.

In contrast, Clarke’s test was way too liberal, in particular for large sample sizes and intercorrelated variables. In the worst cases, it returned significance in more than 50% of the datasets in a situation in which no differences between the correlations were expected. Hence, there is not much you can conclude from a significant Clarke test for the question we are addressing (unless you want to impress reviewers and editors without statistical sophistication who insist on seeing “reassuring” p-values).

Thus far, we have only used the Vuong and Clarke tests in situations where the better Hotelling-Williams test applies as well. As indicated above, however, we need the Vuong or Clarke test mostly for situations in which more complicated models are compared. Therefore, we checked how well these tests perform when, instead of linear regression, we used restricted cubic splines with 3 knots (which allow you to capture the floor effect at high word frequencies; see the sketch above). For comparison purposes, we also calculated the Hotelling-Williams test on the correlations. The results were reassuring: the introduction of nonlinear regression did not lead to an unwarranted increase in significant tests, as you can see here.

All in all, the Hotelling-Williams test is the best test for comparing dependent correlations. The Vuong test is a good alternative, unless there is very little correlation between the variables. The Clarke test is less useful for our purposes, because it often returns significance when none is indicated.

Clarke, K.A. (2007). A simple distribution-free test for nonnested model selection. Political Analysis, 15, 347-363.

Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245-251.

Vuong, Q.H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.

German SUBTLEX-DE word frequencies available now

Together with colleagues from Münster and Berlin, we have collected and validated subtitle-based word frequencies for German. As in other languages, the SUBTLEX-DE word frequencies explain more variance in lexical decision times than the other available word frequency measures, including CELEX, Leipzig, dlexDB, and Google Ngram=1. You can find our manuscript about the SUBTLEX-DE word frequencies here (Brysbaert et al., 2011) and easy-to-use files with the frequencies here.

Here you can find a demo of how to easily enter SUBTLEX-DE values into your stimulus Excel file.


Reference

  • Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Visual Word Recognition (vwr) package for R released

We have just released vwr, an R package to assist in computations often needed in visual word recognition research.

The manual for the package can be found here.

Vwr includes functions to:

  • Compute Levenshtein distances between strings
  • Compute Hamming distances (overlap distance) between strings
  • Compute neighbors based on the Levenshtein and Hamming distances
  • Compute Coltheart’s N and average Levenshtein distances (e.g., Yarkoni et al.’s OLD20 measure). These functions run in parallel on multiple cores and offer a major speed advantage when computing these values for large lists of words.
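
A minimal usage sketch (function and dataset names as we recall them from the manual; check the package documentation after installing, as the exact signatures may differ):

    install.packages("vwr")    # vwr is on CRAN
    library(vwr)
    data(english.words)                              # word list shipped with vwr
    levenshtein.distance("mouse", c("house", "horse", "mouth"))
    old20("house", english.words)                    # mean distance to the 20 closest words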

The package also includes the ldknn algorithm, a method that we recently proposed to examine how balanced a lexical decision task is (i.e., how easy it is to discriminate the words from the nonwords in an experiment given no other information than the stimuli in the experiment). A preliminary version of that paper can be found here.

As the package is listed on CRAN, it can be installed like any other official package for R.

We’d be happy to have your feedback. As this is just a first version, it is guaranteed not to be error free. Also, feel free to pass it on to interested colleagues for testing.

Test your vocabulary

Stephane Dufau of the University of Marseille has developed an app with which people can test their vocabulary on an iPod, iPhone, or iPad. The test consists of a lexical decision task in which words and nonwords are presented in batches of 50 (it takes a few minutes). Afterwards, you receive feedback on your performance.

UPDATE: We contributed to this test from 4 February 2011 until 16 March 2012. Since then, together with Dutch broadcasters, we have set up a much more ambitious and more instructive megastudy on knowledge of Dutch words. You can find more information about it here.

Enjoy!

New Wuggy Version with Vietnamese language module

We just released a new version of Wuggy, our nonword generator. This version includes a Vietnamese language module (thanks to Hien Pham) and many improvements to the German language module.

A new kid in town: Are the new Google Ngram frequencies better than the SUBTLEX word frequencies?

Several colleagues alerted us to the new Google Ngram Viewer. Given that Ngram=1 equals word frequency and given that these Google Ngrams are calculated on the basis of millions of books, wouldn’t these frequencies be much better than our SUBTLEX word frequencies, which are based on only some 50 million words?

The answer to this question largely depends on the type of texts used by the Google Ngram Viewer: what type of language was used to build the corpus? We have found that above 30 million words, corpus register matters much more for word frequencies than corpus size.

There is only one way to test word frequencies for psycholinguistic research: by correlating them with lexical decision times. As a first analysis, we correlated the Google Ngrams and the other estimates of word frequency with the 28.7K words from the British Lexicon Project (standardized RTs, which are the most reliable variable). For this analysis, we excluded the few hundred words that were not in Google Ngram (mostly because they were loan words with non-ASCII letters). In total we could use a matrix of 28,370 words (words that were not observed in the smaller corpora were given a frequency of 0).
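
In R, the analysis boils down to something like the sketch below (the data frame blp and its column names are assumptions; 1 is added before taking logs so that the words with frequency 0 do not drop out):

    # blp: one row per word, with the standardized RT and the raw
    # frequency count of that word in each corpus (0 if unseen)
    cor(blp$zscore.rt, log10(blp$freq.subtlex + 1))
    cor(blp$zscore.rt, log10(blp$freq.ngram + 1))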

This was the outcome (all word frequencies were log transformed):

Correlation with SUBTLEX-US (51M words): r = -.635

Correlation with Google Ngram 1 English One Million (93.7B words): r = -.546

Further analyses indicated that the poor performance of the Google Ngram measure was due to (1) the use of old books, and (2) the fact that non-fiction books were included in the corpus. As our findings below show, the best Google Ngram frequency measure is based on the English Fiction corpus for the period 2000-2008:

Correlation with Google Ngram 1 English One Million restricted to the years 2000 and later (6.10B words): r = -.607

Correlation with Google Ngram 1 English Fiction all years (75.2B words): r = -.594

Correlation with Google Ngram 1 English Fiction years 2000 and later (24.2B words): r = -.635

All in all, three interesting findings:

  1. Word frequencies become outdated after some time (hence the better performance of the frequencies from recent years than of those based on all years).

  2. The Fiction corpus is better than the One Million corpus. This presumably has to do with the fact that the Fiction corpus better approximates the type of language participants in psychology experiments have been exposed to.

  3. Despite the much larger corpus size, the Google Ngram estimates are not better than the SUBTLEX measures of word frequency (actually, SUBTLEX-US explains 1% more of the variance). This agrees with our previous observation that size does not matter much above 30M words.

Article in which the Google Ngrams were announced:

Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., & Aiden, E.L. (2010). Quantitative analysis of culture using millions of digitized books. Science. Published online 16 December 2010.

Wuggy best article of BRM 2010

Emmanuel Keuleers won the award for the best article in Behavior Research Methods 2010. His Wuggy pseudoword generator allows researchers to select the best possible nonwords for lexical decision experiments. This can be done in different languages and the program is so user-friendly that everyone can use it. The award will be presented at the next Annual Meeting of the Psychonomic Society in St. Louis, Missouri (November 18-21, 2010). Emmanuel will be there.

We contributed to a video about dyslexia

UGent has made a video about students with dyslexia who completed their studies at the university and are now building fascinating careers. They talk about the obstacles from their student days and how they overcame them. A must for everyone!

Watch the video here.

And here you can find a brochure that accompanies the video.

Understanding the Part of Speech (PoS) information in SUBTLEX-NL

Marc Brysbaert & Emmanuel Keuleers

In processing the subtitle corpus on which SUBTLEX-NL was based, we used the wonderful Tadpole software, made freely available by Tilburg University’s ILK lab. Tadpole allowed us to distinguish meaningful word units from punctuation in the text (tokenizing), to tag these units according to their part of speech, such as noun, verb, or adjective (PoS tagging), and to group related forms under a single heading, as would be done in a dictionary (lemmatizing): for instance, a verb with its present tense and past tense forms, a noun with its plural and diminutive forms, etc.

[…]

How often are Dutch words used?

A new, easy-to-use database of word frequencies

Marc Brysbaert & Emmanuel Keuleers

Department of Experimental Psychology, Ghent University

Marc.brysbaert@ugent.be

Research has shown that we are fastest at recognizing stimuli that we learned early in life and that we have seen often. This has important consequences for learning new material (e.g., when learning to read or when learning a new language). The first words we learn, and the words we see and hear most often, are the ones we remember best. This is good news, because it gives teachers and support workers tools to optimize the learning process. Recent research in our center suggests that we already recognize a word considerably faster when we have read it 20 times than when we have read it only 10 times. The so-called word frequency effect is thus not an effect that only matters for words we have already seen hundreds of times!

Which, then, are the most frequent words in Dutch? To answer this question, researchers have scanned texts and counted the words. The best-known word frequency list is the CELEX list, compiled by the Max Planck Institute in Nijmegen in the years 1980-1990. Unfortunately, this list is rather difficult to use (see http://celex.mpi.nl/) and, moreover, the word frequencies are not always accurate, because they are based on rather old texts for adults.

We obtain better word frequency measures when we use subtitles of films and television programs. This is the case in all the languages we have tested: French, English, Chinese, Spanish, and also Dutch. When we look at which words students can read quickly and accurately, word frequencies based on subtitles are a better predictor than the CELEX frequencies. You can read more about this in Keuleers, Brysbaert, & New (2010). An additional advantage is that the subtitle-based word frequencies are freely available and can be downloaded from the internet, or even queried directly. The Excel file looks as follows:

The words are ordered from high-frequency to low-frequency (in subtitles, "ik" [I] and "je" [you] thus occur most often). The meaning of the columns is as follows:

  • FREQcount: the number of times the word occurs in the corpus (out of a total of 43.8 million words).
  • CDcount: the number of films/programs in which the word occurs (out of a total of 8,070).
  • FREQlow and CDlow: the same information, but only for occurrences in which the word starts with a lowercase letter. This provides important information about which words predominantly start with a capital letter. This is the case, for example, for "wat" [what] (it starts with a lowercase letter 220 thousand times and with a capital letter 260 thousand times).
  • FREQlemma: the sum of the inflected forms (e.g., for "koord" [cord] this is the sum of "koord" and "koorden").
  • SUBTLEXWF: frequency per 1 million words.
  • Lg10WF: logarithm of the frequency.
  • SUBTLEXCD: frequency per 100 documents.
  • Lg10CD: logarithm of the number of documents.
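
For those who want to recompute the derived columns from the raw counts, the relations look roughly as follows in R (the file name is an assumption, and the exact formulas, in particular the +1 inside the logarithm, should be checked against the article):

    subtlex <- read.delim("SUBTLEX-NL.txt")            # file name is an assumption
    subtlex$SUBTLEXWF <- subtlex$FREQcount / 43.8      # per million words (43.8M-word corpus)
    subtlex$Lg10WF    <- log10(subtlex$FREQcount + 1)  # log of the frequency count
    subtlex$SUBTLEXCD <- 100 * subtlex$CDcount / 8070  # per 100 of the 8,070 documents
    subtlex$Lg10CD    <- log10(subtlex$CDcount + 1)    # log of the document count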

Users who want to know more about the use of words in their different syntactic roles (e.g., "spelen" [to play] as a verb vs. as a noun) will find what they need in the file with lemmas and word forms. You can also directly look up the frequencies of the words that interest you here (see also here for more information about the online searches).

The data from the subtitle corpus yield a few counterintuitive insights. For instance, it turns out that with a vocabulary of 61 words we can already understand half of the words people use in conversations! One in five words we say is "ik", "je", "het", "de", "dat", "is", "niet", or "een". With a vocabulary of 1,000 words, you already understand 82% of the words that are said. The percentages for 2,000 and 3,000 words are 87% and 90%, respectively (which does mean that with a vocabulary of 3,000 words we will still fail to understand 1 in 10 words). Insight into such regularities allows us to substantially optimize our teaching and support services. After all, it is much more worthwhile to practice a relatively small set of frequently used words well and repeatedly than to go through a long list only once.
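
These coverage figures can be checked directly against the frequency list; a sketch, assuming the subtlex data frame from above:

    counts   <- sort(subtlex$FREQcount, decreasing = TRUE)
    coverage <- cumsum(counts) / sum(counts)       # cumulative token coverage
    round(coverage[c(61, 1000, 2000, 3000)], 2)    # roughly .50, .82, .87, .90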

For example, English has some 400 irregular verbs (a number of which are compounds, such as "outrun"). An uninteresting way to teach them is to try to cover them all, since many of these verbs hardly ever occur in everyday language (e.g., abide, alight, backslide, befall, beget, behold, bend, bereave, beseech, …). A much more interesting teaching method is to start from the question of how often a learner will need a word, and then to see which verbs are the most important (for English subtitle frequencies, see http://expsy.ugent.be/subtlexus). It then turns out that a list of some 50 verbs covers the vast majority of all verb uses. The following verbs would certainly be in that list: be, have, do, know, get, like, go, come, think, see, let, take, tell, say, make, give, find, put, win, keep, feel, leave, hear, show, understand, hold, meet, run, bring. It is much more important to practice these words well than to half-know the complete list of verbs.

SUBTLEX frequencies are a powerful predictor of number processing speed

Our colleague Ineke Imbo recently used the SUBTLEX-NL word frequencies in her research on number processing and found that the SUBTLEX-NL frequencies for the numbers 1-99 are the most powerful predictor of their naming latencies.


Table: Correlation between number naming latencies and several frequency measures. All frequency measures were log transformed.

                                  SUBTLEX-NL    SUBTLEX-NL    GOOGLE        ESTIMATED
                                  number        number word   number        number
                                  frequencies   frequencies   frequencies   frequencies
Dutch naming latencies (L1)         -.537         -.518         -.513         -.427
English naming latencies (L2)       -.609         -.642         -.539         -.570

Note: SUBTLEX-NL frequencies from Keuleers, Brysbaert, & New (2010); GOOGLE number frequencies from Verguts & Fias (2006); ESTIMATED number frequencies from Gielen et al. (1991).

Twenty students at Ghent University (Belgium) (sixteen females, four males; mean age 19.7) named all numbers from 1 to 99 in two blocks. In one block they named the numbers in Dutch (L1), and in the other block in English (L2). Each number was presented four times per block. A trial started with a fixation point (500 ms), after which the number was presented in Arabic digits. Timing began when the stimulus appeared and ended when the participant’s response triggered the sound-activated relay. Invalid trials due to naming errors or faulty time registration were removed. The table above shows the correlations between the Dutch and English naming latencies, on the one hand, and the different frequency measures, on the other hand. The log-transformed frequencies from the SUBTLEX-NL database turn out to correlate best with the observed naming latencies, and this is true for both languages (L1 and L2).