New spelling (word dictation) tests

There are individual differences in language processing as a function of language exposure (measured with a vocabulary test) and orthographic precision (measured with a spelling test).

We have developed and tested an English and a Dutch word dictation test, which you can find here.

Dutch keywords: Nederlands en Engels dictee (woorden) voor studenten

New one-minute reading test for research with students

Our research showed that the Een-Minuut-Test (One-Minute Test; Brus & Voeten, 1991) is one of the best tests for screening students for dyslexia. In this test, participants have to read aloud as many words as possible in 1 minute.

However, the test has two limitations:

  1. You have to pay to administer it.
  2. The number of words is not high enough, so some students have already read all the words before the minute is up.

To address these limitations, we have developed and validated a new test. You can find the test here. The official reference is:

Tops, W., Nouwels, A., & Brysbaert, M. (2019). Een nieuw screeningsinstrument voor leesonderzoek bij Nederlandse studenten: de Leestest 1-minuut studenten (LEMs). Stem-, Spraak- en Taalpathologie, 24, 1-22.

 

 


A new Dutch vocabulary test with 40 multiple choice items

In Vander Beken, Woumans, & Brysbaert (2018) we published a new Dutch vocabulary test with 75 multiple choice items for advanced users (typically university students). The test has a reliability of .84 (Cronbach’s alpha) and correlates .6 with English L2 proficiency (indicating that bilinguals who are good in one language are typically good in the other as well). The test can be used for free for research purposes.

After a request from the Ghent University Museum for a shorter test, and after receiving the results of a large-scale test with 50 of the questions by Doreleijers & van der Sijs (2019), we ran an item analysis, which showed that the test could be shortened to 40 questions without loss of information.

You find the original test with 75 questions here.

You find the cleaned version with 40 questions and the data from the item analysis here.

 


You can find other vocabulary tests on this website.

 

Reference

Vander Beken, H., Woumans, E., & Brysbaert, M. (2018). Studying texts in a second language: No disadvantage in long-term recognition memory. Bilingualism: Language and Cognition, 21(4), 826-838. pdf

 

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

A list of word megastudies with links to data (if available)

As the number of megastudies grows, it becomes difficult to keep track of everything that is out there. For a paper I decided to make a table, and then realized that it would be great to have the information on a website with links to the articles and the datasets.

You find the outcome here. The list contains all the megastudies and eye movement corpora that I am aware of. Originally I wanted to work with a cut-off criterion of at least 1,000 words (the lower limit of the definition of 'mega'), but it rapidly became clear that this excluded several interesting datasets. So, for the sake of completeness, I dropped the criterion, although it still feels odd to me that you can have a megastudy with fewer than 1,000 stimuli.

Enjoy! And please contact me if you know of more datasets.

 

References

Adelman, J. S., Marquis, S. J., Sabatos-DeVito, M. G., & Estes, Z. (2013). The unexplained nature of reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(4), 1037-1053.

Aguasvivas, J., Carreiras, M., Brysbaert, M., Mandera, P., Keuleers, E., & Duñabeitia, J. A. (2018). SPALEX: A Spanish lexical decision database from a massive online data collection. Frontiers in Psychology, 9, 2156. doi: 10.3389/fpsyg.2018.02156.

Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283-316.

Balota, D. A., & Spieler, D. H. (1998). The utility of item level analyses in model evaluation: A reply to Seidenberg & Plaut (1998). Psychological Science, 9(3), 238-240.

Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2013). Megastudies: What do millions (or so) of trials tell us about lexical processing? In J. S. Adelman (Ed.), Visual Word Recognition Volume 1: Models and methods, orthography and phonology (pp. 90-115). New York, NY: Psychology Press.

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445-459.

Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42, 441-458.

Chang, Y. N., Hsu, C. H., Tsai, J. L., Chen, C. L., & Lee, C. Y. (2016). A psycholinguistic database for traditional Chinese character naming. Behavior Research Methods, 48(1), 112-122.

Cohen-Shikora, E. R., Balota, D. A., Kapuria, A., & Yap, M. J. (2013). The past tense inflection project (PTIP): Speeded past tense inflections, imageability ratings, and past tense consistency measures for 2,200 verbs. Behavior Research Methods, 45(1), 151-159.

Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49(2), 602-615.

Cortese, M.J., Hacker, S., Schock, J. & Santo, J.B. (2015a). Is reading aloud performance in megastudies systematically influenced by the list context? Quarterly Journal of Experimental Psychology, 68, 1711-1722. doi: 10.1080/17470218.2014.974624

Cortese, M.J., Khanna, M.M., & Hacker, S. (2010) Recognition memory for 2,578 monosyllabic words. Memory, 18, 595-609. DOI: 10.1080/09658211.2010.493892.

Cortese, M.J., Khanna, M.M., Kopp, R., Santo, J.B, Preston, K.S., & Van Zuiden, T. (2017). Participants shift response deadlines based on list difficulty during reading aloud megastudies, Memory & Cognition, 45, 589-599.

Cortese, M.J., McCarty D.P., & Schock, J. (2015b). A mega recognition memory study of 2,897 disyllabic words. Quarterly Journal of Experimental Psychology, 68, 1489-1501. doi: 10.1080/17470218.2014.945096

Cortese, M. J., Yates, M., Schock, J., & Vilks, L. (2018). Examining word processing via a megastudy of conditional reading aloud. Quarterly Journal of Experimental Psychology, 71(11), 2295-2313.

Davies, R., Barbón, A., & Cuetos, F. (2013). Lexical and semantic age-of-acquisition effects on word naming in Spanish. Memory & Cognition, 41(2), 297-311.

Dufau, S., Grainger, J., Midgley, K. J., & Holcomb, P. J. (2015). A thousand words are worth a picture: Snapshots of printed-word processing in an event-related potential megastudy. Psychological Science, 26(12), 1887-1897.

Ernestus, M., & Cutler, A. (2015). BALDEY: A database of auditory lexical decisions. The Quarterly Journal of Experimental Psychology, 68(8), 1469-1488.

Ferrand, L., Brysbaert, M., Keuleers, E., New, B., Bonin, P., Meot, A., Augustinova, M., & Pallier, C. (2011). Comparing word processing times in naming, lexical decision, and progressive demasking: evidence from Chronolex. Frontiers in Psychology, 2:306. doi: 10.3389/fpsyg.2011.00306.

Ferrand, L., Méot, A., Spinelli, E., New, B., Pallier, C., Bonin, P., … & Grainger, J. (2018). MEGALEX: A megastudy of visual and auditory word recognition. Behavior Research Methods, 50(3), 1285-1307.

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M., & Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42, 488-496.

Frank, S. L., Monsalve, I. F., Thompson, R. L., & Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45(4), 1182-1190.

Frank, S. L., Otten, L. J., Galli, G., & Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140, 1-11.

Futrell, R., Gibson, E., Tily, H. J., Blank, I., Vishnevetsky, A., Piantadosi, S. T., & Fedorenko, E. (2018). The Natural Stories Corpus. In Proceedings of LREC 2018, Eleventh International Conference on Language Resources and Evaluation (pp. 76-82). Miyazaki, Japan.

González-Nosti, M., Barbón, A., Rodríguez-Ferreiro, J., & Cuetos, F. (2014). Effects of the psycholinguistic variables on the lexical decision task in Spanish: A study with 2,765 words. Behavior Research Methods, 46(2), 517-525.

Heyman, T., Van Akeren, L., Hutchison, K. A., & Storms, G. (2016). Filling the gaps: A speeded word fragment completion megastudy. Behavior Research Methods, 48(4), 1508-1527.

Husain, S., Vasishth, S., and Srinivasan, N. (2014). Integration and prediction difficulty in Hindi sentence comprehension: Evidence from an eye-tracking corpus. Journal of Eye Movement Research, 8(2), 1-12.

Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse, C. S., … & Buchanan, E. (2013). The semantic priming project. Behavior Research Methods, 45(4), 1099-1114.

Kessler, B., Treiman, R., & Mullennix, J. (2002). Phonetic biases in voice key response time measurements. Journal of Memory and Language, 47, 145-171.

Keuleers, E., & Balota, D. A. (2015). Megastudies, crowd-sourcing, and large datasets in psycholinguistics: An overview of recent developments. The Quarterly Journal of Experimental Psychology, 68(8), 1457-1468.

Keuleers, E., Diependaele, K. & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology 1:174. doi: 10.3389/fpsyg.2010.00174.

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287-304.

Laurinavichyute, A. K., Sekerina, I. A., Alexeeva, S., Bagdasaryan, K., & Kliegl, R. (2019). Russian Sentence Corpus: Benchmark measures of eye movements in reading in Russian. Behavior Research Methods.

Lee, C. Y., Hsu, C. H., Chang, Y. N., Chen, W. F., & Chao, P. C. (2015). The feedback consistency effect in Chinese character recognition: Evidence from a psycholinguistic norm. Language and Linguistics, 16(4), 535-554.

Lemhöfer, K., Dijkstra, T., Schriefers, H., Baayen, R. H., Grainger, J., & Zwitserlood, P. (2008). Native language influences on word recognition in a second language: A megastudy. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(1), 12-31.

Liu, Y., Shu, H., & Li, P. (2007). Word naming and psycholinguistic norms: Chinese. Behavior Research Methods, 39(2), 192-198.

Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50(2), 826-833.

Mousikou, P., Sadat, J., Lucas, R., & Rastle, K. (2017). Moving beyond the monosyllable in models of skilled reading: Mega-study of disyllabic nonword reading. Journal of Memory and Language, 93, 169-192.

Pexman, P. M., Heard, A., Lloyd, E., & Yap, M. J. (2017). The Calgary semantic decision project: concrete/abstract decision data for 10,000 English words. Behavior Research Methods, 49(2), 407-417.

Pritchard, S. C., Coltheart, M., Palethorpe, S., & Castles, A. (2012). Nonword reading: Comparing dual-route cascaded and connectionist dual-process models with human data. Journal of Experimental Psychology: Human Perception and Performance, 38(5), 1268.

Pynte, J., & Kennedy, A. (2006). An influence over eye movements in reading exerted from beyond the level of the word: Evidence from reading English and French. Vision Research, 46(22), 3786-3801.

Schröter, P., & Schroeder, S. (2017). The Developmental Lexicon Project: A behavioral database to investigate visual word recognition across the lifespan. Behavior Research Methods, 49(6), 2183-2203.

Seidenberg, M.S., & Waters, G.S. (1989). Word recognition and naming: A mega study. Bulletin of the Psychonomic Society, 27, 489.

Spieler, D. H., & Balota, D. A. (1997). Bringing computational models of word naming down to the item level. Psychological Science, 8(6), 411-416.

Sze, W. P., Liow, S. J. R., & Yap, M. J. (2014). The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior Research Methods, 46(1), 263-273.

Treiman, R., Mullennix, J., Bijeljac-Babic, R., & Richmond-Welty, E. D. (1995). The special role of rimes in the description, use, and acquisition of English orthography. Journal of Experimental Psychology: General, 124, 107-136.

Tsang, Y. K., Huang, J., Lui, M., Xue, M., Chan, Y. W. F., Wang, S., & Chen, H. C. (2018). MELD-SCH: A megastudy of lexical decision in simplified Chinese. Behavior Research Methods, 50(5), 1763-1777.

Tse, C. S., Yap, M. J., Chan, Y. L., Sze, W. P., Shaoul, C., & Lin, D. (2017). The Chinese Lexicon Project: A megastudy of lexical decision performance for 25,000+ traditional Chinese two-character compound words. Behavior Research Methods, 49(4), 1503-1519.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The Massive Auditory Lexical Decision (MALD) database. Behavior Research Methods.

Winsler, K., Midgley, K. J., Grainger, J., & Holcomb, P. J. (2018). An electrophysiological megastudy of spoken word recognition. Language, Cognition and Neuroscience, 1-20.

Yap, M. J., Liow, S. J. R., Jalil, S. B., & Faizal, S. S. B. (2010). The Malay Lexicon Project: A database of lexical statistics for 9,592 words. Behavior Research Methods, 42(4), 992-1003.

 

Test-based AoA measures for 44 thousand English words

Age of acquisition (AoA) is an important variable in word recognition research. Up to now, nearly all psychology researchers examining the AoA effect have used ratings obtained from adult participants. An alternative basis for determining AoA is to directly test children's knowledge of word meanings at various ages. In educational research, scholars and teachers have tried to establish the grade at which particular words should be taught by examining the ages at which children know various word meanings. Such a list is available from Dale and O'Rourke's (1981) Living Word Vocabulary for nearly 44 thousand meanings coming from over 31 thousand unique word forms and multiword expressions. In Brysbaert & Biemiller (2017) we relate these test-based AoA estimates to lexical decision times as well as to adult AoA ratings, and report strong correlations between all of the measures. Therefore, test-based estimates of AoA can be used as an alternative measure.

You find an Excel file with the test-based AoA norms here.

If you use the norms, please refer to our article:

Brysbaert, M., & Biemiller, A. (2017). Test-based Age-of-Acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49(4), 1520-1523. pdf

Measures of word prevalence for 61,800 English words

At long last we found time to make the English word prevalence measures available.

Word prevalence indicates how many people know a word. Because the percentage of people who know a word has an uninformative, heavily skewed distribution (most words are known by nearly everyone), word prevalence is calculated on the basis of a probit transformation (a short sketch after the list below illustrates the transformation). The following are interesting landmarks:

  • negative prevalence values: words known by less than 50% of the people; only of interest for word learning studies
  • prevalence = 0.0 : 50% of the people know this word
  • prevalence = 1.0 : 84% know the word
  • prevalence = 1.5 : 93% know the word
  • prevalence = 2.0 : 98% know the word
  • prevalence = 2.5 : nearly everyone knows the word
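
Here is a minimal sketch of the transformation referred to above, assuming the probit is the inverse of the standard normal cumulative distribution function; the percentages in the example are the landmarks from the list:

```python
# Minimal sketch: convert "proportion of people who know the word" into a
# word prevalence value via the probit (inverse standard normal) transform.
from scipy.stats import norm

def prevalence(proportion_known):
    """Probit-transformed proportion of respondents who know the word."""
    return norm.ppf(proportion_known)

for p in (0.50, 0.84, 0.93, 0.98):
    print(f"{p:.2f} known -> prevalence {prevalence(p):.2f}")
# 0.50 -> 0.00, 0.84 -> 0.99, 0.93 -> 1.48, 0.98 -> 2.05
```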

You find all the information in:

  • Brysbaert, M., Mandera, P., McCormick, S.F., & Keuleers, E. (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51(2),  467-479. pdf

You find an Excel file with the word prevalence norms for English here.

We now also have word prevalence norms for speakers of English as a second language: which words do they know and which do they not? You find them here.

We also have reaction times for the same 61,800 words. You find them here.

If you want more information about the use of word prevalence, have a look at our findings in Dutch.

Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Semantic vectors for Italian

Marco Marelli, affiliated with our center, has published Italian semantic vectors. You can find his article here.

This is the reference:

Marelli, M. (2017). Word-Embeddings Italian Semantic Spaces. A semantic model for psycholinguistic research. Psihologija, 50(4), 503–520.

You can do online searches for semantic similarities, semantic neighbors and analogies of Italian words here (for a large co-occurrence window) or here (for a small co-occurrence window; see the article to know which one to use for which question).

Italian keywords: vettori semantici per l’italiano

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Power analysis and effect size in mixed effects models: A tutorial

We’ve published the outcome of 4 years of study and computer simulations on the power of designs that include more than one observation per condition per participant. A problem with the current studies on the replication crisis is that power is always calculated on the assumption that each participant provides only one observation per condition. This is not what happens in experimental psychology, where participants respond to multiple stimuli per condition and where the data are averaged per condition or (preferably) analyzed with mixed effects models.

Main findings

In a nutshell, these are our findings:

  1. In experimental psychology we can do replicable research with 20 participants or fewer if we have multiple observations per participant per condition, because we can turn rather small differences between conditions into effect sizes of d > .8 by averaging across observations (as psychophysicists have known for almost a century). This is the positive outcome of the analyses.

  2. The more sobering finding is that the required number of observations is higher than the numbers currently used (which is why we keep running underpowered studies). The ballpark figure we propose for RT experiments with repeated measures is 1,600 observations per condition (e.g., 40 participants and 40 stimuli per condition); a simulation sketch after this list illustrates the point.

  3. The 1,600 observations we propose apply when you start a new line of research and do not know what to expect. The article gives you the tools to optimize your design once you've run the first study.

  4. Standardized effect sizes in analyses over participants (e.g., Cohen’s d) depend on the number of stimuli that were presented. Hence, you must include the same number of observations per condition if you want to replicate the results. The fact that the effect size depends on the number of stimuli also has implications for meta-analyses.
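
To make points 1 and 2 concrete, here is a minimal simulation sketch (not the code from the article, and without by-item random effects). It estimates the power of a paired t-test on per-participant condition means; all parameter values (the 600 ms baseline, the 20 ms effect, the variance components) are illustrative assumptions.

```python
# Minimal power simulation: a within-participant RT effect analyzed with a
# paired t-test on per-participant condition means. Illustrative values only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_power(n_subj=40, n_trials=40, effect_ms=20,
                   sd_subj=50, sd_effect=10, sd_resid=150,
                   n_sims=2000, alpha=0.05):
    """Estimated power of a paired t-test on per-participant condition means."""
    significant = 0
    for _ in range(n_sims):
        subj_intercept = rng.normal(0, sd_subj, n_subj)          # by-participant baseline
        subj_effect = rng.normal(effect_ms, sd_effect, n_subj)   # by-participant effect
        sem_trials = sd_resid / np.sqrt(n_trials)                # trial noise left after averaging
        mean_a = 600 + subj_intercept + rng.normal(0, sem_trials, n_subj)
        mean_b = 600 + subj_intercept + subj_effect + rng.normal(0, sem_trials, n_subj)
        res = stats.ttest_rel(mean_b, mean_a)
        significant += (res.pvalue < alpha) and (res.statistic > 0)
    return significant / n_sims

for n_trials in (10, 20, 40):
    print(n_trials, "trials/condition -> power", round(simulate_power(n_trials=n_trials), 2))
```

With these illustrative values, power rises steeply as the number of trials per condition increases for the same 40 participants, which is exactly the trade-off the tutorial quantifies.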

If you use the article please refer to it as follows:

  • Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1(1), 9. DOI: https://doi.org/10.5334/joc.10.

Power for other models

Because we got many questions on power after writing the manuscript (and people rarely appreciated the answers we gave), we decided to write a prequel dealing with power requirements for simple designs. You find the text here (Brysbaert, 2019).

Missed studies in the article

After the publication of the article, it became clear that other researchers had already noticed the relationship between the number of stimuli and the standardized effect size. Usually this was framed negatively (i.e., effect sizes are overestimated when they are based on the average of multiple observations), without paying attention to the more positive implications for power. Here are some pointers:

  • Brand et al. (2010) already noticed the relationship between number of stimuli per condition and standardized effect sizes. They additionally point to the importance of the correlation between the observations: The higher the correlation, the less multiple observations will increase the standardized effect size (and arguably the less they will help to make the study more powerful).

  • Richard Morey (2016) also noticed that the standardized effect sizes in F1 analyses depend on the number of observations per condition. Maybe the effect size proposed by Westfall et al. is the preferred measure for future use? Alternatively, in reaction time experiments nothing may be more informative than the raw effect in milliseconds.

  • There was an interesting observation by Jeff Rouder pointing to the increased power of experiments with multiple observations. His rule of thumb (if you run within-subject designs in cognition and perception, you can often get high-powered experiments with 20 to 30 people so long as they run about 100 trials per condition) agrees quite well with the norm we put forward (a properly powered reaction time experiment with repeated measures has at least 1,600 word observations per condition). With 2,000-3,000 observations per condition you have a high-powered experiment; with 1,600 you have a properly powered experiment. Within limits (say a lower limit of 20), in most experiments the numbers of trials and participants can be exchanged, depending on how difficult it is to create items or to find participants.

More recent publications of interest

Kolossa & Kopp (2018) report that for model testing in cognitive neuroscience it is more important to obtain extra data per participant than to test more participants.

Rouder & Haaf (2018) published an article that nicely complements ours. They make a theoretical analysis of when extra trials improve power. The basic message is that extra participants are always better than extra trials. However, the degree to which this is the case depends on the phenomenon you are investigating. If there is great interindividual variation in the effect and if the variation is theoretically expected, you need many participants rather than many trials (of course). This is true for many experiments in social psychology. In contrast, when the effect is expected to be present in each participant and when trial variability is larger than the variability across participants, you can trade people for trials. These conditions were met for the priming studies we discussed. No participant was expected to show a negative orthographic priming effect (faster lexical decision times after unrelated primes than after related primes), and the variability in the priming effect across participants (and stimuli) was much smaller than the residual error. These conditions are true for many robust effects investigated in cognitive psychology, in particular for those investigated with reaction times. Indeed, many studies in cognitive psychology address the borderline conditions of well-established effects (to make a distinction between alternative explanations).

Another article warning against skimping on the number of trials per condition was published by Boudewyn et al. (2018). If you look at their small effect sizes (remember, these are the ones we are after most of the time!), the recommendation of 40 participants and 40 trials seems to hold for EEG research as well.

Nee (2019) nicely describes how extra runs improve the replicability of fMRI data, even with rather small sample sizes (n = 16). This is the good old psychophysics approach.

Inconsistencies in underpowered fMRI studies are nicely described by Munzon & Hernandez (2019), who started from a large sample (like we did) and looked at what would have been found in smaller samples. Well worth a read! Another article worth reading is Ramus et al. (2018), who document the many inconsistencies in fMRI research on dyslexia and convincingly relate this to the problem of underpowered studies.

Our article does not deal with interactions. A nice blog by Roger Giner-Sorolla (based on work by Uri Simonsohn) indicates that for an extra variable with 2 levels, it is advised to multiply the number of observations by at least 4 if you want to draw meaningful conclusions about the interaction (see also Brysbaert, 2019). So, beware of including multiple variables in your study. Is the interaction really needed to test your hypothesis?

Power of interactions also features in a review paper on power issues by Perugini et al. (2018).

Goulet & Cousineau (2019) discuss how you can use the reliability of your dependent variable to determine the best ratio of number of trials vs. number of participants (a message also in Brysbaert, 2019).

 

We’ve collaborated to validate a new set of 750 pictures for picture naming experiments

We have collaborated to validate a new set of 750 colored pictures for picture naming research, compiled by Jon Andoni Dunabeitia at the Basque Center on Cognition, Brain and Language. In particular, we have collected name agreement data for Belgian Dutch. Other languages that have been added are Spanish, British English, French, German, Italian, and Netherlands’ Dutch.

You find all information (including files about name agreement and raw data files) at the BCBL website (see the link above).

Please refer to the database as follows:

Dunabeitia, J.A., Crepaldi, D., Meyer, A.S., Pliatsikas, C., Smolka, E., & Brysbaert, M. (in press). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology. pdf

Dutch keywords: set plaatjes, benoeming, prenten, onderzoek, woordbenoeming, psycholinguïstiek

How many words do we know?

How large is our vocabulary? Based on an analysis of the literature and a large-scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas, or about one new lemma every 2 days. Knowledge of a word can be as shallow as knowing that it exists. In addition, people learn tens of thousands of inflected forms and proper nouns (names), which explains the substantially higher numbers of 'words known' mentioned in other publications.

You find the full details of our calculation of the vocabulary size here.

Here you find the file with all the lemmas and word families (as it turned out, for some reason a few words were lost in the file I uploaded to Frontiers, among which again, against, and ahead).

Semantic vectors for words in English and Dutch

Algorithms are becoming increasingly powerful at deriving word meanings from word co-occurrences in texts. Paweł Mandera has compared the various algorithms to select the best one so far for use in psycholinguistic research. This turns out to be the Continuous Bag of Words (CBOW) model (Mikolov, Chen, Corrado, & Dean, 2013) based on a combined corpus of texts and subtitles. The findings have now been accepted for publication in the Journal of Memory and Language. This is the pdf. Please refer to it as:

Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57-78.

More interestingly, Paweł also makes the semantic vectors available online and has created an easy-to-use shell program and a web interface for those who do not feel confident enough to program. So, everyone can now calculate the semantic distance (or semantic similarity) based on CBOW between any two words in English and Dutch online. More information can be found here.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
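
As an illustration of what such a semantic-similarity calculation involves, here is a minimal sketch (not Paweł's tool): the cosine similarity between two CBOW-style word vectors read from a plain-text file. The file name and its format (a word followed by space-separated numbers on each line) are assumptions for the example.

```python
# Minimal sketch: cosine similarity between two word vectors.
import numpy as np

def load_vectors(path):
    """Read 'word v1 v2 ... vn' lines into a dict of numpy arrays."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# vecs = load_vectors("english_cbow_vectors.txt")   # hypothetical file name
# print(cosine_similarity(vecs["doctor"], vecs["nurse"]))
```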

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Affective norms for 14,000 Spanish words

Hans Stadthagen-Gonzalez just made valence and arousal norms available for 14,000 Spanish words.

You find the norms here.

If you use the ratings, please refer to the article:

  • Stadthagen-Gonzalez, H., Imbault, C., Pérez Sánchez, M.A., & Brysbaert, M. (in press). Norms of Valence and Arousal for 14,031 Spanish Words. Behavior Research Methods. pdf

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The Dutch Lexicon Project 2 made available

In the Dutch Lexicon Project, we collected lexical decision times for 14K monosyllabic and disyllabic Dutch words. The Dutch Lexicon Project 2 (DLP2) contains lexical decision times for 30K Dutch lemmas. These include almost all words regularly used in Dutch, independent of length.

The reference to this database is:

Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42, 441-458. pdf

These are files you may find interesting:

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Our paper on measuring vocabulary size and word prevalence is now in press

Our paper “Word knowledge in the crowd: measuring vocabulary size and word prevalence using massive online experiments” is now in press in The Quarterly Journal of Experimental Psychology.

The word prevalence values for 54,319 Dutch words in Belgium and the Netherlands used in this paper can be found on this page.

In this paper, we have analyzed part of the data from our online vocabulary test (http://woordentest.ugent.be) in which hundreds of thousands of people from Belgium and the Netherlands participated.

Important results from this paper:

  • Word prevalence, the proportion of people who know a word, appears to be the most important variable in predicting visual word recognition times in the lexical decision task. We conjecture that this is because word prevalence estimates the true occurrence of words better than word frequency does in the low-frequency range.
  • A person’s vocabulary accumulates throughout life in a predictable way: the number of words known increases logarithmically with age.
  • This result mirrors the growth of the number of unique words encountered as a function of text length (known as Herdan’s law in quantitative linguistics) and is demonstrated here for the first time for human language acquisition.
  • Knowing more foreign languages increases rather than decreases vocabulary in your first language. This is probably a result of the shared vocabulary between languages and the faster growth in new word types when acquiring a new language.

 

Word prevalence has been used for the analysis of the data from the Dutch Lexicon Project 2.

Words known in the UK but not in the US, and vice versa

Our vocabulary test keeps on doing well (over 600K tests completed now). Below is a list of 20 words known in the UK but not in the US, and a list of 20 words known in the US but not in the UK. By 'known' we mean selected by more than 85% of the participants from that country with English as their native language. As you can see, for each word there is a difference of more than 50 percentage points between the two countries.

Better known in the UK (between brackets, percent known in the US and percent known in the UK)

  • tippex (7, 91)
  • biro (17, 99)
  • tombola (17, 97)
  • chipolata (16, 93)
  • dodgem (17, 94)
  • korma (20, 97)
  • yob (22, 97)
  • judder (19, 94)
  • naff (19, 94)
  • kerbside (23, 98)
  • plaice (16, 91)
  • escalope (17, 91)
  • chiropody (20, 93)
  • perspex (22, 94)
  • brolly (24, 96)
  • abseil (15, 87)
  • bodge (18, 89)
  • invigilator (22, 92)
  • gunge (19, 89)
  • gormless (26, 96)

Better known in the US (between brackets, percent known in the US and percent known in the UK)

  • garbanzo (91, 16)
  • manicotti (90, 15)
  • kabob (98, 29)
  • kwanza (91, 24)
  • crawdad (86, 20)
  • sandlot (97, 32)
  • hibachi (89, 27)
  • provolone (97, 36)
  • staph (86, 25)
  • boondocks (96, 37)
  • goober (96, 37)
  • cilantro (99, 40)
  • arugula (88, 29)
  • charbroil (97, 39)
  • tamale (92, 35)
  • coonskin (88, 31)
  • flub (89, 31)
  • sassafras (92, 35)
  • acetaminophen (92, 36)
  • rutabaga (85, 30)

You can still help us to get more refined data by taking part in our vocabulary test. For instance, we do not yet have enough data to say anything about differences with Canada, Australia, or any other country with English as an official language.

Words known by men and women

Some words are better known to men than to women and the other way around. But which are they? On the basis of our vocabulary test (specifically, the first 500K tests completed), we can start to answer this question. These are the 12 words with the largest difference in favor of men (between brackets: % of men who know the word, % of women who know the word):

  • codec (88, 48)
  • solenoid (87, 54)
  • golem (89, 56)
  • mach (93, 63)
  • humvee (88, 58)
  • claymore (87, 58)
  • scimitar (86, 58)
  • kevlar (93, 65)
  • paladin (93, 66)
  • bolshevism (85, 60)
  • biped (86, 61)
  • dreadnought (90, 66)

These are the 12 words with the largest difference in favor of women:

  • taffeta (48, 87)
  • tresses (61, 93)
  • bottlebrush (58, 89)
  • flouncy (55, 86)
  • mascarpone (60, 90)
  • decoupage (56, 86)
  • progesterone (63, 92)
  • wisteria (61, 89)
  • taupe (66, 93)
  • flouncing (67, 94)
  • peony (70, 96)
  • bodice (71, 96)

These 24 words should suffice to find out whether a person you are interacting with in digital space is male or female.
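
As a toy illustration (certainly not a validated instrument), one could count how many of the words from each of the two lists above a person reports knowing and compare the two counts:

```python
# Toy sketch: guess from the 24 words above whether a respondent is more
# likely male or female. Purely illustrative, based on the lists in this post.
MALE_SKEWED = {"codec", "solenoid", "golem", "mach", "humvee", "claymore",
               "scimitar", "kevlar", "paladin", "bolshevism", "biped", "dreadnought"}
FEMALE_SKEWED = {"taffeta", "tresses", "bottlebrush", "flouncy", "mascarpone",
                 "decoupage", "progesterone", "wisteria", "taupe", "flouncing",
                 "peony", "bodice"}

def guess_gender(known_words):
    known = {w.lower() for w in known_words}
    male_score = len(known & MALE_SKEWED)
    female_score = len(known & FEMALE_SKEWED)
    if male_score == female_score:
        return "undecided"
    return "more likely male" if male_score > female_score else "more likely female"

print(guess_gender(["kevlar", "solenoid", "paladin", "taupe"]))  # more likely male
```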

Take part in our vocabulary test to make the results even more fine grained!

The 20 least known words in English

Now that over 480,000 vocabulary tests have been completed, we can have a look at some of the findings. For instance, which words are not known at all in English? The following are the words that fewer than 3% of the participants in our test indicated were English words. For comparison, the fake words were endorsed by 8.3% of the participants on average. So, these are words that are not only unknown to everyone, but also unlikely to be 'mistaken' for a true English word. The funny thing is that they often have interesting meanings, including a weapon, a precious stone, animals, several descriptions of people, and so on.

Here they are, the 20 least known words of English, also the least liked words, cast aside by everyone!

You can still take part in our vocabulary test and contribute data.

AoA norms and Concreteness norms for 30,000 Dutch words

We have collected AoA norms and Concreteness norms for 30,000 Dutch words. If you use them, please refer to this publication:

  • Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80-84. pdf

Here you find the age of acquisition norms and the concreteness norms.

The AoA norms have been aggregated over the various studies that collected them (Ghyselinck et al., 2000, 2003; Moors et al., 2013; Brysbaert et al., 2014). If you cannot download the Excel files, you are most probably working with Internet Explorer; ironically, this browser cannot read Microsoft Excel files.

Keywords Dutch: verwervingsleeftijd, concreetheid, voorstelbaarheid.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Pictures of tools with matched objects and non-objects

As part of his PhD thesis on laterality, Ark Verma has developed a set of pictures of tools with matched objects and nonobjects that look as follows (click on the picture to get a bigger image):

[Example pictures]

You find the full set of pictures of tools, objects, and nonobjects here or here (svg format).

Please refer to the following article when you use the pictures. In this article you also find more information about them.

  • Verma, A., & Brysbaert, M. (2015). A validated set of tool pictures with matched objects and non-objects for laterality research. Laterality, 20, 22-48. pdf

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

The least loved words in Dutch

Although our vocabulary test was primarily meant to chart the words that are generally known, it is perfectly possible to look at which words are not recognized at all. Which words were selected by virtually no one? Which words received even fewer yes-responses than the fake words, not only because nobody knows them, but also because nobody finds them Dutch-sounding enough to hazard a guess? Which are these orphan words, the rejects of the Dutch language?

One would expect the list to contain animal and plant species known only to fanatical biologists. Indeed, among the least loved words we find a tropical deciduous tree (knippa) that nevertheless yields tasty fruit. We also find a twatwa (a black, finch-like bird from Suriname), alkanna (the shrub that gives us the red henna color) and sfagnum (a kind of peat moss). There is also the kamsin (a scorching wind from the Sahara that you had best avoid) and the gerenoek (a slender giraffe gazelle). Finally, chijl (chyle) also turns out to be unknown, even though we need it for a well-functioning gut.

Geology is represented by two epochs: the eemien and the ypresien. A pity, really, because the first is named after the small river Eem in Utrecht and the second after the city of Ieper (Ypres).

A number of words from Indonesia and Suriname also turn out not to belong. We already had the twatwa. There are also the golok (a kind of machete) and the romusha (an Indonesian forced laborer). The mosjav (an Israeli settlement) likewise seems to have had its best days. And some words related to Islam are not known either, such as moekim (a circle of residents attached to a mosque) and hoedna (worth knowing, though, because it refers to a truce with a non-Islamic enemy).

Tools, drinks and textiles that are no longer used are another source of orphan words. Nobody still knows wem (the widened end of an anchor arm), fijfel (a transverse flute), saguweer (a kind of palm wine), fep (a word for strong liquor), dawet (a non-alcoholic drink), or falbala (a kind of trimming on women's clothing or curtains). An ojief (an ogee, a profile that is concave at the bottom and convex at the top, or the other way around) is also regarded as an exotic stranger.

It is perhaps somewhat stranger that nobody knows a ghazel (a verse form in two-line stanzas), or bisbilles (bickering), or giegagen (to bray, to cry like a donkey), or goëtie (black magic). Yet these are perfect words for a text or a poem?!

So here it is: the list of the least loved words in Dutch. Recognized by no one and seen by no one as a potential family member of our language. The words that everyone pushes aside. (Look here for more explanation of each word.)

  • knippa
  • twatwa
  • alkanna
  • sfagnum
  • kamsin
  • gerenoek
  • chijl
  • eemien
  • ypresien
  • golok
  • romusha
  • mosjav
  • moekim
  • hoedna
  • wem
  • fijfel
  • saguweer
  • fep
  • dawet
  • falbala
  • ojief
  • ghazel
  • bisbilles
  • giegagen
  • goëtie

Those who are interested can still take part in the vocabulary test (woordentest).

The full results of the Groot Nationaal Onderzoek Taal (Great National Language Survey) can now also be bought in book form.

Our English vocabulary test (wordORnot) is online now

After the success of our Dutch vocabulary test, we've developed an English version (wordORnot). The task is the same: you get 100 letter sequences and you have to indicate which are existing English words and which are not. Guessing is discouraged, because you are penalized if you say "yes" to a nonword.

Our experience with the Dutch vocabulary test shows that in the beginning there are some questionable (not to say bad) nonwords and words (for which we apologize). These are next to unavoidable given that we are using so many stimuli. However, on the basis of the responses and the feedback we get (an example of crowdsourcing), the lists are regularly updated, so that after a few days or weeks (depending on the popularity of the test) these problematic cases should be gone. In general, problematic words or nonwords should not change the score by more than 5%.
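
For readers who wonder how such a penalty can work, here is a minimal sketch of a guessing correction of the kind hinted at above. We do not know the exact scoring rule of wordORnot; the sketch simply assumes a score of the form "proportion of words accepted minus proportion of nonwords accepted", with illustrative numbers.

```python
# Minimal sketch of a guessing-corrected vocabulary score (assumed rule):
# proportion of real words accepted minus proportion of nonwords accepted.
def corrected_score(yes_to_words, n_words, yes_to_nonwords, n_nonwords):
    """Guessing-corrected vocabulary score: hit rate minus false-alarm rate."""
    hit_rate = yes_to_words / n_words
    false_alarm_rate = yes_to_nonwords / n_nonwords
    return max(0.0, hit_rate - false_alarm_rate)

# e.g., saying "yes" to 60 of 70 words and to 3 of 30 nonwords
print(round(corrected_score(60, 70, 3, 30), 2))  # 0.76
```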

Enjoy the English Vocabulary Test!

See the Twitter trail here.

Read the first forum discussions after the launch of the test here (UK), here and here (USA).

Updates

  • Jan 31, 2014: After two days the test has been done 50K times already with lots of feedback

  • Feb 1, 2014: 100K tests completed

  • Feb 16, 2014: 200K

  • May 20, 2014: 480K. First cleaning of the lists. Words out: 300 problematic words (the letters, abbreviations, and some long compound words that are usually written as two words) plus 2,300 very low-frequency derived words ending in -ness or -ly (we had too many of them). Words in: 1,300 words from a new frequency list (many science-related words). Nonwords out: 8,000 with false acceptance rates of more than 33% (such as ammicably, peachness, …). Nonwords in: 22,000 nonwords that look like science words or monosyllabic nonwords from the ARC nonword database (because many of the nonwords that had to be dropped were monosyllabic).

Here you find the Dutch test (woordentest).

The results of the Woordentest 2013

Between March 16, 2013 and December 15, 2013, a Groot Nationaal Onderzoek Taal (Great National Language Survey) was organized by our center, Ghent University, and the Dutch broadcasters NTR and VPRO in collaboration with NWO. Below we will refer to it as the Woordentest 2013.

Participants were asked to complete a test of about 4 minutes. Each test consisted of the presentation of 100 letter strings (one by one), for each of which the participants had to decide whether or not it was a Dutch word they knew. To discourage guessing, some 30 of the letter strings were fake words, and the score went down whenever "yes" was said to one of these fake words.

The results were made public in Labyrint broadcasts on Nederland 2 (Sunday, December 15) and CANVAS (Monday, December 16).

You can also buy a book about it.

Report with findings

The findings are described in this report.

Here you find an English summary based on a talk we gave in Leiden for computational linguists (CLIN24).

Summary

These were the most important results:

  • This report describes the most important findings of the Groot Nationaal Onderzoek Taal (Great National Language Survey), organized between March 16, 2013 and December 15, 2013 by Ghent University and the Dutch broadcasters NTR and VPRO in collaboration with NWO.

  • Each test consisted of the presentation of 100 letter strings (one by one), for each of which the participant had to decide whether or not it was a Dutch word they knew. To discourage guessing, some 30 of the letter strings were fake words, and the score went down whenever "yes" was said to one of these fake words.

  • Because 735 different lists were used, we can make statements about almost 53,000 Dutch words.

  • Over 600,000 tests were completed by slightly fewer than 400,000 participants (almost 2% of the Dutch-speaking population). Of these, 212,000 participants came from the Netherlands and 180,000 from Belgium. Proportionally, the Flemish thus participated more.

  • There were three types of participants: 76% took part once; 20% did the test a few times and stopped once they reached a higher score; the remaining 4% did the test at least 10 times (with a maximum of 489 times). The latter were usually people who started with a high score and thus have a great interest in the Dutch language.

  • The most common score is 75.5%. There is, however, a clear effect of age. Vocabulary grows steadily between 12 and 80 years (the extremes we could test): 12-year-olds know on average 50% of the words, 80-year-olds on average 80%. This is a difference of almost 16,000 words.

  • There is also an effect of education level: the higher the degree obtained, the more words one knows on average.

  • There is a difference of 1.5% between the Netherlands and Belgium in favor of the Netherlands. This difference is due to the lower scores in Belgium than in the Netherlands among participants older than 40.

  • Participants who speak several languages in addition to their native Dutch know a larger number of Dutch words. The effect is cumulative: those who speak four languages know more Dutch words than those who speak three, and those who speak three languages know more Dutch words than those who speak two.

  • The Dutch and the Flemish share a vocabulary of 16,000 words (known by 97.5% of all participants). By the same criterion, the Flemish know 2,000 additional words and the Dutch 5,000 additional words. Of these, 1,250 are typically Southern Dutch words (such as foor and pagadder) and 1,900 typically Northern Dutch words (kliko, vlaflip and salmiak). There is thus a larger shared vocabulary within the Netherlands than within Belgium.

  • Some words are better recognized by men than by women and vice versa (e.g., mandekker vs. sleehak).

  • The linguistic dividing line clearly falls on the national border. The Dutch and Belgian provinces form two separate clusters when the similarities in word knowledge between the provinces are examined.

Lists

The lists below are preliminary for three reasons:

  1. They are based on the 370 thousand participants up to the end of October (whereas we hope to have 500 thousand by the end of the year).

  2. They are simple averages, which do not take individual differences in guessing behavior into account. To correct for this, we need to run a Rasch analysis, but that will take time, given the size of the database.

  3. In the last update at the beginning of December 2013, 3,000 new (longer) words were added. These are not yet part of the lists below.

Files

  • Word knowledge in the Netherlands vs. Belgium (Excel, text)

  • Word knowledge of men vs. women in the Netherlands and Belgium (Excel, text)

  • Word knowledge by age in the Netherlands and Belgium (Excel, text)

  • Word knowledge by education level in the Netherlands and Belgium (Excel, text)

  • Word knowledge per province with more than 7,500 participants (Excel, text)

  • Accuracy on the fake words in the Netherlands and Belgium (Excel, text)

  • Accuracy on the fake words per province with more than 7,500 participants (Excel, text)

  • List of words that were removed in the three revisions because they are no longer used, are misspelled, or are too easily confused with the fake words (Excel, text)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Results of the author test available for Flanders

Because the author test was publicized via newspapers (in particular De Standaard), we have results for Flanders much sooner than expected. Based on the first week, these are the most important findings:

  • Twenty thousand Flemish and five thousand Dutch participants took the author test. Because the Dutch share is too small, we limit the analysis to the Flemish participants for now. Hopefully there will soon be enough responses from the Netherlands.

  • Most participants were readers of quality newspapers and belong to the audience that publishers mainly target.

  • Herman Brusselmans is the author with the greatest name recognition in Flanders. He was recognized by all participants. He is followed by J.R.R. Tolkien, Hugo Claus, William Shakespeare, Dimitri Verhulst and Bart Moeyaert, with 99% name recognition.

  • Only 55 names were recognized by more than 90% of the participants. These include 22 names of Belgian authors, 10 names of British writers, 5 names each of American, French and Dutch writers, and one name each from Colombia, Denmark, Germany, Greece, Italy, Russia, Sweden and Switzerland.

  • The list contains a number of authors who are probably not known primarily for their books, but who are taught about at school or who have a prominent place in the media. Also interesting is that the list includes authors of youth literature and of comic books.

  • Fewer than 500 additional authors are recognized by half of the participants; 80% are recognized by less than a quarter of the participants.

  • The name recognition of almost 15 thousand authors in Flanders can be looked up in this report that we have written. These data are also available in an Excel file.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

SUBTLEX-UK: Subtitle-based word frequencies for British English

Attentive readers may have noticed that we have underused the data from the British Lexicon Project in our publications thus far, focusing more on the (American) English Lexicon Project. This was because we felt uneasy about using word frequencies from American English to predict word processing times in British English.

At long last, together with Walter van Heuven from Nottingham University, we have now compiled word frequency norms for British English based on subtitles: SUBTLEX-UK.

As expected, these norms explain 3% more variance in the lexical decision times of the British Lexicon Project than the SUBTLEX-US word frequencies. They also explain 4% more variance than the word frequencies based on the British National Corpus, further confirming the superiority of subtitle-based word frequencies over written-text-based word frequencies for psycholinguistic research. In contrast, the norms explain 2% less variance in the English Lexicon Project than the SUBTLEX-US norms.

The SUBTLEX-UK word frequencies are based on a corpus of 201.3 million words from 45,099 BBC broadcasts. There are separate measures for pre-school children (the Cbeebies channel) and primary school children (the CBBC channel). For the first time we also present the word frequencies as Zipf-values, which are very easy to understand (values 1-3 = low frequency words; 4-7 = high frequency words) and which we hope will become the new standard.
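
For readers who want to convert raw counts into Zipf values themselves, here is a minimal sketch. We assume the definition used for the Zipf scale: the base-10 logarithm of a word's frequency per million words, plus 3 (equivalently, the log10 of its frequency per billion words); the counts in the example are made up.

```python
# Minimal sketch: convert a raw corpus count into a Zipf value (roughly 1-7).
import math

def zipf_value(raw_count, corpus_size_in_words):
    """Zipf value: log10 of the word's frequency per billion words."""
    frequency_per_million = raw_count / (corpus_size_in_words / 1_000_000)
    return math.log10(frequency_per_million) + 3

# A word occurring 2,013 times in a 201.3-million-word corpus has a frequency
# of 10 per million, i.e. a Zipf value of 4 (just inside the high-frequency range).
print(zipf_value(2013, 201_300_000))  # 4.0
```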

You can do online searches in the database here.

You also find lists with the word frequencies here:

  • SUBTLEX-UK: A cleaned Excel file with word frequencies for 160,022 word types (also available as a text file). This file is ideal for those who want to use British word frequencies.
  • SUBTLEX-UK_all: An uncleaned Excel file with entries for 332,987 word types, including numbers. To be used for entries not in the cleaned version.
  • SUBTLEX-UK_bigrams: A csv-file with information about word pairs. Contains nearly 2 million lines of information and, hence, cannot be opened in a simple Excel file.

Further information about the collection of the SUBTLEX-UK word frequencies can be found in the article below (please cite):

Van Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). Subtlex-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190. pdf

Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Symposium ‘Dyslexie in het hoger onderwijs’ (Dyslexia in higher education)

We will soon be organizing a symposium on dyslexia in higher education, based on the findings of our large-scale dyslexia study.

You find more information here.