
Our paper on measuring vocabulary size and word prevalence is now in press

Our paper “Word knowledge in the crowd: measuring vocabulary size and word prevalence using massive online experiments” is now in press in The Quarterly Journal of Experimental Psychology.

The word prevalence values used in this paper, for 54,319 Dutch words in Belgium and the Netherlands, can be found on this page.

In this paper, we have analyzed part of the data from our online vocabulary test (http://woordentest.ugent.be) in which hundreds of thousands of people from Belgium and the Netherlands participated.

Important results from this paper:

  • Word prevalence, the proportion of people who know a word, appears to be the most important variable in predicting visual word recognition times in the lexical decision task. We conjecture that this is because, in the low range, word prevalence estimates how often words really occur better than word frequency does.
  • A person’s vocabulary accumulates throughout life in a predictable way: the number of words known increases logarithmically with age.
  • This result mirrors the growth of the number of unique words encountered as a text gets longer (known as Herdan’s law in quantitative linguistics). Here it is demonstrated for the first time for human language acquisition (see the sketch after this list).
  • Knowing more foreign languages increases rather than decreases vocabulary in your first language. This is probably a result of the vocabulary shared between languages and the faster growth in new types when acquiring a new language.
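To make the parallel with Herdan’s law concrete, here is a minimal R sketch (our own illustration, not an analysis from the paper) that draws tokens from a toy Zipf-distributed vocabulary and tracks how many distinct word types have been encountered; the vocabulary size and exposure values are arbitrary assumptions.

```r
# Illustration only: distinct types encountered grow sublinearly with the
# number of tokens sampled from a Zipf-like vocabulary (Herdan's law).
set.seed(1)
vocab_size <- 50000                                            # assumed toy vocabulary
token_prob <- (1 / (1:vocab_size)) / sum(1 / (1:vocab_size))   # Zipfian token probabilities

tokens_read <- 10^(3:6)                                        # exposure: 1e3 ... 1e6 tokens
types_known <- sapply(tokens_read, function(n) {
  ids <- sample.int(vocab_size, size = n, replace = TRUE, prob = token_prob)
  length(unique(ids))                                          # distinct types seen so far
})

print(data.frame(tokens_read, types_known))                    # roughly V ~ N^beta
```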

 

Comments on ‘Orthographic Processing in Baboons (Papio papio)’

We were recently asked by Nature News to comment on the Science paper by Grainger and colleagues showing that baboons can acquire orthographic processing skills, and to clarify its relation to orthographic processing in humans. I wrote up some comments, of which Nature published just one in their article “Baboons can learn to recognize words”, but they were kind enough to link to our website, so I’m posting the remainder here.

What Grainger and colleagues have shown is that baboons can learn the ‘written fingerprint’ of a language without knowing the language.

For English speakers, the task the baboons were given may become more intuitive if they imagine doing an experiment in a language they don’t know. For instance, imagine that you are seated in front of a computer screen. You are presented with a letter sequence that is either an existing word in the Basque language or a distractor letter sequence (a nonword), and you have to decide which it is. Since Basque is unlike any language you know, you have to guess, and you are told whether your guess is correct. After some trials, however, you start noticing similarities between the letter sequence you are presented with and letter sequences you have seen before. Based on the feedback you get, you start making informed guesses about which stimuli are Basque words and which are nonword distractors.

The difficulty of this task depends on the kind of distractor stimuli. Below is a sequence of Basque five-letter words mixed with five-letter distractors that are just random sequences of letters. It is easy to tell which strings are the Basque words, because the distractors bear no relation to Basque orthographic patterns.

ezfec erosi tafqp ontsa wlftk
eurak edkzt tjtsj pjfwl puska
pscwf cobbf busti gosez medio

(bold: words, regular: nonwords)

Now imagine that you have to do the same, but with the following sequence. This is much harder, because the nonwords are derived from the same orthographic patterns as the words.

ordez oinez salmo koroa oirat
gorga adere eupez surda halbo
zerga berme agiri gekal edeti

(bold: words, regular: nonwords)

What the baboons did fell between the first and the second task in difficulty. The nonwords were composed primarily of bigrams (letter pairs) that occur very rarely in English words, while the words were composed primarily of bigrams that occur very often in English words. The baboons thus learned to discriminate between orthographically very typical and orthographically very atypical English letter strings. What’s more, Grainger and colleagues also showed that the less similar a nonword was to previously presented words, the higher the probability that the baboons would give a nonword response.
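To make the bigram idea concrete, here is a small R sketch (ours, not the materials or code from the Science paper) that scores strings by the average frequency of their bigrams in a tiny made-up lexicon; high scores correspond to orthographically typical strings, low scores to atypical ones.

```r
# Toy bigram-typicality scoring; the five-word "lexicon" is invented for illustration.
lexicon <- c("table", "cable", "title", "tiles", "cater")

bigrams <- function(w) {
  chars <- strsplit(w, "")[[1]]
  paste0(head(chars, -1), tail(chars, -1))      # successive letter pairs
}

bigram_counts <- table(unlist(lapply(lexicon, bigrams)))   # bigram frequencies in the lexicon

typicality <- function(w) {
  b <- bigrams(w)
  mean(ifelse(b %in% names(bigram_counts), bigram_counts[b], 0))
}

typicality("cable")   # high: built from frequent bigrams
typicality("xqzvk")   # near zero: built from unseen bigrams
```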

Grainger and colleagues also analyzed data from the British Lexicon Project, a very large experiment that we published recently (Keuleers, Lacey, Rastle & Brysbaert, 2012, [open access]), and found traces of the same behavior in humans. In our experiment, each of 78 participants responded to nearly 30,000 trials, deciding whether a presented letter sequence was an English word or not. Of course, the main difference was that our participants knew most of the English words, so they did not have to rely on the statistical regularities in orthographic patterns to make a decision. Unlike Grainger and colleagues, we had also made it exceptionally difficult for our participants to distinguish between the words and the nonwords on the basis of these orthographic patterns. Still, Grainger and colleagues found that, in addition to their knowledge of English and despite these precautions, our participants partly relied on the orthographic similarity between the current stimulus and previously presented stimuli to decide whether a stimulus was a word, something we had also reported earlier (Keuleers & Brysbaert, 2011 [preprint]).

The new study adds to the evidence that orthographic processing can occur without linguistic processing. More importantly, showing this in baboons demonstrates that orthographic processing can be independent of the capacity to acquire high-level linguistic skills.

The new findings have no immediate practical use. However, they do have implications for research in language acquisition, bilingualism, visual word recognition, emotional processing, executive control, and many other fields where word/nonword decision experiments are frequently used with human participants. Usually, the reaction time to words is the variable of interest, and the basic assumption in all of these experiments is that the meaning of the presented words is activated when the decision is made. If such an experiment can be performed accurately by baboons, it is clear that the experiment does not require access to the meaning of the words, and its results are compromised. Therefore, in ordinary experiments, the nonwords must be chosen meticulously so that the difference between words and nonwords is minimized. We have written a free application called Wuggy to do this (http://crr.ugent.be/Wuggy). Researchers use it to generate nonwords that match the orthographic patterns of words as closely as possible, for languages from English to Vietnamese (Keuleers & Brysbaert, 2010 [preprint]).

Since not everyone has access to baboons to check whether their experiment is valid, we have also written an algorithm (Keuleers & Brysbaert, 2011 [preprint][code]) that tries to perform this type of experiment as accurately as possible without knowing the language. The mechanism used by the algorithm (exemplar-based learning) is very similar to the one that I hypothesize the baboons used. We intend to look at Grainger and colleagues’ data to see how similar they are.
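As an illustration of the exemplar-based idea (a sketch of the general principle, not the published LD1NN code), each new stimulus can be classified by the word/nonword status of its most similar previously presented stimuli, with Levenshtein distance computed by base R’s adist(); the trial list and labels below are invented.

```r
# Exemplar-based sketch: predict lexicality from the nearest earlier trials.
stimuli <- c("tarna", "plixo", "sorne", "vlerk", "manto", "qwotz")  # invented trial list
is_word <- c(TRUE,    FALSE,   TRUE,    FALSE,   TRUE,    FALSE)    # invented labels

predict_from_history <- function(stimuli, is_word, k = 1) {
  preds <- rep(NA, length(stimuli))
  for (i in 2:length(stimuli)) {
    d <- adist(stimuli[i], stimuli[1:(i - 1)])       # Levenshtein distances to earlier trials
    nearest <- order(d)[seq_len(min(k, i - 1))]
    preds[i] <- mean(is_word[nearest]) >= 0.5        # majority vote of the neighbours
  }
  preds
}

# Above-chance accuracy would indicate that words and nonwords can be partly
# told apart from orthographic similarity to earlier stimuli alone.
mean(predict_from_history(stimuli, is_word)[-1] == is_word[-1])
```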

References

Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627-633. [preprint]

Keuleers, E., & Brysbaert, M. (2011). Detecting inherent bias in lexical decision experiments with the LD1NN algorithm. The Mental Lexicon, 6(1). [preprint]

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287-304, doi: 10.3758/s13428-011-0118-4 [open access]

Visual Word Recognition (vwr) package for R released

We have just released vwr, an R package to assist in computations often needed in visual word recognition research.

The manual for the package can be found here.

The vwr package includes functions to:

  • Compute Levenshtein distances between strings
  • Compute Hamming distances (overlap distance) between strings
  • Compute neighbors based on Levenshtein and Hamming distances
  • Compute Coltheart’s N and average Levenshtein distances (e.g., Yarkoni et al.’s OLD20 measure). These functions run in parallel on multiple cores and offer a major speed advantage when computing these values for large lists of words. (A standalone sketch of this computation follows the list.)
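As an illustration of what the OLD20-style computation involves (a rough base-R sketch, not the vwr API, which is optimized and parallelized), the average Levenshtein distance from a target to its closest lexicon neighbours can be written as follows:

```r
# Standalone sketch of the OLD20 idea: mean Levenshtein distance from a target
# to its n closest words in a lexicon (adist() is base R's Levenshtein distance).
old_n_sketch <- function(targets, lexicon, n = 20) {
  sapply(targets, function(w) {
    others <- lexicon[lexicon != w]               # exclude the target itself, if present
    d <- adist(w, others)
    mean(sort(d)[seq_len(min(n, length(d)))])     # average over the n nearest neighbours
  })
}

# Toy usage with a made-up mini-lexicon:
mini_lexicon <- c("cat", "cot", "coat", "cart", "care", "core", "case", "cast")
old_n_sketch(c("cat", "zyxqu"), mini_lexicon, n = 3)
```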

The package also includes the ldknn algorithm, a method we recently proposed to examine how balanced a lexical decision task is (i.e., how easy it is to discriminate the words from the nonwords in an experiment given no information other than the stimuli themselves). A preliminary version of that paper can be found here.

As the package is listed on CRAN, it can be installed like any other official package for R.

We’d be happy to have your feedback. As this is a first version, it is certainly not error-free. Also, feel free to pass it on to interested colleagues for testing.

New Wuggy Version with Vietnamese language module

We just released a new version of Wuggy, our nonword generator. This version includes a Vietnamese language module (thanks to Hien Pham), and many improvements to the German language module.

Understanding the Part of Speech (PoS) information in SUBTLEX-NL

Marc Brysbaert & Emmanuel Keuleers

In processing the subtitle corpus on which SUBTLEX-NL was based, we used the wonderful Tadpole software made freely available by Tilburg University’s ILK lab. Tadpole allowed us to distinguish meaningful word units from punctuation in the text (tokenizing), to tag these units according to their part of speech, such as noun, verb, or adjective (PoS tagging), and to group related forms under a single heading, as would be done in a dictionary (lemmatizing): for instance, a verb with its present-tense and past-tense forms, or a noun with its plural and diminutive forms.
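As a toy illustration of what these three steps produce (our own schematic example, not Tadpole’s actual output format or tag set), a tagged and lemmatized Dutch sentence can be thought of as a table of tokens, tags, and lemmas:

```r
# Schematic example only: tag labels and layout are assumptions for illustration.
data.frame(
  token = c("De",  "katten", "liepen", "weg",  "."),
  pos   = c("Art", "Noun",   "Verb",   "Adv",  "Punct"),
  lemma = c("de",  "kat",    "lopen",  "weg",  ".")
)
```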

[…]

How often are Dutch words used?

A new, easy-to-use database of word frequencies

Marc Brysbaert & Emmanuel Keuleers

Department of Experimental Psychology, Ghent University

Marc.brysbaert@ugent.be

Research has shown that we recognize most quickly the stimuli we learned early in life and have seen often. This has important consequences for learning new material (e.g., when learning to read or learning a new language): the first words we learn and the words we see and hear most often are the ones we remember best. This is good news, because it gives teachers and practitioners leverage to optimize the learning process. Recent research at our centre suggests that we already recognize a word noticeably faster once we have read it 20 times than when we have read it only 10 times. The so-called word frequency effect is thus not an effect that matters only for words we have already seen hundreds of times!

Which, then, are the most frequent words in Dutch? To answer this question, researchers have scanned texts and counted the words in them. The best-known word frequency list is the CELEX list, compiled by the Max Planck Institute in Nijmegen in the 1980s and 1990s. Unfortunately, this list is rather difficult to use (see http://celex.mpi.nl/) and, moreover, its word frequencies are not always accurate, because they are based on rather old texts written for adults.

We obtain better word frequency measures when we use subtitles from films and television programmes. This is the case in every language we have tested: French, English, Chinese, Spanish, and also Dutch. When we look at which words students read accurately and quickly, subtitle-based word frequencies are a better predictor than the CELEX frequencies. You can read more about this in Keuleers, Brysbaert, & New (2010). An additional advantage is that the subtitle-based word frequencies are freely available: they can be downloaded from the internet or even queried directly. The Excel file is organized as follows:

The words are ordered from high-frequency to low-frequency (so “ik” and “je” are the words that occur most often in subtitles). The columns have the following meaning (a short R sketch relating the derived columns to the raw counts follows the list):

  • FREQcount: the number of times the word occurs in the corpus (out of a total of 43.8 million words).
  • CDcount: the number of films/programmes in which the word occurs (out of a total of 8,070).
  • FREQlow and CDlow: the same information, but only for occurrences in which the word starts with a lowercase letter. This provides important information about which words mostly begin with a capital letter, as is the case, for example, for “wat” (which starts with a lowercase letter 220 thousand times and with a capital letter 260 thousand times).
  • FREQlemma: the sum over the word’s inflected forms (e.g., for “koord” this is the sum of “koord” and “koorden”).
  • SUBTLEXWF: frequency per 1 million words.
  • Lg10WF: logarithm of the frequency count.
  • SUBTLEXCD: frequency per 100 documents.
  • Lg10CD: logarithm of the number of documents.
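For illustration, the derived columns follow from the raw counts and the corpus totals mentioned above; the R sketch below shows these relations (the file name is a placeholder for the downloaded file, and the exact definition of the log columns is assumed by analogy with the other SUBTLEX databases).

```r
# Placeholder file name; use the SUBTLEX-NL file downloaded from the site.
subtlex <- read.delim("SUBTLEX-NL.txt", fileEncoding = "UTF-8")

corpus_tokens <- 43.8e6   # total number of words in the corpus (see above)
n_documents   <- 8070     # total number of films/programmes (see above)

subtlex$wf_per_million <- subtlex$FREQcount / corpus_tokens * 1e6   # cf. SUBTLEXWF
subtlex$cd_per_100     <- subtlex$CDcount   / n_documents   * 100   # cf. SUBTLEXCD
subtlex$log10_wf       <- log10(subtlex$FREQcount + 1)              # cf. Lg10WF (assumed +1)
subtlex$log10_cd       <- log10(subtlex$CDcount + 1)                # cf. Lg10CD (assumed +1)
```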

Users who want to know more about the use of words in their different syntactic roles (e.g., “spelen” as a verb vs. as a noun) will find what they need in the file with lemmas and word forms. You can also look up the frequencies of the words that interest you directly here (see also here for more information about the online queries).

The subtitle corpus yields a few counterintuitive insights. It turns out, for instance, that with a vocabulary of just 61 words we can already understand half of the words people use in conversation! One in five of the words we say is “ik”, “je”, “het”, “de”, “dat”, “is”, “niet”, or “een”. With a vocabulary of 1,000 words you already understand 82% of the words that are said. The percentages for 2,000 and 3,000 words are 87% and 90% respectively (which does mean that with a vocabulary of 3,000 words we still fail to understand one in every ten words). Insight into such regularities allows us to optimize our teaching and language support considerably: it is far more worthwhile to practice a relatively small set of frequently used words thoroughly and repeatedly than to go through a long list only once.
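The coverage percentages above come down to a cumulative sum over the sorted frequency counts; here is a minimal sketch, assuming the subtlex data frame loaded in the earlier sketch (where the file name was a placeholder).

```r
# Proportion of running text covered by the n most frequent words.
counts   <- sort(subtlex$FREQcount, decreasing = TRUE)
coverage <- cumsum(counts) / sum(counts)

round(coverage[c(61, 1000, 2000, 3000)], 2)   # cf. the percentages quoted above
```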

English, for example, has about 400 irregular verbs (some of them compounds, such as “outrun”). An uninteresting way to teach them is to try to cover them all, since many of these verbs hardly ever occur in everyday language (e.g., abide, alight, backslide, befall, beget, behold, bend, bereave, beseech, …). A far more interesting teaching method starts from the question of how often a learner will actually need a word and then determines which verbs are the most important (for English subtitle frequencies, see http://expsy.ugent.be/subtlexus). It then turns out that a list of about 50 verbs covers the vast majority of all verbs used. The following verbs would certainly be on that list: be, have, do, know, get, like, go, come, think, see, let, take, tell, say, make, give, find, put, think, win, keep, feel, make, leave, hear, show, understand, hold, meet, run, bring. It is much more important to practice these words well than to half-know the complete list of verbs.

SUBTLEX frequencies are a powerful predictor of number processing speed

Our colleague Ineke Imbo recently used the SUBTLEX-NL word frequencies in her research on number processing and found that the SUBTLEX-NL frequencies for the numbers 1-99 are the most powerful predictor of their naming latencies.


Table: Correlation between number naming latencies and several frequency measures. All frequency measures were log transformed.

Frequency measure                                                        Dutch naming latencies (L1)   English naming latencies (L2)
SUBTLEX-NL number frequencies (Keuleers, Brysbaert, & New, 2010)                 -.537                          -.609
SUBTLEX-NL number word frequencies (Keuleers, Brysbaert, & New, 2010)            -.518                          -.642
GOOGLE number frequencies (Verguts & Fias, 2006)                                 -.513                          -.539
ESTIMATED number frequencies (Gielen et al., 1991)                               -.427                          -.570

Twenty students at Ghent University, Belgium (sixteen women, four men; mean age 19.7), named all numbers from 1 to 99 in two blocks. In one block they named the numbers in Dutch (L1), and in the other block they named them in English (L2). Each number was presented four times per block. A trial started with a fixation point (500 ms), after which the number was presented in Arabic digits. Timing began when the stimulus appeared and ended when the participant’s response triggered the sound-activated relay. Invalid trials due to naming errors or faulty time registration were removed. The table above shows the correlations between the Dutch and English naming latencies, on the one hand, and the different frequency measures, on the other. The log-transformed frequencies from the SUBTLEX-NL database correlate best with the observed naming latencies, and this holds for both languages (L1 and L2).
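For illustration, the analysis reported in the table amounts to correlating log-transformed frequencies with mean naming latencies; the sketch below uses simulated data purely so that it runs (it is not the original analysis script or data).

```r
# Simulated stand-in data for the numbers 1-99; real latencies come from the experiment.
set.seed(2)
freq_count <- round(10000 / (1:99))                                    # fake Zipf-like counts
latency_ms <- 450 - 30 * log10(freq_count + 1) + rnorm(99, sd = 15)    # fake naming latencies

cor(latency_ms, log10(freq_count + 1))   # negative, as for the values in the table
```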