SUBTLEX-UK: Subtitle-based word frequencies for British English

Attentive readers may have noticed that we have underused the data from the British Lexicon Project in our publications thus far, focusing more on the (American) English Lexicon Project. This was because we felt uneasy about using word frequencies from American English to predict word processing times in British English.

At long last, together with Walter van Heuven from Nottingham University, we have now analysed word frequency norms for British English based on subtitles: SUBTLEX-UK.

As expected, these norms explain 3% more variance in the lexical decision times of the British Lexicon Project than the SUBTLEX-US word frequencies. They also explain 4% more variance than the word frequencies based on the British National Corpus, further confirming the superiority of subtitle-based word frequencies over written-text-based word frequencies for psycholinguistic research. In contrast, the SUBTLEX-UK norms explain 2% less variance in the lexical decision times of the English Lexicon Project than the SUBTLEX-US norms.

The SUBTLEX-UK word frequencies are based on a corpus of 201.3 million words from 45,099 BBC broadcasts. There are separate measures for pre-school children (the CBeebies channel) and primary school children (the CBBC channel). For the first time we also present the word frequencies as Zipf values, which are very easy to understand (values 1-3 = low-frequency words; 4-7 = high-frequency words) and which we hope will become the new standard.
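
For readers who want to see how a raw count maps onto the Zipf scale, here is a minimal sketch in Python. It assumes the usual definition of the scale, log10 of the frequency per million words plus 3 (equivalently, log10 of the frequency per billion words). The published values also include a small smoothing correction for unseen words, which is left out here, so the numbers in the files may differ slightly. The function names and the "intermediate" label for values between 3 and 4 are our own illustrative choices, not part of the database.

```python
import math

# Corpus size reported above: 201.3 million words.
CORPUS_TOKENS = 201_300_000


def zipf_value(count: int, corpus_tokens: int = CORPUS_TOKENS) -> float:
    """Return log10(frequency per million words) + 3 for a raw count.

    Simplified: no smoothing is applied, so the count must be positive.
    """
    if count <= 0:
        raise ValueError("count must be positive; unseen words need smoothing")
    per_million = count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


def frequency_band(zipf: float) -> str:
    """Rule of thumb from the text: 1-3 is low frequency, 4-7 is high.

    The label for values between 3 and 4 is an assumption for illustration.
    """
    if zipf <= 3:
        return "low"
    if zipf >= 4:
        return "high"
    return "intermediate"


if __name__ == "__main__":
    # Example raw counts, chosen only to illustrate the scale.
    for count in (20, 2_000, 200_000):
        z = zipf_value(count)
        print(f"count={count:>7}  Zipf={z:.2f}  band={frequency_band(z)}")
```

On this scale, a word that occurs once per million words gets a Zipf value of 3, and each additional point corresponds to a tenfold increase in frequency.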

You can do online searches in the database here.

You can also find lists with the word frequencies here:

  • SUBTLEX-UK: A cleaned Excel file with word frequencies for 160,022 word types (also available as a text file). This file is ideal for those who want to use British word frequencies.
  • SUBTLEX-UK_all: An uncleaned Excel file with entries for 332,987 word types, including numbers. To be used for entries not in the cleaned version.
  • SUBTLEX-UK_bigrams: A CSV file with information about word pairs. It contains nearly 2 million lines and therefore exceeds the row limit of an Excel worksheet (1,048,576 rows), so it cannot be opened in full in Excel.
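
Because the bigram file is too large for a spreadsheet, one option is to read it line by line in a script. The sketch below uses Python's standard csv module and only keeps the rows you ask for, so memory use stays small. The file name and column names in the usage example are hypothetical placeholders; check the header line of the actual file and substitute the real names.

```python
import csv
from typing import Dict, Iterator, List


def iter_rows(path: str, encoding: str = "utf-8") -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict keyed by the file's header line."""
    with open(path, newline="", encoding=encoding) as handle:
        yield from csv.DictReader(handle)


def find_pairs(path: str, column: str, word: str, limit: int = 10) -> List[Dict[str, str]]:
    """Collect up to `limit` rows whose `column` equals `word`.

    `column` must match a name in the header; rows without it are skipped.
    """
    matches: List[Dict[str, str]] = []
    for row in iter_rows(path):
        if row.get(column) == word:
            matches.append(row)
            if len(matches) >= limit:
                break
    return matches
```

A call might look like `find_pairs("SUBTLEX-UK_bigrams.csv", "Word1", "cup")`, where both the file name and the column name `Word1` are assumptions made for illustration.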

Further information about the collection of the SUBTLEX-UK word frequencies can be found in the article below (please cite):

Van Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
