Part-of-Speech information added to the SUBTLEX-US word frequencies

We have now tagged the SUBTLEX-US corpus with the CLAWS tagger, so that we can add Part-of-Speech (PoS) information to the SUBTLEX-US word frequencies. Five new columns have been added to the file:

  1. The dominant (most frequent) PoS of each entry
  2. The frequency of the dominant PoS
  3. The relative frequency of the dominant PoS
  4. All PoS observed for the entry
  5. The frequency of each PoS
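
As a rough illustration of how the five columns relate to one another, the sketch below derives them from a set of per-tag counts for a single entry. It is not the procedure used to build the file. The function name `pos_columns`, the dictionary keys, and the "." separator are hypothetical, and the relative frequency is assumed to be the dominant count divided by the sum of all tagged counts. The example counts are those reported for ll further down in this post.

```python
from collections import Counter


def pos_columns(tag_counts, sep="."):
    """Derive five PoS columns from a mapping of tag -> count.

    Assumptions (not taken from the SUBTLEX-US documentation):
    - tags and counts are listed in descending order of frequency;
    - the relative frequency is dominant count / sum of all tagged counts;
    - `sep` is a hypothetical separator for the two list-valued columns.
    """
    ranked = Counter(tag_counts).most_common()  # highest count first
    total = sum(count for _, count in ranked)
    dom_pos, dom_freq = ranked[0]
    return {
        "dominant_pos": dom_pos,
        "dominant_freq": dom_freq,
        "dominant_share": dom_freq / total,
        "all_pos": sep.join(tag for tag, _ in ranked),
        "all_freqs": sep.join(str(count) for _, count in ranked),
    }


# Counts for "ll" after parsing: 1,290 noun and 22 name.
row = pos_columns({"Name": 22, "Noun": 1290})
print(row["dominant_pos"], row["dominant_freq"], row["all_pos"], row["all_freqs"])
```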

More information about the tagging can be found in Brysbaert, New, & Keuleers (Behavior Research Methods, in press).

A zipped Excel version of the SUBTLEX-US word frequency file with PoS information is available here.

A zipped text version of the file is available here.

More information about the SUBTLEX-US word frequencies can be found here.

A demo showing how to easily enter SUBTLEX information into your stimulus Excel file is available here.

After publication of the files, Kati Renvall alerted us to the fact that verb abbreviations (like ll, couldn, and doesn) are predominantly classified as Nouns. A look at columns B (FREQcount) and N (All_freqs_SUBTLEX) shows why this is the case. Of the 224,097 times ll was observed in the corpus, only 1,312 remained after parsing (because the others were translated to will and shall). Of the remaining 1,312, 1,290 were classified as noun and 22 as name. This is why the dominant PoS of ll is listed as Noun in the processed file. Thanks for this feedback! It shows how careful one must be with the outcome of algorithms. We intend to correct these entries manually in future versions. In the meantime, always compare the frequencies of the parsed entries (column N) with those of the initial count (column B), to make sure the dominant PoS indeed applies to the majority of cases!
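
The comparison between column N and column B can be sketched in a few lines of Python. This is only a minimal illustration: the function name `parsed_coverage`, the 50% threshold, and the assumption that the per-PoS counts in All_freqs_SUBTLEX are stored as a "."-separated string are ours, so check the separator against the actual file before relying on it.

```python
def parsed_coverage(freq_count, all_freqs, sep="."):
    """Compare the parsed PoS counts (column N) with the initial count (column B).

    freq_count: value of FREQcount for the entry.
    all_freqs:  value of All_freqs_SUBTLEX, assumed here to be counts
                joined by `sep` (an assumption about the file format).
    """
    if freq_count <= 0:
        raise ValueError("freq_count must be positive")
    counts = [int(part) for part in str(all_freqs).split(sep) if part]
    parsed_total = sum(counts)
    dominant = max(counts) if counts else 0
    return {
        "parsed_total": parsed_total,
        # Share of the initial occurrences that survived parsing.
        "coverage": parsed_total / freq_count,
        # Share of the initial occurrences carrying the dominant PoS.
        "dominant_share_of_initial": dominant / freq_count,
        # Hypothetical rule of thumb: trust the dominant PoS only if it
        # covers more than half of the initial count.
        "trustworthy": dominant / freq_count > 0.5,
    }


# The "ll" entry: 224,097 initial occurrences, 1,290 noun and 22 name after parsing.
check = parsed_coverage(224097, "1290.22")
print(check)
```

For ll, the parsed total is 1,312 out of 224,097, well under 1% of the initial count, so the entry is flagged as not trustworthy under this rule.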

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
