Understanding the Part of Speech (PoS) information in SUBTLEX-NL

Marc Brysbaert & Emmanuel Keuleers

In processing the subtitle corpus on which SUBTLEX-NL is based, we used the wonderful Tadpole software made freely available by Tilburg University’s ILK lab. Tadpole allowed us to do three things: to distinguish meaningful word units from punctuation in the text (tokenizing); to tag these units according to their part of speech, such as noun, verb, or adjective (PoS tagging); and to group related forms under a single heading, as a dictionary does (lemmatizing), for instance a verb with its present-tense and past-tense forms, or a noun with its plural and diminutive forms.

All this information can be found in raw format in the SUBTLEX-NL master files, and a detailed description of the linguistic interpretation of the PoS tags can be found here (document in Dutch).

For psycholinguists working in domains such as visual word recognition, who are more accustomed to working with surface forms (letter strings), we have integrated part of the information from the master files directly in the lists with surface form frequencies. In addition to the lemma frequency, which was already available in the original version of these files, we have now added the frequencies with which each letter string occurs as a particular part of speech. The figure below shows the full frequency information for the letter string “spel”.

Figure 1: Frequency information for the letter string "spel"

To understand these frequency measures correctly, it is important to grasp what is meant by lemma in this context. Like an entry in a dictionary, a lemma groups several word forms under a single heading. Let us look at the letter string “spel”. In the SUBTLEX-NL.master.txt file, we find this letter string under four different entries (see Figure 2). This means that the letter string “spel” occurs as a word form of four different lemmas.

Tadpole distinguishes the following parts of speech, or lemma types:

  • Noun (N)
  • Verb (WW)
  • Adjective (ADJ)
  • Adverb (BIJW)
  • Numeral (TW)
  • Pronoun (VNW)
  • Determiner (LID)
  • Preposition (VZ)
  • Conjunction (VG)
  • Interjection (TSW)
  • Special (SPEC; these are often personal or geographical names)

Figure 2: Information for the letter string "spel" in the SUBTLEX-NL.master.txt file

The first lemma is the verb “spellen” (to spell). The code in the POS column is WW, short for “werkwoord” (verb). This lemma groups seven different wordforms, one of which corresponds to the letter string “spel”. The codes in the SubPOS column give more information about the wordform (pv: persoonsvorm [finite verb form]; tgw: tegenwoordig [present]; ev: enkelvoud [singular]). In short, “spel” occurs as the first person singular form of the verb “spellen” with a frequency of 91 in the SUBTLEX-NL corpus. Summing the frequencies of all the wordforms of the lemma “spellen” gives a lemma frequency of 477.

The second lemma is the noun “spel” (game), indicated by the POS code N (Noun). The letter string “spel” is listed twice here, once as a singular noun with neuter gender, with frequency 4054, and once as a singular noun with specific gender, with frequency 2. (Note that the exhaustive interpretation of the SubPOS codes can be found here; the document is in Dutch.)

Finally, the letter string “spel” is shown under two other lemmas. Since it is difficult to imagine that “spel” occurs as an adjective (ADJ) in Dutch, this is probably a tagging error by Tadpole. Its very low frequency shows that this error is rare. The last entry has a “special” (SPEC) code, meaning that no other PoS code was appropriate. This code is mostly used for names. There are 9 occurrences of the letter string as part of a name (similar to “New” in “New York”) and 2 occurrences as a foreign (non-Dutch) form. The FREQlow column also shows that only 1 of the 11 occurrences was a lowercase form, meaning that the other 10 occurrences were capitalized, which is typical for names.

The PoS-related information that the SUBTLEX-NL.master.txt file (Figure 2) gives us for the letter string “spel” is summarized as follows in the SUBTLEX-NL.txt file (Figure 1):

  • The letter string “spel” occurs most frequently as a noun (dominant.pos).
  • The frequency of “spel” as a noun is the sum of the entries corresponding to the letter string grouped by the noun lemma “spel” (4054+2=4056) (dominant.pos.frequency).
  • The lemma frequency for the dominant PoS is the sum of the frequencies of all the wordforms grouped under that lemma: 5235 (dominant.pos.lemma.frequency).
  • The total lemma frequency of “spel” is the sum of the frequencies given for each lemma: 477+5235+1+11=5724 (FREQLemma).
  • All the parts of speech, in decreasing order of frequency, are .N.WW.SPEC.ADJ. (all.pos)
  • The frequencies of the letter string for each PoS are .4056.91.11.1. (all.pos.freq)
  • The lemma frequencies of the letter string for each PoS are .5235.477.11.1. (all.pos.lemma.freq)
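
The arithmetic in the list above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used: the function name, the tuple layout, and the assumption that the letter string belongs to a single lemma per PoS (true for “spel”) are all ours, and the two SPEC rows are assumed to be the 9 name occurrences and the 2 foreign-form occurrences mentioned above.

```python
from collections import defaultdict

# Entries for the letter string "spel", transcribed from the text above:
# (PoS code, word-form frequency, lemma frequency of the owning lemma).
# The tuple layout is an assumption made for this sketch.
SPEL_ENTRIES = [
    ("WW", 91, 477),     # verb "spellen"
    ("N", 4054, 5235),   # noun "spel", neuter
    ("N", 2, 5235),      # noun "spel", other gender tag
    ("ADJ", 1, 1),       # probable tagging error
    ("SPEC", 9, 11),     # part of a name
    ("SPEC", 2, 11),     # foreign form
]


def summarize(entries):
    """Derive the summary fields for one letter string.

    Simplifying assumption: at most one lemma per PoS for this letter string.
    """
    form_freq = defaultdict(int)
    lemma_freq = {}
    for pos, freq, lemma in entries:
        form_freq[pos] += freq       # sum word-form rows sharing a PoS
        lemma_freq[pos] = lemma      # the lemma frequency repeats on each row

    # Order the PoS codes by decreasing letter-string frequency.
    ranked = sorted(form_freq, key=lambda p: form_freq[p], reverse=True)
    dominant = ranked[0]

    def dotted(values):
        # Same dot-delimited layout as in the list above.
        return "." + ".".join(str(v) for v in values) + "."

    return {
        "dominant.pos": dominant,
        "dominant.pos.frequency": form_freq[dominant],
        "dominant.pos.lemma.frequency": lemma_freq[dominant],
        "FREQLemma": sum(lemma_freq.values()),
        "all.pos": dotted(ranked),
        "all.pos.freq": dotted(form_freq[p] for p in ranked),
        "all.pos.lemma.freq": dotted(lemma_freq[p] for p in ranked),
    }


summary = summarize(SPEL_ENTRIES)
for name, value in summary.items():
    print(name, value)
```

Run on the “spel” entries, the sketch reproduces the values listed above, which is a convenient check that the fields are being read as intended.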

This information allows you to limit your selection to, for instance, letter strings whose dominant PoS is noun, and further to those for which the most frequent PoS of the lemma is a noun as well. In this way, you can prune your stimulus list and avoid words that have unintended frequencies (e.g., words that are often used as a name, or that are low-frequency forms of verbs with a high lemma frequency).
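
As a rough sketch of such a selection, the Python below parses the dot-delimited fields of one row and applies a noun filter. The field names are those described above; the helper names, the dictionary representation of a row, and the `max_other_share` threshold (with its default of 0.05) are hypothetical choices for illustration, not part of SUBTLEX-NL.

```python
def parse_dotted(field):
    """Split a field such as '.N.WW.SPEC.ADJ.' into ['N', 'WW', 'SPEC', 'ADJ']."""
    return [part for part in field.strip().split(".") if part]


def pos_profile(row):
    """Map each PoS code to (letter-string frequency, lemma frequency)."""
    tags = parse_dotted(row["all.pos"])
    freqs = [int(v) for v in parse_dotted(row["all.pos.freq"])]
    lemma_freqs = [int(v) for v in parse_dotted(row["all.pos.lemma.freq"])]
    return {tag: (f, lf) for tag, f, lf in zip(tags, freqs, lemma_freqs)}


def is_clean_noun(row, max_other_share=0.05):
    """Keep a letter string only if it is predominantly a noun.

    max_other_share is a hypothetical cut-off: the largest share of
    occurrences allowed to carry a non-noun tag (names, verb forms, ...).
    """
    if row["dominant.pos"] != "N":
        return False
    profile = pos_profile(row)
    if "N" not in profile:
        return False
    # The noun lemma should also have the highest lemma frequency.
    top_lemma_pos = max(profile, key=lambda tag: profile[tag][1])
    if top_lemma_pos != "N":
        return False
    total = sum(freq for freq, _ in profile.values())
    other = total - profile["N"][0]
    return other / total <= max_other_share


# The row for "spel", with the values given in the text above.
SPEL_ROW = {
    "dominant.pos": "N",
    "all.pos": ".N.WW.SPEC.ADJ.",
    "all.pos.freq": ".4056.91.11.1.",
    "all.pos.lemma.freq": ".5235.477.11.1.",
}

print(is_clean_noun(SPEL_ROW))
```

For “spel”, 103 of the 4159 tagged occurrences are non-noun, so the string passes the default cut-off but would be rejected with a stricter one such as 0.01. The appropriate threshold depends on the experiment.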
