Downloading

Different files are available for different purposes. See this post for more information on working with the PoS information in SUBTLEX-NL.

The most recent Excel files with PoS information and Zipf frequencies

  • Word frequency files on OSF: one file with all words observed in the corpus (437K) and one file with all words observed in at least 2 films (150K). The latter is more useful for most searches because it contains less noise.
  • The Zipf frequencies are based on the equation Zipf = LOG10((frequency + 1) / 44.106) + 3.
  • For words not present in the database (i.e., with zero frequency), the Zipf value is Zipf = LOG10(1 / 44.106) + 3 = 1.3555.
  • Information on why you should use Zipf frequencies.
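
If you need Zipf values for your own frequency counts, the equation above can be sketched in Python as follows. This is only a minimal illustration, not code distributed with SUBTLEX-NL: the function name `zipf` and the constant name `DENOMINATOR` are made up for this example, and the code assumes that `frequency` is the raw count of the word in the corpus, as in the equation.

```python
import math

# Constant taken from the equation above. The name is hypothetical.
DENOMINATOR = 44.106


def zipf(frequency: int) -> float:
    """Return the Zipf value for a raw frequency count.

    Implements Zipf = LOG10((frequency + 1) / 44.106) + 3.
    A word that is absent from the database has frequency 0.
    """
    if frequency < 0:
        raise ValueError("frequency must be zero or positive")
    return math.log10((frequency + 1) / DENOMINATOR) + 3


# A word with zero frequency gets the floor value given above.
print(round(zipf(0), 4))  # 1.3555
```

Because 1 is added to the count before the logarithm is taken, the function is defined for words that do not occur in the database at all, so such words can be handled in the same way as all others.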

Letter strings with a lemma contextual diversity above 2 (134,723 entries).

All letter strings (437,503 entries).

Lemmas and wordforms with a lemma contextual diversity above 2, automatically POS-tagged (89,564 lemmas, 182,099 wordforms).

All lemmas and wordforms, automatically POS-tagged (446,488 lemmas, 554,339 wordforms).
