How to deal with zero word frequencies?

While making and comparing word frequency lists, we were often confronted with the question of what to do with words that do not occur in a corpus. Giving these words a frequency of 0 did not seem correct, and it also caused mathematical problems (the logarithm of 0, for instance, is undefined). Rather than simply picking one option, we decided to run some tests to see what worked well. As it happened, the simplest transformation, the Laplace transformation, turned out to be the best choice. You can find our conclusions in Brysbaert & Diependaele (Behavior Research Methods, 2013).
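To make the idea concrete, here is a minimal sketch of a Laplace transformation in Python. It assumes the usual add-one formulation: 1 is added to every count and the number of word types is added to the corpus size. The function name `laplace_frequencies` and the toy corpus are ours, chosen for illustration only; please check the paper for the exact formula we recommend.

```python
from collections import Counter

def laplace_frequencies(tokens, vocabulary):
    """Return add-one (Laplace) relative frequencies for every word in the vocabulary.

    Hypothetical helper for illustration; not the code used for the paper.
    """
    counts = Counter(tokens)
    # Make sure every word seen in the corpus is part of the vocabulary.
    vocab = set(vocabulary) | set(counts)
    n_tokens = sum(counts.values())
    n_types = len(vocab)
    # Add 1 to each count and the number of types to the corpus size.
    return {w: (counts[w] + 1) / (n_tokens + n_types) for w in vocab}

tokens = "the cat sat on the mat".split()
vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]  # "dog" is not in the corpus
freqs = laplace_frequencies(tokens, vocabulary)
print(freqs["dog"])  # small but no longer zero
print(freqs["the"])
```

With this transformation the unseen word gets a small positive frequency, so its logarithm can be taken, and the frequencies still sum to 1.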

As part of our efforts, Kevin Diependaele wrote a Python routine for the Good-Turing algorithm, which you can download in zip format or in tar.gz format. This text explains how to run the programs.
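For readers who only want the gist of Good-Turing, the sketch below shows the basic, unsmoothed estimates. It is not Kevin Diependaele's routine; the function name `good_turing_sketch` and the toy counts are assumptions made for illustration. The probability mass reserved for unseen words is estimated as N1 / N (the number of words seen once divided by the number of tokens), and a word seen r times gets the adjusted count (r + 1) * N(r+1) / N(r), where N(r) is the number of word types seen exactly r times.

```python
from collections import Counter

def good_turing_sketch(counts):
    """Basic Good-Turing estimates from a non-empty {word: count} mapping.

    Illustrative only: returns the estimated probability mass of unseen
    words and the adjusted count for each observed count r.
    """
    n_tokens = sum(counts.values())
    # freq_of_freq[r] = number of word types that occur exactly r times.
    freq_of_freq = Counter(counts.values())
    unseen_mass = freq_of_freq[1] / n_tokens
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # Without smoothing the estimate breaks down when no word occurs
        # r + 1 times; we mark those cases with None instead of guessing.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else None
    return unseen_mass, adjusted

counts = {"the": 3, "cat": 2, "dog": 1, "sat": 1, "mat": 1}
unseen_mass, adjusted = good_turing_sketch(counts)
print(unseen_mass)
print(adjusted)
```

The None values show why practical implementations first smooth the N(r) values (as in the Simple Good-Turing method of Gale & Sampson), and why the much simpler Laplace transformation is attractive.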

In the coming months we will update our frequency lists and interactive websites with the corrected frequencies, so that zero word frequencies should become a thing of the past.
