Fast computation of average Levenshtein distances in Python
averageLD.py is a command-line Python program for the fast computation of average Levenshtein distances. It is especially useful for computing the OLD20 measure (the average Levenshtein distance of the 20 nearest orthographic neighbors; see also Yarkoni, Balota & Yap, 2008). It is orders of magnitude faster than Tal Yarkoni’s LDCalc program.
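To make the definition concrete, here is a minimal pure-Python sketch of the measure. It is not the code of averageLD.py, which relies on the python-levenshtein library for speed. The function names `levenshtein` and `average_ld` are hypothetical. Leaving the target string out of its own neighborhood is also an assumption, made here so that a word found in the lexicon does not contribute a distance of 0.

```python
import heapq


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def average_ld(target, lexicon, k=20):
    """Mean Levenshtein distance from target to its k nearest lexicon entries."""
    # Assumption: the target itself is not counted as one of its neighbors.
    distances = (levenshtein(target, word) for word in lexicon if word != target)
    nearest = heapq.nsmallest(k, distances)
    if not nearest:
        raise ValueError("lexicon contains no neighbors for %r" % (target,))
    return sum(nearest) / float(len(nearest))
```

With k=20 this corresponds to OLD20. The sketch compares the target against every lexicon entry, so it is only practical for small lexicons.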
As a reference case, the OLD20 measures for every string in the Lexique database were computed on a dual-core 2 GHz Mac mini. The entire database (142,693 items) was processed using itself as a reference lexicon in just under 3 hours. On average, over 13 OLD20 values were computed per second.
The program depends on the python-levenshtein library, which should be downloaded and installed before use.
This program is provided to the community as is. It’s not hard to use, but some knowledge of command line programs is needed. No warranty is given. Comments are welcome and should be addressed to emmanuel.keuleers@ugent.be.
Downloading
The program can be downloaded here.
Usage
- Make sure python≥2.6 and python-levenshtein are installed!
- Make an inputfile containing the strings for which you want to compute OLD20. Each string should be on a separate line.
- Make a file containing your lexicon (all possible neighbors). Each string should be on a separate line.
- Put both files in the averageLD folder.
- Change to the folder (cd) and type the command-line instruction. For the average Levenshtein distance of the 20 nearest neighbors, type: python averageLD.py -f INPUTFILE -l LEXICONFILE -k 20 -o OUTPUTFILE
- If your files are not in UTF-8 encoding, you can specify another encoding using the --encoding option on the command line.