duometer

Duometer is a command-line tool for efficiently identifying near-duplicate pairs of documents in large collections of texts. It works on any platform with a Java runtime installed.

The program can be downloaded here.

You can read a tutorial illustrating how to use duometer.

More information and the source code are available on GitHub.

Features

  • Efficiently finds pairs of documents that contain similar text.
  • Works well with very large collections of documents.
  • Can use multiple CPU cores.
  • The default settings should work well in most cases, but they can be customized for your particular purposes.
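
To make the idea of a "near-duplicate pair" concrete, here is a minimal Python sketch that compares documents by the Jaccard similarity of their word shingles. It is an illustration only and not duometer's implementation or interface: the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for this example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Yield (id1, id2, similarity) for every pair at or above threshold.

    Hypothetical helper for illustration; it compares every pair directly.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            yield id1, id2, sim


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "an entirely different sentence about something else",
}
pairs = list(near_duplicate_pairs(docs))
print(pairs)  # [('a', 'b', 0.75)]
```

Comparing every pair directly, as this sketch does, takes time quadratic in the number of documents, so it is practical only for small collections. That cost is the reason a dedicated tool is useful for large ones.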

If you have any questions, you can contact Paweł Mandera.
