duometer
Duometer is a command-line tool for efficiently identifying near-duplicate pairs of documents in large collections of texts. It works on any platform with a Java runtime installed.
The program can be downloaded here.
You can read a tutorial illustrating how to use duometer.
More information and the source code are available on GitHub.
Features
- Efficiently finds pairs of documents that contain similar text.
- Works well with very large collections of documents.
- Can use multiple CPU cores.
- The default settings should work well in most cases, but they can be customized for your particular purposes.
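
To illustrate what a "near-duplicate pair" means, here is a minimal Python sketch that scores documents by the Jaccard similarity of their word shingles. It is an illustration only and does not show duometer's implementation or command-line interface; the function names, the shingle size `k`, and the `threshold` value are all assumptions made for this example. It compares every pair of documents directly, which becomes impractical as a collection grows.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Compare every pair of documents and keep those at or above threshold.

    `threshold` and `k` are hypothetical values chosen for illustration.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


# Hypothetical sample documents.
docs = {
    "a.txt": "the quick brown fox jumps over the lazy dog",
    "b.txt": "the quick brown fox jumps over the lazy cat",
    "c.txt": "an entirely different sentence about something else",
}
print(near_duplicate_pairs(docs))
```

With the sample data above, only `a.txt` and `b.txt` are reported, with a score of 0.75: they share six of their eight distinct three-word shingles.
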
If you have any questions, you can contact Paweł Mandera.