...

The nearduplicate-detection step detects near-duplicate items. Near-duplicates are kept and shown to the user as "duplicates" in the interface. This contrasts with the deduplication enrichment, which discards new duplicates entirely and never shows them to users.

...

  1. Extract features from the title and body fields. This is done by removing HTML tags, applying a lowercase filter, and splitting the resulting text into tokens.
  2. The extracted features are used to calculate a 64-bit hash value according to Charikar's simhash technique [2].
  3. Several permutations of the calculated hash value are computed and stored.
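
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the step's actual implementation: the function names, the regex-based tag stripping, the use of BLAKE2b as the per-token hash, equal token weights, and the choice of bit rotations as the "permutations" are all assumptions made for the example.

```python
import hashlib
import re

HASH_BITS = 64
MASK = (1 << HASH_BITS) - 1


def extract_features(title, body):
    """Step 1: strip HTML tags, lowercase, and split into tokens."""
    # Assumption: a simple regex is enough to drop tags for this sketch.
    text = re.sub(r"<[^>]+>", " ", f"{title} {body}")
    return re.findall(r"\w+", text.lower())


def token_hash(token):
    """Hash one token to a 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens):
    """Step 2: Charikar-style simhash with every token weighted equally."""
    weights = [0] * HASH_BITS
    for token in tokens:
        h = token_hash(token)
        for bit in range(HASH_BITS):
            # Each token votes +1 for bits it sets and -1 for bits it clears.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << bit
    return value


def rotations(value, blocks=4):
    """Step 3: derive several bit permutations of the hash.

    Assumption: rotating by multiples of 64 / blocks bits stands in for
    whatever permutations the real step stores.
    """
    step = HASH_BITS // blocks
    return [
        ((value << (i * step)) | (value >> (HASH_BITS - i * step))) & MASK
        for i in range(blocks)
    ]


def hamming_distance(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    tokens = extract_features("<h1>Breaking News</h1>", "<p>Markets rallied today.</p>")
    fingerprint = simhash(tokens)
    print(f"{fingerprint:016x}", [f"{p:016x}" for p in rotations(fingerprint)])
```

Because similar token sets vote similarly on each bit, near-duplicate texts tend to produce hash values with a small Hamming distance. Storing several permutations lets different groups of bits act as the leading bits of a lookup key.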

Retrieval

To check whether an incoming item is a near-duplicate of an older item, the following steps are necessary:

...