The near-duplicate detection enrichment finds previous items that are very similar to the new item and links the items together. This reduces clutter in search results.
Enrichment name | nearduplicate-detection |
---|---|
Stage | processing |
Enabled by default | Yes, except for the bulk provider (affects items that are uploaded through the ItemUploader, DocumentUploader, File Importer, etc.) |
...
As a consequence, the near-duplicate detection enrichment can be more aggressive with its algorithm than the deduplication enrichment. Near-duplicate items are identified based on the title and body fields. A locality-sensitive hashing method is used to map similar items to the same hash values.
...
- Extract features, then calculate a 64-bit hash value and the corresponding permutations as described above.
- For each permutation, retrieve all stored permutations whose top n bit-positions match the top n bit-positions of the permutation.
- For each permutation identified above, calculate the Hamming distance and check whether the distance is below a configured threshold k. If that is the case, the current item is a near-duplicate candidate and could be folded away.
- For all items found above, check whether they were created within the cut-off window (72 hours by default) before or after the current item. The item created_at field is used for this comparison. If this requirement is met, the current item is folded away under the first matching item.
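The candidate lookup and Hamming distance check can be illustrated with a minimal Python sketch. This is not the enrichment's actual implementation: the permutations are modelled here as simple bit rotations, the stored hashes are scanned linearly rather than looked up in an index, and the names `rotate_left`, `top_bits`, `find_candidates`, as well as the values chosen for `n`, `k` and `rotations`, are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

HASH_BITS = 64
MASK = (1 << HASH_BITS) - 1


def rotate_left(value: int, shift: int) -> int:
    """Rotate a 64-bit value left (assumed stand-in for the real permutations)."""
    shift %= HASH_BITS
    return ((value << shift) | (value >> (HASH_BITS - shift))) & MASK


def top_bits(value: int, n: int) -> int:
    """Return the top n bit positions of a 64-bit value."""
    return value >> (HASH_BITS - n)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def find_candidates(
    new_hash: int,
    stored: Dict[str, int],          # hypothetical store: item id -> 64-bit hash
    n: int = 16,                     # assumed prefix length
    k: int = 3,                      # assumed distance threshold
    rotations: Sequence[int] = (0, 16, 32, 48),
) -> List[str]:
    """Return ids of stored items that are near-duplicate candidates."""
    candidates: List[str] = []
    for shift in rotations:
        probe = rotate_left(new_hash, shift)
        for item_id, stored_hash in stored.items():
            permuted = rotate_left(stored_hash, shift)
            # Only permutations sharing the top n bits are compared further.
            if top_bits(permuted, n) != top_bits(probe, n):
                continue
            # "Below a configured threshold k" is read as a strict comparison.
            if hamming_distance(probe, permuted) < k and item_id not in candidates:
                candidates.append(item_id)
    return candidates
```

Checking several permutations matters because two hashes that differ only in their top bits would never share a prefix in the unpermuted form. After a rotation, those differing bits move out of the prefix and the pair can be matched.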
Configuration
Field | Default | Description |
---|---|---|
cut_off_hours | 72 | Number of hours within which two items must have been created to be considered near duplicates. |
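As a rough illustration of how `cut_off_hours` affects folding, the following sketch applies the time window to a list of near-duplicate candidates and returns the first one that qualifies. The function name `first_match_in_window` and the `(item_id, created_at)` tuple layout are hypothetical, and treating the window boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def first_match_in_window(
    created_at: datetime,
    candidates: List[Tuple[str, datetime]],  # hypothetical (item_id, created_at) pairs
    cut_off_hours: int = 72,
) -> Optional[str]:
    """Return the id of the first candidate created within the cut-off window."""
    window = timedelta(hours=cut_off_hours)
    for item_id, candidate_created in candidates:
        # The window applies in both directions: before or after the new item.
        if abs(created_at - candidate_created) <= window:
            return item_id
    return None
```

With the default of 72 hours, a candidate created four days earlier is skipped. Raising `cut_off_hours` to 96 would let the new item fold under it.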
Examples
These examples use the Python SDK.
...