The deduplication enrichment prevents duplicate items from entering the index. Items that have already been seen are rejected and not indexed again.
| Enrichment name | deduplication |
|---|---|
| Stage | deduplication |
| Enabled by default | Yes |
Overview
Items sent to Squirro come from different sources. Those sources sometimes send their data multiple times or, in the case of Twitter, mention the same site. To avoid having duplicates in the index, Squirro tries to detect identical items (duplicates) and nearly identical items (near-duplicates). This page describes duplicate detection; near-duplicate detection is a separate enrichment.
Duplicate handling has two parts: the first is detection, the second is the action (policy) taken upon detection.
Detection
Squirro first searches for an item with the same external_id (see Item Format). If no previous item with the same external_id exists, the deduplication enrichment then searches the index for a document with the same title and link.
Items from a bulk source (e.g. created by the ItemUploader or DocumentUploader) are considered private. To avoid data leakage, the deduplication only looks at items from the same source when searching for duplicates.
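As a sketch, the lookup order described above can be modeled as follows. This is hypothetical code, not Squirro's implementation: the search_index callable stands in for the actual index query.

```python
# Field groups tried in order, mirroring the default deduplication_fields.
DEDUPLICATION_FIELDS = [
    ["external_id"],
    ["title", "link"],
]

def find_duplicate(item, search_index):
    """Return the first indexed item matching any field group, or None.

    `search_index` is a stand-in for the real index lookup: it takes a
    dict of field -> value and returns a matching item or None.
    """
    for field_group in DEDUPLICATION_FIELDS:
        query = {field: item.get(field) for field in field_group}
        # All fields of a group must be present on the incoming item.
        if any(value is None for value in query.values()):
            continue
        match = search_index(query)
        if match is not None:
            # Processing aborts on the first match.
            return match
    return None
```

With the default groups, an external_id match wins before the title/link combination is ever consulted.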
Once an item is considered to be a duplicate, one of three policies is applied: the item can be associated, replaced, or updated. For details on the policies see the configuration below.
| Policy | Default | Description |
|---|---|---|
| associate | Default for user-configurable sources (e.g. feed, Twitter) | The item that is already in the Squirro index gets the additional source_id in its list of sources and the incoming item is discarded. |
| replace | The ItemUploader sets this policy by default | The existing item in Squirro is deleted and the incoming item is processed as is. This is mostly used for sources which deliver growing documents, such as support cases that are amended by additional comments. |
| update | - | The existing item in Squirro is updated with all the fields present in the incoming item. The values are replaced, so if the keywords property is set, all existing keywords are overwritten. |
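The effect of each policy on a detected duplicate can be sketched with plain dictionaries. This is a hypothetical helper, not the actual enrichment code:

```python
def apply_policy(policy, existing, incoming, source_id):
    """Apply one of the three deduplication policies and return the
    item that remains in the index. Hypothetical sketch using dicts."""
    if policy == "associate":
        # Keep the indexed item and only record the additional source;
        # the incoming item is discarded.
        if source_id not in existing["sources"]:
            existing["sources"].append(source_id)
        return existing
    if policy == "replace":
        # Drop the indexed item; the incoming item is processed as is.
        return incoming
    if policy == "update":
        # Overwrite fields present on the incoming item. Values are
        # replaced wholesale, e.g. existing keywords are overwritten.
        for key, value in incoming.items():
            if key not in ("id", "sources"):
                existing[key] = value
        return existing
    raise ValueError("unknown policy: %s" % policy)
```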
Configuration
| Field | Default | Description |
|---|---|---|
| deduplication_fields | [["external_id"], ["title", "link"]] | Which fields (see Item Format) to consider when searching for duplicates. Each sub-list is looked up individually and once a match has been found, processing aborts. If a sub-list contains more than one field, all of those fields need to match for an item to be considered a duplicate. |
| policy | associate | The policy decides which action is taken on duplicates: associate, replace, or update (see the policy descriptions above). These policies are not fixed and can be configured on source creation. The ItemUploader sets the replace policy by default. |
| provider_support | twitter | Comma-separated list of providers that have support for comments. For providers that support comments, all new comments are merged into the old item. |
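For instance, deduplication_fields can be overridden in the processing_config to match on other fields. The single ['link'] field group below is an illustrative choice, not a recommended default:

```python
# Hypothetical override: deduplicate on the link field only. The
# structure mirrors the documented processing_config format.
processing_config = {
    'deduplication': {
        'enabled': True,
        'deduplication_fields': [
            ['link'],
        ],
        'policy': 'associate',
    },
}
```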
Examples
To change the default policy, a processing_config needs to be passed to the DocumentUploader or ItemUploader:
```python
processing_config = {
    'deduplication': {
        'enabled': True,
        'policy': 'replace',
    },
}
uploader = ItemUploader(processing_config=processing_config, token=...)
```
If the API is used directly, pass a processing_config in the config dictionary of the source.
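As an illustrative sketch of that nesting: only the processing_config and deduplication structure below follows this page; the outer 'config' key name is an assumption about the source payload, not the documented API:

```python
# Hypothetical nesting of processing_config inside a source's config
# dictionary when talking to the API directly.
source_payload = {
    'config': {
        'processing_config': {
            'deduplication': {
                'enabled': True,
                'policy': 'update',
            },
        },
    },
}
```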
- If a source sends in multiple duplicates within seconds, neither the order nor the full processing of all items is guaranteed.
- Additional items yielded by a pipelet will not be deduplicated, because the deduplication step occurs early in the processing pipeline.