Processing Config
Built-in enrichments are configured with the processing config. Use it to enable or disable an enrichment, or to pass additional configuration to a step. You can also use the processing config to mark a source or project for processing by the new Squirro Pipeline 2.0.
There are three places where this configuration can be specified:
- Per source / subscription: when creating a new subscription, the processing instructions can be passed in to fine-tune the behavior for that one source.
- Per project: a project also has a processing config, which applies to all items coming in for a project.
- Within Data Loader using the argument --source-config-file.
Source processing config
To set up a processing configuration, specify the processing field in a source's config. The value of that field is itself a dictionary, with the enrichment names as keys.
Enrichments which can be specified include:
Processing Step | Documentation Link |
---|---|
unshorten-link | |
deduplication | Duplicate Detection |
content-augmentation | |
content-conversion | Content Extraction |
language-detection | Language Detection |
boilerplate-removal | Noise Removal |
nearduplicate-detection | Near-Duplicate Detection |
webshot | Thumbnail Extraction |
filtering | Filtering |
For example, to set up a Twitter source with duplicate detection disabled, use the following configuration:
    {
        "query": "Squirro",
        "processing": {
            "deduplication": {
                "enabled": false
            }
        }
    }
Using the Python SDK a subscription for this could be created with the following code snippet:
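    from squirro_client import SquirroClient

    # Connect to the cluster and authenticate with a refresh token.
    client = SquirroClient(None, None, cluster='https://demo.squirro.net/')
    client.authenticate(refresh_token='293d…a13b')

    # project_id must already hold the ID of the target project.
    client.new_subscription(
        project_id, object_id='default', provider='twitter',
        config={
            'query': 'Squirro',
            'processing': {
                'deduplication': {
                    'enabled': False
                }
            }
        })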
The enabled property is available for every built-in enrichment and can be set to true or false. Some of the enrichments have additional configuration options, which are described on the corresponding page.
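To toggle several enrichments at once, the processing dictionary can be assembled programmatically. The following is a minimal sketch, not part of the Squirro SDK: `build_processing_config` and `KNOWN_STEPS` are hypothetical names introduced here for illustration, and the step names are copied from the table above. It only builds the dictionary and makes no calls to a Squirro cluster.

```python
import json

# Step names copied from the table of built-in enrichments above.
KNOWN_STEPS = {
    "unshorten-link", "deduplication", "content-augmentation",
    "content-conversion", "language-detection", "boilerplate-removal",
    "nearduplicate-detection", "webshot", "filtering",
}


def build_processing_config(toggles):
    """Build a processing dictionary from a {step name: bool} mapping.

    Hypothetical helper for illustration; it rejects step names that
    are not in the table so that typos are caught early.
    """
    unknown = sorted(set(toggles) - KNOWN_STEPS)
    if unknown:
        raise ValueError("Unknown processing step(s): " + ", ".join(unknown))
    return {step: {"enabled": bool(flag)} for step, flag in toggles.items()}


source_config = {
    "query": "Squirro",
    "processing": build_processing_config({
        "deduplication": False,
        "webshot": False,
        "language-detection": True,
    }),
}

# Python's True/False are written as true/false in the JSON form.
print(json.dumps(source_config, indent=2))
```

The resulting `source_config` dictionary has the same shape as the config passed to `new_subscription` in the example above.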
To mark a source for processing by the new Squirro Pipeline 2.0 (available starting with Squirro Version 2.5.1):
    from squirro_client import SquirroClient

    client = SquirroClient(None, None, cluster='https://demo-25.squirro.net/')
    client.authenticate(refresh_token='293d…a13b')

    # The 'pipeline' key marks this source for the new pipeline.
    client.new_subscription(
        project_id, object_id='default', provider='bulk',
        config={
            'pipeline': 'ingester',
            'name': 'large_subscription_1',
            'ext_id': 'large_subscription_1_id'
        })
Project processing config
Please contact the Squirro team if you want to use project processing configs.