Processing Config

Built-in enrichments are configured through the processing config. It can be used to enable or disable individual enrichments and to pass additional configuration to a step. The processing config can also mark a source or project for processing by the new Squirro Pipeline 2.0.

There are three places where this configuration can be specified:

Per source / subscription: when creating a new subscription, processing instructions can be passed in to fine-tune the behavior for that one source.
Per project: a project also has a processing config, which applies to all items coming into that project.
In the Data Loader: using the --source-config-file argument.

Source processing config

To set up a processing configuration, specify the processing field in the source's config. The value of that field is a dictionary keyed by enrichment name; each value is itself a dictionary of options for that enrichment.

Enrichments which can be specified include:

Processing Step	Documentation Link
unshorten-link	Unshorten Link
deduplication	Duplicate Detection
content-augmentation	Content Augmentation
content-conversion	Content Extraction
language-detection	Language Detection
boilerplate-removal	Noise Removal
nearduplicate-detection	Near-Duplicate Detection
webshot	Thumbnail Extraction
filtering	Filtering

For example, to set up a Twitter source with duplicate detection disabled, use the following configuration:

{
    "query": "Squirro",
    "processing": {
        "deduplication": {
            "enabled": false
        }
    }
}
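
The same configuration can be handled as a Python dictionary. The following is a minimal sketch using only Python's standard json module (it does not call any Squirro API); it shows that the JSON literal false corresponds to Python's False, which matters when moving a configuration between a JSON file and SDK code. The variable names are illustrative only.

```python
import json

# The source configuration from the example above, as JSON text.
config_json = """
{
    "query": "Squirro",
    "processing": {
        "deduplication": {
            "enabled": false
        }
    }
}
"""

# Parse the JSON text into a Python dictionary.
config = json.loads(config_json)

# JSON false becomes Python False after parsing.
dedup_enabled = config["processing"]["deduplication"]["enabled"]

# Serializing the dictionary again yields JSON with lowercase false.
serialized = json.dumps(config, indent=4)
print(serialized)
```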

Using the Python SDK, a subscription with this configuration could be created with the following code snippet:

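from squirro_client import SquirroClient

# project_id must hold the ID of an existing project.
client = SquirroClient(None, None, cluster='https://demo.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# Create a Twitter subscription with duplicate detection disabled.
client.new_subscription(project_id, object_id='default', provider='twitter',
    config={
        'query': 'Squirro',
        'processing': {
            'deduplication': {
                'enabled': False
            }
        }
    })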

The enabled property is available for every built-in enrichment and can be set to true or false. Some enrichments have additional configuration options, which are described on the corresponding documentation page.
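
When several enrichments need to be switched on or off, the processing dictionary can be assembled programmatically. The sketch below is only an illustration: build_processing_config and KNOWN_STEPS are hypothetical names, not part of the Squirro SDK, and the set of step names is limited to those in the table above, which may not be exhaustive.

```python
# Processing step names taken from the table of built-in enrichments.
# Assumption: this set mirrors the table only and may be incomplete.
KNOWN_STEPS = {
    "unshorten-link",
    "deduplication",
    "content-augmentation",
    "content-conversion",
    "language-detection",
    "boilerplate-removal",
    "nearduplicate-detection",
    "webshot",
    "filtering",
}


def build_processing_config(enabled=(), disabled=()):
    """Hypothetical helper: build the 'processing' dictionary of a source config."""
    processing = {}
    flags = [(step, True) for step in enabled] + [(step, False) for step in disabled]
    for step, flag in flags:
        # Reject names that are not in the table, to catch typos early.
        if step not in KNOWN_STEPS:
            raise ValueError(f"Unknown processing step: {step}")
        processing[step] = {"enabled": flag}
    return processing


# Source config with two enrichments disabled and one explicitly enabled.
config = {
    "query": "Squirro",
    "processing": build_processing_config(
        enabled=["language-detection"],
        disabled=["deduplication", "webshot"],
    ),
}
```

The resulting config dictionary has the same shape as the one passed to new_subscription in the example above. Validating step names against a fixed set is a design choice for catching misspellings; drop the check if other enrichment names are in use.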

To mark a source for processing by the new Squirro Pipeline 2.0 (available starting with Squirro version 2.5.1), set the pipeline option in the source's config:

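from squirro_client import SquirroClient

# project_id must hold the ID of an existing project.
client = SquirroClient(None, None, cluster='https://demo-25.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# The 'pipeline' entry marks this source for the new pipeline.
client.new_subscription(project_id, object_id='default', provider='bulk',
    config={
        'pipeline': 'ingester',
        'name': 'large_subscription_1',
        'ext_id': 'large_subscription_1_id'
    })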

Project processing config

Please contact the Squirro team if you want to use project processing configs.