The content augmentation enrichment fetches additional content from 3rd party systems.

The data processing pipeline can be instructed to fetch 3rd party content. Examples include external web sites accessible via the HTTP(S) protocol.

Enrichment name	content-augmentation
Stage	content
Enabled by default	Yes

Architecture

...

Overview

The content-augmentation step is used to fetch 3rd party content.

3rd party content is fetched from the link attribute of an item sent to the data sink. The enrichment works in two steps. First, the content of all uploaded files is fetched; that content is also used to guess the MIME type of these files. After that, and only if configured to do so, the enrichment downloads the content referenced by the link attribute. The fetched content is used to set the body attribute.

By default the processing step is enabled, but configured to not fetch 3rd party content.
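
The effect of the link-fetching setting can be pictured with a small model. This is a sketch under stated assumptions, not the enrichment's actual implementation: `augment_item`, `fetcher` and `fake_fetcher` are hypothetical names, the uploaded-file step is left out, and the real step runs inside the processing pipeline rather than in client code.

```python
from typing import Callable


def augment_item(item: dict, fetch_link_content: bool = False,
                 fetcher: Callable[[str], str] = lambda url: '') -> dict:
    """Hypothetical model of the link step; not the pipeline's real code."""
    result = dict(item)  # leave the caller's item untouched
    link = result.get('link')
    # The link is only followed when fetching is explicitly configured.
    if fetch_link_content and link:
        result['body'] = fetcher(link)
    return result


# Stand-in fetcher so the sketch runs without network access.
def fake_fetcher(url: str) -> str:
    return 'content of ' + url


item = {'link': 'http://www.example.com', 'title': 'Item 01'}
print(augment_item(item, fetcher=fake_fetcher))
print(augment_item(item, fetch_link_content=True, fetcher=fake_fetcher))
```

With the default setting the item comes back without a body; once `fetch_link_content` is true, the body is filled from whatever the fetcher returns for the link.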

When link fetching is enabled, this step will often be combined with the Boilerplate Removal enrichment.

Configuration

Field	Description
fetch_link_content	Boolean value indicating whether to fetch the content from the web site referenced with the `link` attribute. Default: `false`.
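
As an illustration of this field, the following sketch builds the processing configuration fragment for the step. The helper name `content_augmentation_config` is hypothetical; only the `enabled` and `fetch_link_content` keys are taken from this page.

```python
def content_augmentation_config(fetch_link_content: bool = False) -> dict:
    """Build the processing config fragment for this step (hypothetical helper)."""
    return {
        'content-augmentation': {
            'enabled': True,
            # Defaults to False, matching the documented default.
            'fetch_link_content': fetch_link_content,
        },
    }


default_config = content_augmentation_config()
fetching_config = content_augmentation_config(fetch_link_content=True)
print(default_config)
print(fetching_config)
```

The first dictionary mirrors the default behaviour (step enabled, no link fetching); the second is the shape used in the examples that follow.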

Examples

The following examples all use the Python SDK to show how the content augmentation enrichment step can be used.

Item Uploader

The following example details how to enable 3rd party content fetching.

from squirro_client import ItemUploader

# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}

uploader = ItemUploader(…, processing_config=processing_config)

# item with a link attribute
items = [
    {
        'link': 'http://www.example.com',
        'title': 'Item 01',
    },
]
uploader.upload(items)

...

from squirro_client import SquirroClient
 
client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')
 
# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}
 
# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}
 
# create new source subscription
client.new_subscription(
    project_id='…', object_id='default', provider='feed', config=config)

...

from squirro_client import SquirroClient
 
client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')
 
# Get existing source configuration (including processing configuration)
source = client.get_project_source(project_id='…', source_id='…')
config = source.get('config', {})
# Read from the same 'processing' key that is written back below
processing_config = config.get('processing', {})

# Modify processing configuration
processing_config['content-augmentation'] = {
    'enabled': True,
    'fetch_link_content': True,
}
config['processing'] = processing_config
client.modify_project_source(project_id='…', source_id='…', config=config)

In the example above the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body.
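
When a source already carries other processing settings, it can be useful to switch on link fetching without touching them. The sketch below is one possible way to do that on a plain dictionary; `enable_link_fetching` and the `other-step` entry are hypothetical, and it assumes the source configuration keeps its processing settings under the `processing` key, as in the examples above.

```python
import copy


def enable_link_fetching(source_config: dict) -> dict:
    """Return a copy of a source config with link fetching switched on.

    Hypothetical helper; it only manipulates dictionaries and makes no API calls.
    """
    updated = copy.deepcopy(source_config)  # do not mutate the caller's config
    processing = updated.setdefault('processing', {})
    step = processing.setdefault('content-augmentation', {})
    step['enabled'] = True
    step['fetch_link_content'] = True
    return updated


# 'other-step' is a made-up entry standing in for any existing setting.
existing = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': {'other-step': {'enabled': False}},
}
print(enable_link_fetching(existing))
```

The returned dictionary could then be passed as `config` to `modify_project_source`, as in the example above.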
