Executive Summary

The data processing pipeline can be instructed to fetch 3rd party content, such as external web sites accessible over HTTP(S).

Architecture

Data Processing

The content-augmentation step is used to fetch 3rd party content.

3rd party content is fetched from the URL given in the link attribute of an item sent to the data sink. The fetched content is then used to set the item's body attribute.

By default, the processing step is enabled but configured not to fetch 3rd party content.
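The behaviour described in this section can be modelled in a few lines of plain Python. This is an illustrative sketch only, not the actual implementation of the step: `DEFAULT_STEP_CONFIG`, `augment_item`, and the injected `fetch` callable are hypothetical names. Only the `enabled` and `fetch_link_content` keys and the `link` and `body` attributes are taken from this page. Whether an existing body is overwritten is not stated here; the sketch simply overwrites it.

```python
from typing import Callable, Dict, Optional

# Assumed defaults, mirroring the text: step enabled, link fetching off.
DEFAULT_STEP_CONFIG = {'enabled': True, 'fetch_link_content': False}


def augment_item(item: Dict[str, str],
                 step_config: Optional[Dict[str, bool]] = None,
                 fetch: Optional[Callable[[str], str]] = None) -> Dict[str, str]:
    """Return a copy of item with body set from link, if so configured."""
    # Explicit settings override the assumed defaults.
    config = {**DEFAULT_STEP_CONFIG, **(step_config or {})}
    result = dict(item)
    link = result.get('link')
    if config['enabled'] and config['fetch_link_content'] and link and fetch:
        result['body'] = fetch(link)
    return result


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP(S) request so the sketch runs offline.
    return 'content of ' + url


item = {'link': 'http://www.example.com', 'title': 'Item 01'}
# With the defaults, the item passes through without a body.
unchanged = augment_item(item, fetch=fake_fetch)
# With fetch_link_content enabled, the body is set from the link.
augmented = augment_item(item, {'fetch_link_content': True}, fake_fetch)
```

Passing the fetcher as a parameter keeps the sketch testable without network access; a real fetcher would perform the HTTP(S) request.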

Examples

Python SDK

The following examples reference the Python SDK.

Item Uploader

The following example details how to enable 3rd party content fetching.

from squirro_client import ItemUploader
# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}
uploader = ItemUploader(..., processing_config=processing_config)
# item with a link attribute
items = [
    {
        'link': 'http://www.example.com',
        'title': 'Item 01',
    },
]
uploader.upload(items)

In the example above the processing pipeline is instructed to fetch the content from the site http://www.example.com and use it as the item body.

New Data Source

The following example details how to enable 3rd party content fetching for a new feed data source.

from squirro_client import SquirroClient
client = SquirroClient(...)
# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}
# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}
# create new source subscription
client.new_subscription(
    project_id='...', object_id='default', provider='feed', config=config)

Existing Data Source

The following example details how to enable 3rd party content fetching for an existing source. Items which have already been processed are not updated.

from squirro_client import SquirroClient
client = SquirroClient(...)
# get existing source configuration (including processing configuration)
source = client.get_project_source(project_id='...', source_id='...')
config = source.get('config', {})
# read from the same 'processing' key that is written back below
processing_config = config.get('processing', {})
# modify processing configuration
processing_config['content-augmentation'] = {
    'enabled': True, 'fetch_link_content': True,
}
config['processing'] = processing_config
client.modify_project_source(project_id='...', source_id='...', config=config)

In the example above the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body.

This page can now be found at Content Augmentation on the Squirro Docs site.
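For the existing data source case, the configuration change can be factored into a small pure function that is testable without a server. This is a sketch under assumptions: `enable_link_fetching` is a hypothetical helper, not part of the Python SDK, and it assumes the processing settings live under the `processing` key, as in the new data source example. The `some-other-step` entry is made up to show that unrelated settings are left alone.

```python
import copy
from typing import Any, Dict


def enable_link_fetching(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a source config with link fetching switched on."""
    updated = copy.deepcopy(config)
    # Create the nested dictionaries only if they are missing.
    processing = updated.setdefault('processing', {})
    step = processing.setdefault('content-augmentation', {})
    step.update({'enabled': True, 'fetch_link_content': True})
    return updated


# Hypothetical existing configuration with another, made-up step name.
existing = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': {'some-other-step': {'enabled': False}},
}
new_config = enable_link_fetching(existing)
```

The returned dictionary could then be passed as `config` to `client.modify_project_source(...)`, as in the example above. Working on a deep copy leaves the dictionary fetched from the server untouched if the update is abandoned.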
