Executive Summary
The data processing pipeline can be instructed to fetch 3rd party content. Examples include external web sites accessible via the HTTP(S) protocol.
Architecture
Data Processing
The content-augmentation step is used to fetch 3rd party content. The content is fetched from the URL in the link attribute of an item sent to the data sink, and the fetched content is used to set the item's body attribute.
By default the processing step is enabled, but configured to not fetch 3rd party content.
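The behavior described above can be modeled in a few lines. The sketch below is illustrative only and is not the pipeline's actual implementation: `augment_item`, `fake_fetch`, and `DEFAULT_STEP_CONFIG` are hypothetical names. Only the `link` and `body` attributes and the `enabled` and `fetch_link_content` options come from this page.

```python
from typing import Callable, Dict, Optional

# Assumed defaults, mirroring the text: the step is enabled but does not fetch.
DEFAULT_STEP_CONFIG = {"enabled": True, "fetch_link_content": False}


def augment_item(
    item: Dict[str, str],
    fetch: Callable[[str], str],
    step_config: Optional[Dict[str, bool]] = None,
) -> Dict[str, str]:
    """Return a copy of `item` with `body` set from the content behind `link`.

    Hypothetical model of the content-augmentation step; `fetch` stands in
    for whatever HTTP(S) client retrieves the 3rd party content.
    """
    config = {**DEFAULT_STEP_CONFIG, **(step_config or {})}
    result = dict(item)
    should_fetch = config["enabled"] and config["fetch_link_content"]
    if should_fetch and result.get("link"):
        result["body"] = fetch(result["link"])
    return result


def fake_fetch(url: str) -> str:
    # Stand-in fetcher so the sketch runs without network access.
    return f"<content of {url}>"


item = {"link": "http://www.example.com", "title": "Item 01"}

# With the assumed defaults, nothing is fetched and no body is set.
untouched = augment_item(item, fake_fetch)

# With fetching switched on, the body is filled from the link.
augmented = augment_item(item, fake_fetch, {"fetch_link_content": True})
```

Passing the fetcher in as a parameter keeps the sketch runnable offline. A real fetcher would also need to handle timeouts, redirects, and non-HTML responses.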
Examples
Python SDK
The following examples reference the Python SDK.
Item Uploader
The following example details how to enable 3rd party content fetching.
    from squirro_client import ItemUploader

    # processing config to fetch 3rd party content
    processing_config = {
        'content-augmentation': {
            'enabled': True,
            'fetch_link_content': True,
        },
    }

    uploader = ItemUploader(..., processing_config=processing_config)

    # item with a link attribute
    items = [
        {
            'link': 'http://www.example.com',
            'title': 'Item 01',
        },
    ]

    uploader.upload(items)
In the example above, the processing pipeline is instructed to fetch the content from the site http://www.example.com and use it as the item body.
New Data Source
The following example details how to enable 3rd party content fetching for a new feed data source.
    from squirro_client import SquirroClient

    client = SquirroClient(...)

    # processing config to fetch 3rd party content
    processing_config = {
        'content-augmentation': {
            'enabled': True,
            'fetch_link_content': True,
        },
    }

    # source configuration
    config = {
        'url': 'http://newsfeed.zeit.de/index',
        'processing': processing_config,
    }

    # create new source subscription
    client.new_subscription(
        project_id='...',
        object_id='default',
        provider='feed',
        config=config,
    )
Existing Data Source
The following example details how to enable 3rd party content fetching for an existing source. Items which have already been processed are not updated.
    from squirro_client import SquirroClient

    client = SquirroClient(...)

    # get existing source configuration (including processing configuration)
    source = client.get_project_source(project_id='...', source_id='...')
    config = source.get('config', {})
    # read the same 'processing' key that is written back below, so that
    # existing settings for other processing steps are preserved
    processing_config = config.get('processing', {})

    # modify processing configuration
    processing_config['content-augmentation'] = {
        'enabled': True,
        'fetch_link_content': True,
    }
    config['processing'] = processing_config

    client.modify_project_source(project_id='...', source_id='...', config=config)
In the example above, the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body.
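When updating an existing source, it can help to isolate the configuration change in a small helper that leaves all other settings untouched. The following is a minimal sketch: `enable_link_fetching` is a hypothetical helper, not part of the Python SDK, and it assumes the layout used in the examples above, where per-step settings live under the `processing` key of the source configuration.

```python
import copy
from typing import Any, Dict


def enable_link_fetching(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a source config with 3rd party content fetching on.

    Hypothetical helper; assumes per-step settings live under `processing`.
    """
    # Deep copy so the caller's dictionary is never modified in place.
    updated = copy.deepcopy(config)
    processing = updated.setdefault("processing", {})
    step = processing.setdefault("content-augmentation", {})
    step.update({"enabled": True, "fetch_link_content": True})
    return updated


# Example input: a feed source without link fetching configured.
existing = {
    "url": "http://newsfeed.zeit.de/index",
    "processing": {"content-augmentation": {"enabled": True}},
}
new_config = enable_link_fetching(existing)
```

The returned dictionary could then be passed as `config` to `client.modify_project_source(...)`, as in the example above.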