The content cleanup enrichment cleans up incoming text and removes potentially malicious content from the HTML body.
Enrichment name | cleanup |
---|---|
Stage | indexing |
Overview
The cleanup step is used to clean content as it comes in. From text fields, such as title or summary, any HTML tags are removed. From the HTML field body, potentially harmful tags and attributes are removed, such as script tags.
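To make the two behaviors concrete, the following is a minimal sketch using only the Python standard library. It is not the enrichment's actual implementation, which is not shown here. The names strip_tags and sanitize_body are hypothetical, and the blocked tag set and the rule that drops attributes starting with "on" are assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects text nodes and discards every tag (text fields)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class _BodySanitizer(HTMLParser):
    """Re-emits HTML while dropping blocked tags and event attributes."""

    # Assumption: only script is blocked here; a real sanitizer covers more.
    BLOCKED_TAGS = {"script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.blocked_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKED_TAGS:
            self.blocked_depth += 1
            return
        if self.blocked_depth:
            return
        # Assumption: attributes such as onclick are treated as harmful.
        safe = [(n, v) for n, v in attrs if not n.startswith("on")]
        rendered = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in safe
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in self.BLOCKED_TAGS:
            self.blocked_depth = max(0, self.blocked_depth - 1)
            return
        if not self.blocked_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside a blocked element (e.g. script source) is dropped.
        if not self.blocked_depth:
            self.parts.append(escape(data, quote=False))


def _run(parser, text):
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


def strip_tags(text):
    """Remove all HTML tags, as described for title or summary."""
    return _run(_TagStripper(), text)


def sanitize_body(html_body):
    """Remove script elements and on* attributes from an HTML body."""
    return _run(_BodySanitizer(), html_body)


print(strip_tags("Item 01 <b>with HTML</b>"))
print(sanitize_body('<p onclick="x()">Hi</p><script>alert(1)</script>'))
```

A blocklist like this one is only meant to illustrate the idea; it should not be relied on as a security control.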
Configuration
This enrichment does not take any configuration.
Examples
The following examples all use the Python SDK to show how the cleanup enrichment step can be used.
Item Uploader
The following example details how to disable cleanup when uploading items using the ItemUploader.
    from squirro_client import ItemUploader

    # processing config to disable the cleanup step
    processing_config = {
        'cleanup': {
            'enabled': False,
        },
    }

    uploader = ItemUploader(…, processing_config=processing_config)

    # item whose title contains HTML
    items = [
        {
            'link': 'http://www.example.com',
            'title': 'Item 01 <b>with HTML</b>',
        },
    ]

    uploader.upload(items)
New Data Source
The following example details how to disable cleanup for a new feed data source.
    from squirro_client import SquirroClient

    client = SquirroClient(None, None, cluster='https://next.squirro.net/')
    client.authenticate(refresh_token='293d…a13b')

    # processing config to disable the cleanup step
    processing_config = {
        'cleanup': {
            'enabled': False,
        },
    }

    # source configuration
    config = {
        'url': 'http://newsfeed.zeit.de/index',
        'processing': processing_config
    }

    # create new source subscription
    client.new_subscription(
        project_id='…', object_id='default', provider='feed', config=config)
Existing Data Source
The following example details how to disable cleanup for an existing source. Items which have already been processed are not updated.
    from squirro_client import SquirroClient

    client = SquirroClient(None, None, cluster='https://next.squirro.net/')
    client.authenticate(refresh_token='293d…a13b')

    # Get existing source configuration (including processing configuration)
    source = client.get_subscription(
        project_id='…', object_id='…', subscription_id='…')
    config = source.get('config', {})
    # Read from the same 'processing' key that is written back below
    processing_config = config.get('processing', {})

    # Modify processing configuration
    processing_config['cleanup'] = {
        'enabled': False,
    }
    config['processing'] = processing_config

    client.modify_subscription(
        project_id='…', object_id='…', subscription_id='…', config=config)
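The read-modify-write step can also be isolated into a small helper that leaves the fetched configuration untouched and preserves any other processing settings. This is a sketch only: with_cleanup_disabled is a hypothetical helper, not part of the SDK, and it assumes the processing settings live under the 'processing' key as in the new data source example. The 'other-step' entry in the usage lines is a made-up placeholder.

```python
import copy


def with_cleanup_disabled(config):
    """Return a copy of a source config with the cleanup step disabled."""
    updated = copy.deepcopy(config)  # do not mutate the caller's dict
    processing = updated.setdefault('processing', {})
    processing['cleanup'] = {'enabled': False}
    return updated


# 'other-step' is a hypothetical setting used to show it is preserved
existing = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': {'other-step': {'enabled': True}},
}
new_config = with_cleanup_disabled(existing)
print(new_config['processing'])
```

The returned dictionary could then be passed as the config argument when modifying the subscription.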