The content cleanup enrichment cleans up incoming text and removes potentially malicious content from the HTML body.

Overview

The cleanup step cleans content as it comes in. In plain-text fields, such as title or summary, all HTML tags are removed. In the HTML body field, potentially harmful tags and attributes, such as script tags, are removed.
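
To make the two behaviours concrete, the following is a minimal sketch built on Python's standard `html.parser` module. It is not the enrichment's actual implementation: the function names (`strip_tags`, `sanitize_body`, `cleanup_item`), the list of harmful tags, and the attribute rules are all assumptions chosen for illustration.

```python
import html
from html.parser import HTMLParser

# Assumed lists for illustration only; the real enrichment's rules may differ.
HARMFUL_TAGS = {"script", "style", "iframe", "object"}
TEXT_FIELDS = ("title", "summary")


class _TagStripper(HTMLParser):
    """Keeps only text content and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class _BodySanitizer(HTMLParser):
    """Re-emits HTML while dropping harmful elements and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a harmful element

    def handle_starttag(self, tag, attrs):
        if tag in HARMFUL_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        rendered = ""
        for name, value in attrs:
            # Drop inline event handlers (onclick, ...) and javascript: URLs.
            if name.startswith("on"):
                continue
            if (value or "").strip().lower().startswith("javascript:"):
                continue
            if value is None:
                rendered += f" {name}"
            else:
                rendered += f' {name}="{html.escape(value, quote=True)}"'
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in HARMFUL_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; re-escape them.
            self.parts.append(html.escape(data, quote=False))


def _run(parser, text):
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


def strip_tags(text):
    """Remove all HTML tags from a plain-text field."""
    return _run(_TagStripper(), text)


def sanitize_body(body):
    """Remove harmful tags and attributes from an HTML body."""
    return _run(_BodySanitizer(), body)


def cleanup_item(item):
    """Return a cleaned copy of an item dictionary."""
    cleaned = dict(item)
    for field in TEXT_FIELDS:
        if field in cleaned:
            cleaned[field] = strip_tags(cleaned[field])
    if "body" in cleaned:
        cleaned["body"] = sanitize_body(cleaned["body"])
    return cleaned
```

With this sketch, `strip_tags('Item 01 <b>with HTML</b>')` returns `'Item 01 with HTML'`, and a body containing a script element or an `onclick` attribute comes back without them. A hand-written blocklist like this one is easy to bypass, so it should be read as an illustration of the idea rather than as a sanitizer to deploy.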

Configuration

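This enrichment does not take any configuration options. It can only be enabled or disabled through the processing configuration, as the examples on this page show.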
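
The only setting used on this page is the `enabled` flag under the `cleanup` key of the processing configuration. The following sketch shows one way to toggle that flag on an existing processing configuration without modifying the original dictionary. The helper name `with_cleanup` is hypothetical and not part of the Squirro SDK.

```python
import copy


def with_cleanup(processing_config=None, enabled=True):
    """Return a copy of the processing config with the cleanup step toggled.

    Hypothetical helper for illustration; it only builds a dictionary.
    """
    config = copy.deepcopy(processing_config) if processing_config else {}
    config.setdefault("cleanup", {})["enabled"] = enabled
    return config


# Build a processing config that disables cleanup.
processing_config = with_cleanup(enabled=False)
print(processing_config)
```

The resulting dictionary has the same shape as the `processing_config` values passed to the SDK in the examples on this page.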

Examples

The following examples all use the Python SDK to show how the content cleanup enrichment step can be used.

Item Uploader

The following example details how to disable cleanup when uploading items using the ItemUploader.

from squirro_client import ItemUploader

# processing config that disables the cleanup step
processing_config = {
    'cleanup': {
        'enabled': False,
    },
}

uploader = ItemUploader(…, processing_config=processing_config)

# item whose title contains HTML markup
items = [
    {
        'link': 'http://www.example.com',
        'title': 'Item 01 <b>with HTML</b>',
    },
]
uploader.upload(items)

New Data Source

The following example details how to disable cleanup for a new feed data source.

from squirro_client import SquirroClient
 
client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')
 
# processing config that disables the cleanup step
processing_config = {
    'cleanup': {
        'enabled': False,
    },
}
 
# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}
 
# create new source subscription
client.new_subscription(
    project_id='…', object_id='default', provider='feed', config=config)

Existing Data Source

The following example details how to disable cleanup for an existing source. Items which have already been processed are not updated.

...

This page can now be found at Content Standardization on the Squirro Docs site.