Executive Summary

The data processing pipeline can be instructed to detect and remove boilerplate content, such as navigation links and footers. This ensures that only the main content of a document is extracted and indexed.

Architecture

Data Processing

The boilerplate-removal step is used to detect and remove boilerplate content.

Documents are split into individual blocks. Each block is classified into one of two categories: good or bad. Bad blocks correspond to boilerplate content and are removed.

By default the processing step is disabled.
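
The split-and-classify idea described above can be illustrated with a minimal sketch. This is not the pipeline's actual implementation: the `BlockSplitter` class, the `classify` function, and the thresholds are hypothetical, and the heuristics (word count and link density) are only loosely modelled on the shallow text features discussed in the literature listed under References.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collect the text of each <p> element and count the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: {'text': str, 'link_words': int}
        self._current = None   # block being built, or None when outside <p>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._current = {'text': '', 'link_words': 0}
        elif tag == 'a':
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'p' and self._current is not None:
            self.blocks.append(self._current)
            self._current = None
        elif tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        if self._current is None:
            return
        self._current['text'] += data
        if self._in_link:
            self._current['link_words'] += len(data.split())


def classify(block, min_words=10, max_link_density=0.5):
    """Label a block 'good' or 'bad'. The thresholds are illustrative only."""
    words = len(block['text'].split())
    if words == 0:
        return 'bad'
    link_density = block['link_words'] / words
    if words < min_words or link_density > max_link_density:
        return 'bad'
    return 'good'


def remove_boilerplate(html):
    """Return the text of all blocks that were classified as 'good'."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return [b['text'].strip() for b in splitter.blocks if classify(b) == 'good']
```

Calling `remove_boilerplate(html)` on a page with a short link-only paragraph followed by a long text paragraph keeps only the long paragraph. A production classifier would use many more features than the two shown here.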

Classifiers

The following classifiers are available.

Name	Description
`DefaultClassifier`	Default classifier which tries to extract as much "good" content as possible.
`NewsClassifier`	Classifier geared towards boilerplate detection on news sites. Additional news site specific heuristics are used to remove additional "bad" blocks.
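
A classifier is selected by name in the processing configuration. The helper below is a hypothetical convenience function, not part of the SDK; it only builds a dictionary whose key layout mirrors the configuration examples in this document. Whether a `classifier` entry without `args` is accepted by the pipeline is an assumption.

```python
# Classifier names taken from the table above.
KNOWN_CLASSIFIERS = ('DefaultClassifier', 'NewsClassifier')


def boilerplate_config(classifier_name=None):
    """Build a processing config that enables the boilerplate-removal step.

    If no classifier name is given, the 'classifier' key is omitted so that
    the pipeline falls back to its default classifier.
    """
    step = {'enabled': True}
    if classifier_name is not None:
        if classifier_name not in KNOWN_CLASSIFIERS:
            raise ValueError('unknown classifier: %r' % (classifier_name,))
        step['classifier'] = {'name': classifier_name}
    return {'boilerplate-removal': step}
```

For example, `boilerplate_config('NewsClassifier')` returns a dictionary that can be passed as the processing configuration in the examples below.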

Examples

Python SDK

Item Uploader

The following example shows how to enable boilerplate removal. Unless specified otherwise, the `DefaultClassifier` is used to detect boilerplate content.

from squirro_client import ItemUploader
# processing config to detect boilerplate with the default classifier
processing_config = {
    'boilerplate-removal': {
        'enabled': True,
    },
}
uploader = ItemUploader(..., processing_config=processing_config)
html_body = """
<html><body>
<p><a href="http://www.example.com">Boilerplate</a></p>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras aliquet
venenatis blandit. Phasellus dapibus mi eu metus maximus, nec malesuada urna
congue. Vivamus in cursus risus. Sed neque ligula, lobortis in sollicitudin
quis, efficitur eu metus. Pellentesque eu nunc sit amet turpis bibendum
volutpat eu ac ante. Nam posuere eleifend rhoncus. Vivamus purus tellus,
interdum ac semper euismod, scelerisque ut ipsum. Phasellus ut convallis nunc,
quis finibus velit. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. Maecenas euismod placerat diam, at
pellentesque quam eleifend ac. Nunc quis est laoreet, hendrerit dui vel,
ornare sem. Integer volutpat ullamcorper orci quis accumsan. Proin
pellentesque vulputate pellentesque. Sed sapien ante, elementum sed lorem vel,
bibendum tristique arcu.</p>
</body></html>
"""
items = [
    {
        'body': html_body,
        'title': 'Item 01',
    },
]
uploader.upload(items)

In the example above the processing pipeline is instructed to remove the boilerplate content (i.e. the first <p>...</p> HTML element) from the item body.

New Data Source

The following example shows how to enable 3rd party content fetching and boilerplate removal for a new feed data source. The `NewsClassifier` is used for block classification. In addition, a custom rule is added to detect end-of-content blocks: all blocks that follow a block matching one of the regular expressions are classified as boilerplate.

from squirro_client import SquirroClient
client = SquirroClient(...)
# processing config to fetch 3rd party content and detect boilerplate with the
# news classifier
processing_config = {
    'boilerplate-removal': {
        'classifier': {
            'args': {
                'NewsEndOfContentClassifier': {
                    'rules': {
                        'de': [
                            '^Mehr zum Thema$',
                        ],
                    },
                },
            },
            'name': 'NewsClassifier',
        },
        'enabled': True,
    },
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}
# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}
# create new source subscription
client.new_subscription(
    project_id='...', object_id='default', provider='feed', config=config)
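
The effect of an end-of-content rule can be sketched as follows. This is an illustration under stated assumptions, not the `NewsEndOfContentClassifier` implementation: the function name and label strings are hypothetical, and it is assumed that patterns are applied with `re.search` to the stripped block text and that only the first matching block is relevant.

```python
import re


def apply_end_of_content(blocks, labels, patterns):
    """Override block labels once an end-of-content block is found.

    blocks   -- list of block texts, in document order
    labels   -- list of labels ('good' or 'bad') from a previous classifier
    patterns -- list of regular expressions marking the end of the content
    """
    compiled = [re.compile(p) for p in patterns]
    result = list(labels)
    for index, text in enumerate(blocks):
        if any(c.search(text.strip()) for c in compiled):
            # The matching block itself is the end-of-content marker ...
            result[index] = 'end-of-content'
            # ... and every block after it is treated as boilerplate.
            for later in range(index + 1, len(blocks)):
                result[later] = 'bad'
            break
    return result
```

With the rule `'^Mehr zum Thema$'` from the configuration above, a block containing only "Mehr zum Thema" marks the end of the article, and any related-links blocks that follow it are labelled as boilerplate regardless of their earlier classification.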

An example article from the configured source is depicted below. Article content is shown in green and boilerplate content in red. There is also a single end-of-content block shown in blue.

Existing Data Source

The following example details how to enable 3rd party content fetching and boilerplate removal for an existing source. Items which have already been processed are not updated.

from squirro_client import SquirroClient
client = SquirroClient(...)
# get existing source configuration (including processing configuration)
source = client.get_project_source(project_id='...', source_id='...')
config = source.get('config', {})
processing_config = config.get('processing', {})
# modify processing configuration
processing_config['content-augmentation'] = {
    'enabled': True, 'fetch_link_content': True,
}
processing_config['boilerplate-removal'] = {'enabled': True}
config['processing'] = processing_config
client.modify_project_source(project_id='...', source_id='...', config=config)

In the example above the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body. After the content is fetched, boilerplate is detected and removed.
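
The configuration update can also be written as a small pure function, which makes it easy to check before calling the API. The helper below is hypothetical and works on plain dictionaries only. It assumes the processing configuration lives under the `processing` key, as in the new data source example. Unlike the example above, it merges into existing step settings rather than replacing them, so a previously configured classifier is kept.

```python
import copy


def enable_boilerplate_removal(config):
    """Return a copy of a source config with both processing steps enabled.

    The input dictionary is left untouched. Existing settings of the two
    steps (for example a configured classifier) are preserved.
    """
    updated = copy.deepcopy(config)
    processing = updated.setdefault('processing', {})
    processing.setdefault('content-augmentation', {}).update(
        {'enabled': True, 'fetch_link_content': True})
    processing.setdefault('boilerplate-removal', {})['enabled'] = True
    return updated
```

The returned dictionary can then be passed as `config` to `client.modify_project_source(...)` in the same way as in the example above.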

References

Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features", WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA, 2010.
Jan Pomikálek, "Removing Boilerplate and Duplicate Content from Web Corpora", Brno, 2011.
