The Noise Removal enrichment (internally named boilerplate-removal) strips unnecessary content from item bodies. This is useful to extract and index only the main content of a document.

Enrichment name	boilerplate-removal
Stage	processing
Enabled by default	No

Overview

The boilerplate-removal step is used to detect and remove boilerplate content.

...

By default, the processing step is disabled.

Configuration

Field Description

classifier

Sets the classifier used for extracting the relevant content. Defaults to the DefaultClassifier.

This configuration value is a dictionary with two possible keys:

name: The name of the classifier to instantiate. See the table below for valid classifiers.
args: Arguments to pass to the classifier on instantiation.
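
As a rough sketch, the classifier value could be written as the Python dictionary below. Only the classifier, name and args keys come from the description above. The outer layout (a boilerplate-removal entry with an enabled flag) and the empty dictionary used for args are assumptions made for illustration, since the valid arguments are not listed here.

```python
# Hypothetical processing config. Only "classifier", "name" and "args"
# are taken from the field description; everything else is assumed.
processing_config = {
    "boilerplate-removal": {
        "enabled": True,  # assumed flag; the step is disabled by default
        "classifier": {
            "name": "DefaultClassifier",  # name of the classifier to instantiate
            "args": {},  # assumed shape; classifier arguments left empty
        },
    },
}
```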

There are two classifiers available, geared towards different use cases.

DefaultClassifier

This is the default classifier, which tries to extract as much "good" content as possible.

NewsClassifier

This classifier is geared towards boilerplate detection on news sites. It applies heuristics specific to news sites to remove additional "bad" blocks.

...

When the end-of-content classifier finds a text block that starts with any of the given rules, it marks all later blocks as "bad" and does not include them in the final content. The rules are language-specific, and the special code "all" can be used to define rules that always apply.
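
The end-of-content behaviour described above can be sketched in plain Python. This is not the enrichment's actual implementation: the function name, the rule layout (a dictionary mapping a language code or "all" to a list of prefixes), the case-sensitive matching and the choice to drop the matching block itself are all assumptions made for illustration.

```python
def drop_after_end_of_content(blocks, rules, language):
    """Return the text blocks that come before the first end-of-content block."""
    # Rules under "all" always apply; language-specific rules are added on top.
    prefixes = tuple(rules.get("all", [])) + tuple(rules.get(language, []))
    kept = []
    for text in blocks:
        # Assumption: the matching block and everything after it is "bad".
        if text.strip().startswith(prefixes):
            break
        kept.append(text)
    return kept


# Hypothetical rules and blocks, for illustration only.
rules = {"all": ["Related articles"], "en": ["Read more"]}
blocks = [
    "Main story text.",
    "More story text.",
    "Read more: other stories",
    "Footer links",
]
kept = drop_after_end_of_content(blocks, rules, "en")
print(kept)
```

With the English rules, the third block starts with "Read more", so only the first two blocks are kept. With a language that has no matching rule, all four blocks pass through unchanged.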

Examples

The following examples all use the Python SDK to show how the boilerplate removal enrichment step can be used.

Item Uploader

The boilerplate removal step can be activated by passing in a processing config to the ItemUploader. The DefaultClassifier is used in this example.

...

In the example above the processing pipeline is instructed to remove the boilerplate content (i.e. the first <p>...</p> HTML element) from the item body.

New Data Source

The following example details how to enable third-party content augmentation and boilerplate removal for a new feed data source. For block classification, the NewsClassifier is used to detect boilerplate content. A custom rule to detect end-of-content blocks is added as well.

...

An example article from the configured source is depicted below. Article content is shown in green and boilerplate content in red. There is also a single end-of-content block shown in blue.

[Image: example article with article content, boilerplate content and the end-of-content block highlighted]

Existing Data Source

The following example details how to enable third-party content fetching and boilerplate removal for an existing source. Items which have already been processed are not updated.

...

In the example above, the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body. After the content is fetched, boilerplate is detected and removed.

References

Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features", WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA, 2010.
Jan Pomikálek, "Removing Boilerplate and Duplicate Content from Web Corpora", Brno, 2011.
