The boilerplate removal enrichment can strip unnecessary content from item bodies. This is useful to extract and index the main content of a document.
Enabled by default: No
The boilerplate-removal step is used to detect and remove boilerplate content.
Documents are split into individual blocks. These blocks are then classified into two categories: good and bad. Bad blocks correspond to boilerplate content and are removed.
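The split-then-classify idea can be illustrated with a small, self-contained sketch. This is not the classifier used by the enrichment: the block tags, the `min_words` and `max_link_density` thresholds, and the names `BlockSplitter`, `classify_blocks` and `main_content` are all assumptions chosen for illustration. It only shows how shallow text features (word count and link density) can separate "good" from "bad" blocks.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Split HTML into text blocks and count the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_words)
        self._pieces = []      # raw text pieces of the current block
        self._link_words = 0   # words of the current block inside <a>
        self._link_depth = 0   # > 0 while inside an <a> element

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_words))
        self._pieces, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        self._pieces.append(data)
        if self._link_depth:
            self._link_words += len(data.split())

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def classify_blocks(html, min_words=10, max_link_density=0.3):
    """Return (label, text) pairs; label is "good" or "bad"."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    result = []
    for text, link_words in splitter.blocks:
        words = len(text.split())
        link_density = min(1.0, link_words / words)
        good = words >= min_words and link_density <= max_link_density
        result.append(("good" if good else "bad", text))
    return result


def main_content(html):
    """Join the text of all blocks classified as good."""
    return "\n".join(t for label, t in classify_blocks(html) if label == "good")


sample = """
<html><body>
<p><a href="http://www.example.com">Home</a> <a href="/about">About</a></p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.</p>
</body></html>
"""
labels = [label for label, _ in classify_blocks(sample)]
```

Short, link-heavy blocks such as navigation bars end up as "bad", while longer runs of plain text are kept.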
By default the processing step is disabled.
Sets the classifier used for extracting the relevant content. Defaults to the DefaultClassifier.
This configuration value is a dictionary with two possible keys:
There are two classifiers available, geared towards different use cases.
The DefaultClassifier is the default classifier. It tries to extract as much "good" content as possible.
The NewsClassifier is geared towards boilerplate detection on news sites. News-site-specific heuristics are used to remove additional "bad" blocks.
The NewsEndOfContentClassifier sub-classifier is used to detect where the news story ends. This class can be configured using additional rules to tailor the behavior to individual sites. An example processing config:
When the end-of-content classifier finds a text block that starts with any of the given rules, it marks all later blocks as "bad" and excludes them from the final content. The rules are language-specific; the special code "all" can be used to define rules that always apply.
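The rule matching described above can be sketched in a few lines of Python. This is only an illustration, not the NewsEndOfContentClassifier itself: the layout of the `rules` dictionary (language code mapped to a list of prefixes), the case-sensitive prefix match, the separate "end-of-content" label for the matching block, and the function name `mark_end_of_content` are all assumptions.

```python
def mark_end_of_content(blocks, rules, language):
    """Label text blocks as "good", "end-of-content" or "bad".

    `rules` maps a language code to a list of prefixes (assumed layout).
    Rules under the special code "all" apply to every language.
    """
    prefixes = tuple(
        p for p in list(rules.get("all", [])) + list(rules.get(language, [])) if p
    )
    labels = []
    ended = False
    for text in blocks:
        if ended:
            # Everything after the end-of-content block is dropped.
            labels.append("bad")
        elif prefixes and text.lstrip().startswith(prefixes):
            labels.append("end-of-content")
            ended = True
        else:
            labels.append("good")
    return labels


rules = {"all": ["Read more:"], "de": ["Mehr zum Thema"]}
blocks = [
    "Story paragraph one.",
    "Story paragraph two.",
    "Mehr zum Thema: Wirtschaft",
    "Related link",
]
german = mark_end_of_content(blocks, rules, "de")
english = mark_end_of_content(blocks, rules, "en")
```

With the German rule active, the third block ends the story and the fourth is marked "bad". For English items only the "all" rule applies, so none of these blocks is cut.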
The following examples all use the Python SDK to show how the boilerplate removal enrichment step can be used.
The boilerplate removal step can be activated by passing a processing config to the ItemUploader. The DefaultClassifier is used in this example.
In the example above the processing pipeline is instructed to remove the boilerplate content (i.e. the first <p>...</p> HTML element) from the item body.
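A minimal sketch of what such a setup could look like is given here. The step name `boilerplate-removal` comes from this page; the `enabled` key, the item fields other than `body`, and the way the config is handed to the ItemUploader are assumptions and should be checked against the SDK reference before use.

```python
# Step name taken from this page; the "enabled" key is an assumption.
processing_config = {
    "boilerplate-removal": {
        "enabled": True,
    },
}

# The first <p> element only holds a link and stands in for boilerplate.
html_body = """
<html><body>
<p><a href="http://www.example.com">Boilerplate</a></p>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.</p>
</body></html>
"""

items = [{"title": "Boilerplate removal", "body": html_body}]

# Assumed call shape, not verified against the SDK:
#   uploader = ItemUploader(<connection settings>, processing_config=processing_config)
#   uploader.upload(items)
```

No classifier is named in the config, so the default classifier would apply.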
New Data Source
The following example details how to enable third-party content augmentation and boilerplate removal for a new feed data source. For block classification the NewsClassifier is used to detect boilerplate content. A custom rule to detect end-of-content blocks is added as well.
An example article from the configured source is depicted below. Article content is shown in green and boilerplate content in red. There is also a single end-of-content block shown in blue.
Existing Data Source
The following example details how to enable third-party content fetching and boilerplate removal for an existing source. Items that have already been processed are not updated.
In the example above the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body. After the content is fetched, boilerplate is detected and removed.
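The order of the two steps can be sketched as follows. The function `process_new_item` and both stand-in helpers are hypothetical and exist only so that the sketch runs offline; the real pipeline performs the fetching and the boilerplate detection itself.

```python
def process_new_item(item, fetch_content, remove_boilerplate):
    """Fetch the linked page, use it as the body, then strip boilerplate."""
    processed = dict(item)  # leave the incoming item untouched
    # Step 1: content fetching, driven by the item's link attribute.
    processed["body"] = fetch_content(item["link"])
    # Step 2: boilerplate removal on the freshly fetched body.
    processed["body"] = remove_boilerplate(processed["body"])
    return processed


# Hypothetical stand-ins so the sketch runs without network access.
def fake_fetch(url):
    return "<p>Menu</p><p>Full article text for " + url + "</p>"


def drop_first_paragraph(html):
    # Toy remover: drops everything up to and including the first </p>.
    head, sep, tail = html.partition("</p>")
    return tail if sep else html


item = {"title": "Example", "link": "http://www.example.com/story"}
result = process_new_item(item, fake_fetch, drop_first_paragraph)
```

Running boilerplate removal before the fetch would have no effect here, because the incoming item has no body until the link has been resolved.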
- Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features", WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
- Jan Pomikálek, "Removing Boilerplate and Duplicate Content from Web Corpora", Brno, 2011.