...

By default the processing step is disabled.

Configuration

...

Sets the classifier used for extracting the relevant content. Defaults to the DefaultClassifier.

This configuration value is a dictionary with two possible keys:

...

Configuration is done using a JSON dictionary, which takes two key/value pairs at the top level.

Field    Description
name     The name of the classifier to instantiate. See the table below for valid classifiers.
args     Arguments to pass to the classifier on instantiation.
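As an illustration of the two fields above, the following is a minimal sketch of how such a classifier dictionary could be assembled in Python before being serialized to JSON. The helper `make_classifier_config` is hypothetical and not part of any SDK. The sketch also assumes that `args` may be omitted when there is nothing to pass, and that the default classifier is selected by the string "DefaultClassifier".

```python
import json


def make_classifier_config(name="DefaultClassifier", args=None):
    """Build a classifier dictionary with the "name" and "args" fields.

    Hypothetical helper for illustration only. Assumes "args" can be
    left out when no arguments are needed.
    """
    config = {"name": name}
    if args is not None:
        config["args"] = args
    return config


# Assumption: the default classifier is referred to as "DefaultClassifier".
default_config = make_classifier_config()

# A classifier with (illustrative) arguments, serialized as JSON.
news_config = make_classifier_config("NewsClassifier", {})
news_json = json.dumps(news_config, indent=4)
```

Building the dictionary in code and serializing it with `json.dumps` avoids hand-written JSON mistakes such as missing commas or unbalanced braces.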

There are two classifiers available, geared towards different use cases.

...

The NewsEndOfContentClassifier sub-classifier is used to detect where the news story ends. This class can be configured using additional rules to tailor the behavior to individual sites. An example processing configuration:

{
    "boilerplate-removal": {
        "enabled": true,
        "classifier": {
            "name": "NewsClassifier",
            "args": {
                "NewsEndOfContentClassifier": {
                    "rules": {
                        "en": [
                            "Share this story"
                        ]
                    }
                }
            }
        }
    }
}

When the end-of-content classifier finds a text block that starts with any of the given rules, it marks all later blocks as "bad" and excludes them from the final content. The rules are language-specific; the special code "all" can be used to define rules that apply regardless of language.
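The behavior described above can be approximated with a short Python sketch. This is not the classifier's actual implementation: the function name `keep_until_end_of_content` is hypothetical, matching is assumed to be a case-sensitive prefix comparison, and the block that matches a rule is assumed to be dropped along with everything after it (the text above only states that later blocks are marked "bad").

```python
def keep_until_end_of_content(blocks, rules, language):
    """Return the text blocks that precede the first end-of-content marker.

    Illustrative sketch only. `rules` maps a language code (or the special
    code "all") to a list of prefixes, mirroring the "rules" dictionary in
    the configuration.
    """
    # Combine the rules for the given language with the "all" rules.
    prefixes = tuple(rules.get(language, [])) + tuple(rules.get("all", []))

    kept = []
    for block in blocks:
        # Assumption: the matching block itself is also excluded.
        if block.startswith(prefixes):
            break
        kept.append(block)
    return kept


blocks = ["Headline", "Body text.", "Share this story", "Related articles"]
rules = {"en": ["Share this story"]}
content = keep_until_end_of_content(blocks, rules, "en")
# content == ["Headline", "Body text."]
```

With the same rules and a different language code, for example "de", no prefix matches and all four blocks are kept, unless the rule is listed under "all" instead.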

Examples

The following examples all use the Python SDK to show how the boilerplate removal enrichment step can be used.

...