...

In the example above, the processing pipeline is instructed to fetch the content of every new incoming item (from its link attribute) and use it as the item body. Once the content is fetched, boilerplate is detected and removed.
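The two steps described here (fetch the page behind the item link, then strip boilerplate) can be illustrated with a minimal Python sketch. This is not the pipeline's actual implementation: the names `process_item`, `fetch_html`, `remove_boilerplate` and `BlockCollector`, the dictionary-shaped item, and the thresholds are all assumptions made for illustration. The filter is a deliberately simplified heuristic that uses two shallow text features (word count and link density) and should not be read as the algorithm from either reference.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags treated as block boundaries (an assumed, non-exhaustive set).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br", "body",
              "nav", "header", "footer", "article", "section",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split HTML into text blocks and count the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, words_inside_links)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        if self._words:
            self.blocks.append((" ".join(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def remove_boilerplate(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_words in parser.blocks:
        n = len(text.split())
        if n >= min_words and link_words / n <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)


def fetch_html(url):
    """Download a page and decode it with the declared charset (UTF-8 fallback)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def process_item(item, fetch=fetch_html):
    """Replace the item body with the cleaned content behind item['link']."""
    html = fetch(item["link"])
    return {**item, "body": remove_boilerplate(html)}
```

Passing `fetch` as a parameter keeps the sketch easy to try without network access: a call such as `process_item(item, fetch=lambda url: saved_html)` runs the same boilerplate filter on a locally stored page.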

References

  1. Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features", WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
  2. Jan Pomikálek, "Removing Boilerplate and Duplicate Content from Web Corpora", Brno, 2011.