...
In the example above, the processing pipeline is instructed to fetch the content for every new incoming item (from the link
attribute) and use it as the item body. After the content is fetched, boilerplate is detected and removed.
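
The two steps described here (fetch the linked page, then strip boilerplate) can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the names `process_item`, `remove_boilerplate` and `BlockExtractor`, the item layout (a dict with `link` and `body` keys), and the thresholds are all assumptions made for illustration. The boilerplate filter is a deliberately simplified heuristic based on block length and link density, loosely in the spirit of shallow text features, and not the algorithm from either reference below.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}


class BlockExtractor(HTMLParser):
    """Split HTML into text blocks and count the words that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # (text, word_count, linked_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0     # nesting depth of <a> elements
        self._skip = 0        # nesting depth of <script>/<style>

    def flush(self):
        """Close the current block, if it holds any words."""
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def remove_boilerplate(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text
        for text, count, linked in parser.blocks
        if count >= min_words and linked / count <= max_link_density
    ]
    return "\n\n".join(kept)


def process_item(item, fetch):
    """Fetch the page behind item["link"] and use its cleaned text as the body.

    `fetch` is any callable that maps a URL to an HTML string; it is
    injected so the sketch does not depend on a particular HTTP client.
    """
    html = fetch(item["link"])
    return {**item, "body": remove_boilerplate(html)}
```

Passing the fetcher in as a parameter keeps the sketch independent of any HTTP library and makes it easy to try with a canned HTML string. With these thresholds, short blocks such as navigation bars and copyright lines are dropped, while longer paragraphs containing few links are kept.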
References
- Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features", WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
- Jan Pomikálek, "Removing Boilerplate and Duplicate Content from Web Corpora", Brno, 2011.