The content conversion enrichment converts incoming content to HTML. It is used to extract textual content from PDF, Office, and other binary file formats.

Enrichment name: content-conversion
Stage: content
Enabled by default: Yes

...

The following content MIME types are supported for conversion.

| File Extension | MIME Type | Pages Support | Display Support |
|---|---|---|---|
| .pdf | application/pdf | Yes | Full |
| .doc | application/msword | No | HTML only |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | No | HTML only |
| .xls | application/vnd.ms-excel | No | HTML only |
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | No | HTML only |
| .ppt | application/vnd.ms-powerpoint | No | HTML only |
| .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | No | HTML only |
| .rtf | text/rtf | No | HTML only |
| .odt | application/vnd.oasis.opendocument.text | No | HTML only |
| .ods | application/vnd.oasis.opendocument.spreadsheet | No | HTML only |
| .odp | application/vnd.oasis.opendocument.presentation | No | HTML only |
| .sxw | application/vnd.sun.xml.writer | No | HTML only |
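
For client-side checks before uploading a file, the support matrix above can be mirrored in a small lookup. The following is a minimal sketch only: the names `Support`, `SUPPORTED_MIME_TYPES`, and `lookup_support` are hypothetical helpers for illustration and are not part of Squirro. The values are transcribed from the table.

```python
from typing import NamedTuple, Optional


class Support(NamedTuple):
    """One row of the support table (hypothetical helper type)."""
    extension: str
    pages: bool      # "Pages Support" column
    display: str     # "Display Support" column


# Transcribed from the table above; not read from any Squirro API.
SUPPORTED_MIME_TYPES = {
    "application/pdf": Support(".pdf", True, "Full"),
    "application/msword": Support(".doc", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        Support(".docx", False, "HTML only"),
    "application/vnd.ms-excel": Support(".xls", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
        Support(".xlsx", False, "HTML only"),
    "application/vnd.ms-powerpoint": Support(".ppt", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.presentationml.presentation":
        Support(".pptx", False, "HTML only"),
    "text/rtf": Support(".rtf", False, "HTML only"),
    "application/vnd.oasis.opendocument.text": Support(".odt", False, "HTML only"),
    "application/vnd.oasis.opendocument.spreadsheet": Support(".ods", False, "HTML only"),
    "application/vnd.oasis.opendocument.presentation": Support(".odp", False, "HTML only"),
    "application/vnd.sun.xml.writer": Support(".sxw", False, "HTML only"),
}


def lookup_support(mime_type: str) -> Optional[Support]:
    """Return the table row for a MIME type, or None if it is not listed."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return SUPPORTED_MIME_TYPES.get(base)
```

A caller could use `lookup_support("application/pdf")` to decide whether page-level features are worth requesting, and treat a `None` result as "not listed for conversion".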

Configuration

The only configuration option for this enrichment is the enabled property, which enables or disables it.
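
As a rough illustration of toggling that property, the sketch below builds a processing-configuration fragment. Only the enrichment name `content-conversion` and the `enabled` property come from this page. The surrounding dictionary shape and the helper name `content_conversion_config` are assumptions, and how the fragment is handed to the pipeline is not shown here.

```python
import json


def content_conversion_config(enabled: bool = True) -> dict:
    """Build a config fragment for the content-conversion enrichment.

    Assumption: the pipeline accepts a mapping from enrichment name to
    its options. This helper is hypothetical, not a Squirro function.
    """
    return {"content-conversion": {"enabled": bool(enabled)}}


if __name__ == "__main__":
    # Print the fragment that would switch the enrichment off.
    print(json.dumps(content_conversion_config(False), indent=2))
```

Keeping the fragment in a small helper makes it straightforward to merge with the options of other enrichments before the configuration is submitted.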

...

In the example above the processing pipeline is instructed to convert the binary content to HTML and use it as the item body. The original document display and Squirro document display are depicted below.

| File Extension | Original Document Display | Squirro Document Display | Display Support |
|---|---|---|---|
| .pdf | [image] | [image] | Full |
| .docx | [image] | [image] | HTML only |

New Data Source

The following example shows how to disable content conversion for a feed data source.

...