The content conversion enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.
Enrichment name | content-conversion |
---|---|
Stage | content |
...
The following content MIME types are supported for conversion.
File Extension | MIME Type | Pages Support | Display Support |
---|---|---|---|
.pdf | application/pdf | Yes | Full |
.doc | application/msword | No | HTML only |
.docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | No | HTML only |
.xls | application/vnd.ms-excel | No | HTML only |
.xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | No | HTML only |
.ppt | application/vnd.ms-powerpoint | No | HTML only |
.pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | No | HTML only |
.rtf | text/rtf | No | HTML only |
.odt | application/vnd.oasis.opendocument.text | No | HTML only |
.ods | application/vnd.oasis.opendocument.spreadsheet | No | HTML only |
.odp | application/vnd.oasis.opendocument.presentation | No | HTML only |
.sxw | application/vnd.sun.xml.writer | No | HTML only |
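The table above can be mirrored in a small lookup, for example to check a file's MIME type before sending it through the pipeline. This is only a sketch: the names `SUPPORTED_MIME_TYPES` and `conversion_support` are hypothetical and not part of any Squirro API. The values are copied from the table.

```python
# MIME types and display support levels copied from the table above.
# The dictionary and function names are hypothetical, for illustration only.
SUPPORTED_MIME_TYPES = {
    "application/pdf": "Full",
    "application/msword": "HTML only",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "HTML only",
    "application/vnd.ms-excel": "HTML only",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "HTML only",
    "application/vnd.ms-powerpoint": "HTML only",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "HTML only",
    "text/rtf": "HTML only",
    "application/vnd.oasis.opendocument.text": "HTML only",
    "application/vnd.oasis.opendocument.spreadsheet": "HTML only",
    "application/vnd.oasis.opendocument.presentation": "HTML only",
    "application/vnd.sun.xml.writer": "HTML only",
}


def conversion_support(mime_type):
    """Return the display support level for a MIME type, or None if it is not listed."""
    # Drop parameters such as "; charset=..." and normalize case before the lookup.
    base = mime_type.split(";", 1)[0].strip().lower()
    return SUPPORTED_MIME_TYPES.get(base)
```

A caller could skip conversion for any file where `conversion_support(...)` returns `None`.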
Configuration
This enrichment has no configuration options other than the enabled property, which turns it on or off.
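As a rough illustration of that single option, the sketch below builds a settings dictionary keyed by the enrichment name. The dictionary shape and the helper name `content_conversion_config` are assumptions made for illustration. Only the enrichment name content-conversion and the enabled property come from this page.

```python
def content_conversion_config(enabled=True):
    """Build settings for the content-conversion enrichment (assumed, hypothetical shape)."""
    # "content-conversion" is the enrichment name; "enabled" is its only property.
    return {"content-conversion": {"enabled": bool(enabled)}}


# Conversion switched on, and switched off (e.g. for a source that should skip it).
config_on = content_conversion_config(True)
config_off = content_conversion_config(False)
```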
...
In the example above, the processing pipeline is instructed to convert the binary content to HTML and use it as the item body. The original document display and Squirro document display are depicted below.
File Extension | Original Document Display | Squirro Document Display | Display Support |
---|---|---|---|
.pdf | (image) | (image) | Full |
.docx | (image) | (image) | HTML only |
New Data Source
The following example details how to disable content conversion for a feed data source.
...