The content extraction enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.
This step is best combined with Content Augmentation and PDF Conversion to create the best searching experience for documents. See Indexing of Office documents for more.
Property | Value
---|---
Enrichment name | content-conversion
Stage | content
Overview
The content-conversion step is used to convert incoming content to HTML. For supported document formats, incoming documents are split into individual pages, each represented as a separate HTML document.
The converted content is used to set the body attribute.
Supported Content MIME Types
The following content MIME types are supported for conversion.
Display Support refers to how the documents are displayed to the user. To display all office formats to the user with full display support, the PDF Conversion step can be inserted prior to this step. See Indexing of Office documents for a full guide.
File Extension | MIME Type | Pages Support | Display Support
---|---|---|---
.pdf | application/pdf | Yes | Full
.doc | application/msword | No | HTML only
.docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | No | HTML only
.xls | application/vnd.ms-excel | No | HTML only
.xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | No | HTML only
.ppt | application/vnd.ms-powerpoint | No | HTML only
.pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | No | HTML only
.rtf | text/rtf | No | HTML only
.odt | application/vnd.oasis.opendocument.text | No | HTML only
.ods | application/vnd.oasis.opendocument.spreadsheet | No | HTML only
.odp | application/vnd.oasis.opendocument.presentation | No | HTML only
.sxw | application/vnd.sun.xml.writer | No | HTML only
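
When deciding whether a document also needs the PDF Conversion step for full display support, it can help to have the support matrix available as data. The following is a minimal sketch that transcribes the MIME types listed above into a Python lookup. The names `Support`, `SUPPORT_BY_MIME_TYPE`, `lookup_support` and `needs_pdf_conversion_for_full_display` are hypothetical helpers made up for this illustration; they are not part of the enrichment or of any Squirro API.

```python
from typing import NamedTuple, Optional


class Support(NamedTuple):
    """One entry of the support matrix (hypothetical helper type)."""
    pages: bool    # "Pages Support" column
    display: str   # "Display Support" column


HTML_ONLY = Support(pages=False, display="HTML only")

# Transcribed from the list of supported content MIME types above.
SUPPORT_BY_MIME_TYPE = {
    "application/pdf": Support(pages=True, display="Full"),
    "application/msword": HTML_ONLY,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": HTML_ONLY,
    "application/vnd.ms-excel": HTML_ONLY,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": HTML_ONLY,
    "application/vnd.ms-powerpoint": HTML_ONLY,
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": HTML_ONLY,
    "text/rtf": HTML_ONLY,
    "application/vnd.oasis.opendocument.text": HTML_ONLY,
    "application/vnd.oasis.opendocument.spreadsheet": HTML_ONLY,
    "application/vnd.oasis.opendocument.presentation": HTML_ONLY,
    "application/vnd.sun.xml.writer": HTML_ONLY,
}


def lookup_support(mime_type: str) -> Optional[Support]:
    """Return the entry for a MIME type, or None if it is not listed."""
    # Drop parameters such as "; charset=binary" and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return SUPPORT_BY_MIME_TYPE.get(essence)


def needs_pdf_conversion_for_full_display(mime_type: str) -> bool:
    """True for listed formats whose display support is not "Full"."""
    support = lookup_support(mime_type)
    return support is not None and support.display != "Full"
```

For example, `needs_pdf_conversion_for_full_display("application/msword")` is true because .doc files are listed with HTML-only display, while the same call for `application/pdf` is false. MIME types that are not listed return `None` from `lookup_support`, so callers can treat them as unsupported.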
Configuration
There are no configuration options for this enrichment.

This page can now be found at Content Extraction on the Squirro Docs site.