The content extraction enrichment converts incoming content to HTML. It is used to extract textual content from PDF, Office, and other binary file formats.

For the best document search experience, combine this step with Content Augmentation and PDF Conversion. See Indexing of Office documents for more.

Enrichment name: content-conversion

Stage: content

Overview

The content-conversion step is used to convert incoming content to HTML. For supported document formats, incoming documents are split into individual pages, each represented as a separate HTML document.

The converted content is used to set the body attribute.

Supported Content MIME Types

The following content MIME types are supported for conversion.

Display Support describes how a document is displayed to the user. To display all Office formats with full display support, insert the PDF Conversion step before this step. See Indexing of Office documents for a full guide.

| File Extension | MIME Type | Pages Support | Display Support |
|---|---|---|---|
| .pdf | application/pdf | Yes | Full |
| .doc | application/msword | No | HTML only |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | No | HTML only |
| .xls | application/vnd.ms-excel | No | HTML only |
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | No | HTML only |
| .ppt | application/vnd.ms-powerpoint | No | HTML only |
| .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | No | HTML only |
| .rtf | text/rtf | No | HTML only |
| .odt | application/vnd.oasis.opendocument.text | No | HTML only |
| .ods | application/vnd.oasis.opendocument.spreadsheet | No | HTML only |
| .odp | application/vnd.oasis.opendocument.presentation | No | HTML only |
| .sxw | application/vnd.sun.xml.writer | No | HTML only |
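
The support table above can also be encoded as a small lookup, for example to check in a loader script whether a file will be converted and whether running PDF Conversion first would improve how it is displayed. The following Python is a minimal sketch only: the names `ConversionSupport`, `SUPPORTED_MIME_TYPES`, `lookup_support` and `benefits_from_pdf_conversion` are hypothetical helpers for illustration and are not part of any Squirro API. The data is transcribed from the table.

```python
from typing import NamedTuple, Optional


class ConversionSupport(NamedTuple):
    """One row of the support table (hypothetical helper type)."""
    extension: str
    pages: bool    # True if the document is split into per-page HTML
    display: str   # "Full" or "HTML only", as listed in the table


# Transcribed from the table of supported content MIME types.
SUPPORTED_MIME_TYPES = {
    "application/pdf": ConversionSupport(".pdf", True, "Full"),
    "application/msword": ConversionSupport(".doc", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        ConversionSupport(".docx", False, "HTML only"),
    "application/vnd.ms-excel": ConversionSupport(".xls", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
        ConversionSupport(".xlsx", False, "HTML only"),
    "application/vnd.ms-powerpoint": ConversionSupport(".ppt", False, "HTML only"),
    "application/vnd.openxmlformats-officedocument.presentationml.presentation":
        ConversionSupport(".pptx", False, "HTML only"),
    "text/rtf": ConversionSupport(".rtf", False, "HTML only"),
    "application/vnd.oasis.opendocument.text":
        ConversionSupport(".odt", False, "HTML only"),
    "application/vnd.oasis.opendocument.spreadsheet":
        ConversionSupport(".ods", False, "HTML only"),
    "application/vnd.oasis.opendocument.presentation":
        ConversionSupport(".odp", False, "HTML only"),
    "application/vnd.sun.xml.writer": ConversionSupport(".sxw", False, "HTML only"),
}


def lookup_support(mime_type: str) -> Optional[ConversionSupport]:
    """Return the table row for a MIME type, or None if it is not listed."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    key = mime_type.split(";", 1)[0].strip().lower()
    return SUPPORTED_MIME_TYPES.get(key)


def benefits_from_pdf_conversion(mime_type: str) -> bool:
    """True for listed formats whose display support is not "Full"."""
    support = lookup_support(mime_type)
    return support is not None and support.display != "Full"
```

For example, `benefits_from_pdf_conversion("application/msword")` is true because .doc files are listed as "HTML only", while `lookup_support("image/png")` returns `None` because that type is not in the table.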

Configuration

There are no configuration options for this enrichment.

This page can now be found at Content Extraction on the Squirro Docs site.