
Squirro’s pipeline natively supports processing of complex document types, such as common Office formats or PDF. This guide explains how to set up the pipeline to achieve this.

Overview

Squirro can index and display a number of common Office document formats. This enables end users to interact with those documents directly and search within them.

When set up correctly, these documents become searchable and include thumbnails in the result list:

...

They can also be displayed directly in the user interface, without the user having to navigate away to an Office application:

...

Setup

Setting this up is straightforward. Squirro provides all the required Pipeline Steps and default configuration out of the box.

Set up Pipeline Workflow

...

In the Setup space, navigate to Pipeline.

...

At the top right, press the pencil icon to enter edit mode.

...

At the bottom left, choose + New Pipeline to create a new pipeline.

...

A list of pipeline presets is displayed. Select the Binary Documents workflow.

...

In the Pipeline Properties on the right side, give the pipeline workflow a meaningful name, for example “Documents”.

...

This concludes the required setup. The steps that ensure documents are correctly indexed are:

PDF Conversion converts Office formats to PDF.
Content Augmentation makes the contents of the document available to the rest of the pipeline.
Content Extraction extracts the text content from the document to make the document searchable.
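
The order of the three steps listed above matters, because each one relies on the output of the previous one. The following is a minimal conceptual sketch in Python, not Squirro code: the `Document` class, the function names, and the file suffixes are all hypothetical and only mirror the step order described here. The real steps are configured in the pipeline editor and need no code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative subset of Office suffixes; not Squirro's actual list.
OFFICE_SUFFIXES = (".docx", ".xlsx", ".pptx")


@dataclass
class Document:
    """Hypothetical stand-in for an item moving through the workflow."""
    name: str
    raw: bytes
    pdf: Optional[bytes] = None       # filled in by pdf_conversion
    content: Optional[bytes] = None   # filled in by content_augmentation
    text: str = ""                    # filled in by content_extraction
    history: List[str] = field(default_factory=list)


def pdf_conversion(doc: Document) -> Document:
    """Step 1: Office formats are converted to PDF; PDFs pass through."""
    if doc.name.lower().endswith(OFFICE_SUFFIXES):
        doc.history.append("converted to PDF")
    else:
        doc.history.append("kept as PDF")
    doc.pdf = doc.raw  # stand-in: a real converter would produce PDF bytes
    return doc


def content_augmentation(doc: Document) -> Document:
    """Step 2: make the document contents available to later steps."""
    if doc.pdf is None:
        raise ValueError("PDF Conversion must run first")
    doc.content = doc.pdf
    doc.history.append("content attached")
    return doc


def content_extraction(doc: Document) -> Document:
    """Step 3: extract text so the document becomes searchable."""
    if doc.content is None:
        raise ValueError("Content Augmentation must run first")
    # Stand-in: real extraction would parse the PDF rather than decode bytes.
    doc.text = doc.content.decode("utf-8", errors="replace")
    doc.history.append("text extracted")
    return doc


WORKFLOW = (pdf_conversion, content_augmentation, content_extraction)


def run_workflow(doc: Document) -> Document:
    """Apply the steps in the order the workflow defines them."""
    for step in WORKFLOW:
        doc = step(doc)
    return doc
```

In this model, calling `content_extraction` on a document that has not passed through the earlier steps raises an error, which is the dependency the preset workflow takes care of.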

Ingest Data

Once the pipeline workflow is set up, data ingestion can proceed as with any other data source. Make sure to select the newly created workflow so that the data is processed accordingly.
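
When ingestion is scripted rather than configured in the user interface, the same rule applies: every batch should name the workflow created above. The sketch below only illustrates that idea. The `build_upload_request` helper and the dictionary keys (`pipeline_workflow`, `files`, `name`, `mime_type`, `content`) are hypothetical and do not represent Squirro's item schema or API; consult the Squirro documentation for the actual upload interface.

```python
import base64
import mimetypes
from typing import Dict

# Example workflow name taken from the setup steps in this guide.
WORKFLOW_NAME = "Documents"


def build_upload_request(documents: Dict[str, bytes],
                         workflow_name: str = WORKFLOW_NAME) -> dict:
    """Bundle binary documents with the workflow that should process them.

    `documents` maps a file name to its raw bytes. The returned structure
    is an illustrative payload, not a real Squirro request.
    """
    files = []
    for name, data in documents.items():
        # Guess the MIME type from the file name; fall back to a generic type.
        mime_type, _ = mimetypes.guess_type(name)
        files.append({
            "name": name,
            "mime_type": mime_type or "application/octet-stream",
            # Binary content is base64-encoded so it can travel as text.
            "content": base64.b64encode(data).decode("ascii"),
        })
    return {"pipeline_workflow": workflow_name, "files": files}
```

Keeping the workflow name in one constant makes it harder to send a batch to the default workflow by accident, in which case the documents would not go through the conversion and extraction steps.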

...

This page can now be found at Indexing Common Documents on the Squirro Docs site.
