Pipeline

Architecture Overview

As outlined in the Architecture, working with Squirro can be split into a number of steps:

The Load step is handled by the data loader, which gives Squirro a list of records to be indexed.

The pipeline’s task is then to convert those records into properly formatted Squirro items (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/4161560/Item+Format) and store those items in the Squirro storage layer.

From the storage layer, the items can then be retrieved by the Squirro dashboards for visualization and searching.

Pipeline Sections

The pipeline is split into the following sections, mostly to aid understanding and configuration of the various steps.

Section

Description

Enrich

Extracting additional data from records or converting them into text counts as enrichment. This includes language detection, deduplication, and converting binary documents to text.

Relate

Linking the ingested items with each other or with other data sources is part of this section. Most importantly, this includes the Known Entity Extraction steps (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/26247170/Known+Entity+Extraction).

Discover

Discover includes steps around topic modelling and clustering, as well as analysis for the Content-based Typeahead (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/2396061784/Content-based+Typeahead).

Classify

Text classification, for example with the models created in the Squirro AI Studio (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/2220458018/Squirro+AI+Studio), is part of this section.

Predict

Time series detection with the Trend Detection module (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/29261853/Trend+Detection) belongs in this section.

Recommend

This section includes updating recommendation models and generating insights. These are not yet exposed in the user interface.

Automate

Automated actions, such as sending emails, are included as automations. This section is currently empty in the user interface.

Index

This section is not included in the architecture charts, but can be seen and used in the pipeline editor. It includes the steps required to persist Squirro items on disk for searching.

Custom

Custom steps can be added to the pipeline in the form of Pipelets (see https://squirro.atlassian.net/wiki/spaces/DOC/pages/7077924/Pipelets). Currently, these pipelets always show up in a section called Custom, but this will be extended to allow each pipelet to be assigned to one of the above sections as well.
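
To make the grouping concrete, the sections can be modelled as a simple enumeration. The following is only an illustrative sketch and not part of any Squirro API: the names Section and group_by_section, as well as the pipelet name my_pipelet, are hypothetical. The sections appear in the order of the table; whether that order is also the execution order is not assumed here.

```python
from enum import Enum


class Section(Enum):
    """Pipeline sections, in the order they are listed in the table above."""
    ENRICH = "Enrich"
    RELATE = "Relate"
    DISCOVER = "Discover"
    CLASSIFY = "Classify"
    PREDICT = "Predict"
    RECOMMEND = "Recommend"
    AUTOMATE = "Automate"
    INDEX = "Index"
    CUSTOM = "Custom"


def group_by_section(steps):
    """Group (step_name, section) pairs by section name.

    Sections without any steps are left out of the result. The result
    keeps the order in which the sections are defined in Section.
    """
    grouped = {section: [] for section in Section}
    for step_name, section in steps:
        grouped[section].append(step_name)
    return {section.value: names for section, names in grouped.items() if names}


# Hypothetical step list; "my_pipelet" stands in for a custom pipelet,
# which currently always ends up in the Custom section.
steps = [
    ("Language Detection", Section.ENRICH),
    ("my_pipelet", Section.CUSTOM),
    ("Known Entity Extraction", Section.RELATE),
    ("Deduplication", Section.ENRICH),
]
overview = group_by_section(steps)
```

Here overview maps each non-empty section name to the steps assigned to it, which mirrors how the pipeline editor groups steps for display.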

Processing

The pipeline steps are run sequentially. When a pipeline step fails for any reason, the item is re-queued and the full pipeline will be re-run on that item. If processing fails persistently (10 times by default) the item is dropped from the pipeline.
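
The re-queue behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not Squirro's actual implementation: the function process_queue, the constant MAX_ATTEMPTS and the representation of steps as plain callables are all hypothetical, and the sketch assumes that "10 times" counts failed attempts in total and that steps return a new item instead of modifying their input.

```python
from collections import deque

MAX_ATTEMPTS = 10  # default taken from the text; the constant name is hypothetical


def process_queue(items, steps, max_attempts=MAX_ATTEMPTS):
    """Run every item through all steps, re-queueing the item on failure.

    Returns a tuple (processed, dropped). An item is dropped once it has
    failed max_attempts times.
    """
    queue = deque((item, 0) for item in items)
    processed, dropped = [], []
    while queue:
        item, failures = queue.popleft()
        try:
            result = item
            for step in steps:  # steps run sequentially
                result = step(result)
        except Exception:
            failures += 1
            if failures >= max_attempts:
                dropped.append(item)  # persistent failure: give up on the item
            else:
                # Re-queue the original item so the full pipeline is re-run.
                queue.append((item, failures))
            continue
        processed.append(result)
    return processed, dropped
```

Note that the original item, not the partially processed result, is re-queued, so a retry always starts again from the first step.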

Some errors are handled by adding an error code to the item. The known error codes for this are documented in the Processing Error table.

Items are only displayed to users once the full pipeline, with the exception of Search Tagging, has run through. For details on the search tagging delay, see the Search Tagging and Alerting documentation (https://squirro.atlassian.net/wiki/spaces/DOC/pages/2949475/Search+Tagging+and+Alerting).

Configuration

A project can have one or more pipelines. Each data source is associated with one such pipeline. The pipelines are configured using the Pipeline Editor in the Setup space.
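
The relationship between projects, pipelines and data sources can be pictured with a small data model. This is a sketch for illustration only: the classes Project, Pipeline and DataSource and the method add_source are hypothetical and do not correspond to the Squirro API or its storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pipeline:
    name: str


@dataclass
class DataSource:
    name: str
    pipeline: Pipeline  # each data source is associated with exactly one pipeline


@dataclass
class Project:
    name: str
    pipelines: List[Pipeline] = field(default_factory=list)
    sources: List[DataSource] = field(default_factory=list)

    def add_source(self, name, pipeline):
        """Attach a data source to one of this project's pipelines."""
        if not any(p is pipeline for p in self.pipelines):
            raise ValueError("pipeline does not belong to this project")
        source = DataSource(name=name, pipeline=pipeline)
        self.sources.append(source)
        return source


# Hypothetical example: one project, two pipelines, one source per pipeline.
default = Pipeline("Default")
pdf = Pipeline("PDF documents")
project = Project("Demo", pipelines=[default, pdf])
feed = project.add_source("News feed", default)
reports = project.add_source("Report archive", pdf)
```

Several data sources may share a pipeline, but in this model a source never points at more than one.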