Squirro Pipeline 2.0 allows you to import large amounts of data in a resource-friendly way, so that Squirro remains responsive as data flows in. Pipeline 2.0 is available starting with Squirro version 2.5.1, and will become the default way to import data into Squirro with release 2.5.3, later in 2017. This document describes the steps to follow to use Pipeline 2.0 to ingest large amounts of data in Squirro versions 2.5.1 and 2.5.2.
Whenever you have hundreds of millions of documents to import, or if you don't need all the built-in enrichments that come at a processing cost and slow down data ingestion, Pipeline 2.0 is for you. A key feature of Pipeline 2.0 is that you "opt in" to individual enrichments rather than paying the price of running all of them.
As Pipeline 2.0 can run alongside Pipeline 1.0, you have a couple of options:
Because Pipeline 2.0 does not run any of the built-in enrichments, switching existing subscriptions may have unintended side effects, as your application likely relies on their output. In such cases, we recommend contacting Squirro support if you are interested in speeding up the import of existing data subscriptions.
Pipeline 2.0 relies on a file system to queue data in batches before Squirro inserts the data into Elasticsearch in bulk. Options are:
Internally, Pipeline 2.0 is known as the "Ingester", or specifically the sqingesterd service. To have all new subscriptions import data via Pipeline 2.0, configure the Squirro API Provider service (the service that all Squirro data imports interact with) like so:
[provider]
# new pipeline configs:
# processing_mode controls the transition to the ingester ("new pipeline")
# modes: legacy (all sources go through only old pipeline and bulk pipeline),
#        tee (all sources go through legacy & new pipelines for "shadow-testing"),
#        default_legacy (by default sources go through legacy pipelines,
#                        but can be overridden in source config to new pipeline),
#        default_ingester (by default sources go through new pipeline, but can be
#                          overridden in source config to legacy pipelines),
#        ingester (all sources exclusively go through new pipeline)
processing_mode = default_ingester
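If you roll this change out across several hosts, the edit can be scripted. A minimal sketch using Python's configparser is shown below; it operates on an inline sample of the `[provider]` section for illustration, and the idea of editing the provider service's INI file programmatically (rather than by hand) is our assumption, not a Squirro-provided tool.

```python
import configparser

# Sample [provider] section; on a real host you would config.read() the
# provider service's INI file instead (path varies by deployment).
config = configparser.ConfigParser()
config.read_string("""
[provider]
processing_mode = legacy
""")

# Route all new subscriptions through the ingester ("Pipeline 2.0").
config.set("provider", "processing_mode", "default_ingester")

print(config.get("provider", "processing_mode"))  # default_ingester
```

After changing the setting, the provider service needs to be restarted for the new mode to take effect.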
Ensure that /var/lib/squirro/inputstream has enough space. At a minimum, 10 GB of free space is required. The following setting in /etc/squirro/common.ini ensures that Pipeline 2.0 does not fill up the disk and leaves at least 10 GB of space for other users of the same disk volume.
[content_filesystem_stream]
# directories for pipeline 2.0 queued data
source_metadata_directory = /var/lib/squirro/inputstream
data_directories = %(source_metadata_directory)s
# number of gigabytes to require at least to continue writing to the file system
back_off_when_data_disk_space_falls_below_in_gigabytes = 10
back_off_when_metadata_disk_space_falls_below_in_gigabytes = %(back_off_when_data_disk_space_falls_below_in_gigabytes)s
If you elect not to use the local file system, change source_metadata_directory and data_directories accordingly.
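Note that data_directories is defined via INI interpolation, so pointing source_metadata_directory at a different volume relocates both in one change. A quick demonstration with configparser (the /mnt/nfs/squirro/inputstream path is an example mount point, not a requirement):

```python
import configparser

# data_directories references source_metadata_directory through
# %(...)s interpolation, so changing one value moves both directories.
config = configparser.ConfigParser()
config.read_string("""
[content_filesystem_stream]
source_metadata_directory = /mnt/nfs/squirro/inputstream
data_directories = %(source_metadata_directory)s
""")

print(config.get("content_filesystem_stream", "data_directories"))
# /mnt/nfs/squirro/inputstream
```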
As in the previous section, ensure that there is enough space for temporarily queued data files.
Then follow the steps described under Processing Config to choose Pipeline 2.0 when creating a subscription.
We are in the process of updating our Data Import Tools to let you choose Pipeline 2.0 more conveniently, starting with the Squirro Data Loader.