...


The Squirro Pipeline 2.0 allows you to import large amounts of data in a resource-friendly way, so that Squirro remains responsive even during high-volume data inflow. Pipeline 2.0 is available starting with Squirro version 2.5.1 and will become the default way to import data into Squirro with release 2.5.3 later in 2017. This document describes the steps to follow to use Pipeline 2.0 to ingest large amounts of data into Squirro installations running version 2.5.1 or 2.5.2.

When to use Pipeline 2.0

Whenever you have hundreds of millions of documents to import, or if you don't need all the built-in enrichments that come at a processing cost and slow down data ingestion, Pipeline 2.0 is for you. A key feature of Pipeline 2.0 is that you "opt in" to individual enrichments rather than paying the cost of running all of them.

Because Pipeline 2.0 can run alongside Pipeline 1.0, you have a couple of options:

  1. Enable Pipeline 2.0 Squirro Server-wide: This is the best option if you are starting brand-new Squirro projects on a dedicated Squirro installation, need to ingest large data volumes, and plan to use the Squirro Data Loader to import data.
  2. Enable Pipeline 2.0 for individual projects or subscriptions: This may be an option for existing projects that you plan to augment with additional, large data subscriptions. In this case, you specify this in the Processing Config of the newly created subscriptions.

Because Pipeline 2.0 does not run any of the built-in enrichments by default (until you opt in to each enrichment individually), switching existing subscriptions may have unintended side effects: your application and your users have likely come to rely on the behavior introduced by the enrichments. In such cases we recommend contacting Squirro support if you are interested in speeding up the import of existing data subscriptions.
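To illustrate the opt-in model, a subscription's Processing Config enables only the enrichments you explicitly ask for. The sketch below is a hypothetical example expressed as a Python dictionary; the key and enrichment names in it are assumptions for illustration only, and the authoritative Processing Config schema for your release is defined in the Squirro documentation.

    # Illustrative sketch only: the key and enrichment names below are
    # assumptions, not the authoritative Squirro schema. Consult the
    # Squirro 2.5.x documentation for the exact Processing Config format.
    processing_config = {
        "pipeline": 2,                 # assumed switch selecting Pipeline 2.0
        "deduplication": {
            "enabled": True,           # example of opting in to one enrichment
        },
        "language-detection": {
            "enabled": False,          # everything else stays off unless enabled
        },
    }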

...

Pipeline 2.0 relies on a file system to queue data in batches before Squirro inserts the data into ElasticSearch in bulk. The options for storing these temporary data files are:

  1. Local file system: This is the default out-of-the-box configuration, with queued data placed under /var/lib/squirro/inputstream/ . We recommend using RAID-1 or some other form of redundancy to guard against loss of this temporary data. The IO access generated by Pipeline 2.0 consists of sequential reads and writes, consistent with log-structured storage: data is appended sequentially "to the end" of the queue and is read and deleted "from the beginning" of the queue as it is inserted into ElasticSearch (see the sketch after this list).

    Note that in the case of multiple Squirro Cluster Node installations, the queue and file system are implicitly "sharded", with a different subset of the data going to different servers and disks. This helps scale the data capacity of Pipeline 2.0, much like you can scale ElasticSearch by adding more Indexing Servers.

  2. Amazon Elastic Block Storage: For cloud installations hosted in AWS, Amazon EBS is a suitable choice.
  3. Network attached storage: Although NAS will work, it is expensive and therefore likely not the first choice.
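To make the sequential access pattern described in option 1 concrete, the following is a minimal, generic sketch of a file-backed batch queue in Python: producers append numbered batch files at the end, and a consumer reads and deletes the oldest file once its contents have been handed over for bulk insertion. It illustrates the log-structured pattern only and is not Squirro's actual implementation.

    import os

    class LogStructuredQueue:
        """Minimal sketch of a file-backed FIFO batch queue: producers append
        batch files "at the end", the consumer reads and deletes "from the
        beginning". Illustration only, not Squirro's implementation."""

        def __init__(self, directory):
            self.directory = directory
            os.makedirs(directory, exist_ok=True)
            existing = sorted(f for f in os.listdir(directory) if f.endswith(".batch"))
            self.sequence = int(existing[-1].split(".")[0]) + 1 if existing else 0

        def enqueue(self, batch_bytes):
            # Sequential write: each batch becomes a new file at the "end" of the queue.
            path = os.path.join(self.directory, "%012d.batch" % self.sequence)
            with open(path, "wb") as fh:
                fh.write(batch_bytes)
            self.sequence += 1

        def dequeue(self):
            # Sequential read: consume the oldest batch file, then delete it once
            # it has been handed over (e.g. bulk-inserted into the index).
            batches = sorted(f for f in os.listdir(self.directory) if f.endswith(".batch"))
            if not batches:
                return None
            path = os.path.join(self.directory, batches[0])
            with open(path, "rb") as fh:
                data = fh.read()
            os.remove(path)
            return data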

...

Enabling Pipeline 2.0 for Individual Subscriptions

As in the previous section, ensure that there is enough space and sequential IO capacity for the temporarily queued data files.
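As a quick sanity check before starting a large import, you can verify the free space on the volume that holds the queue directory. A minimal Python sketch, assuming the default queue path from the previous section:

    import shutil

    # Default Pipeline 2.0 queue location from the previous section; adjust
    # this path if your installation queues data elsewhere.
    QUEUE_DIR = "/var/lib/squirro/inputstream/"

    usage = shutil.disk_usage(QUEUE_DIR)
    print("Free space for queued data: %.1f GB" % (usage.free / 1024 ** 3))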

...