Introduction

Pipeline Workflows support reprocessing data that has already been processed once.

Every workflow can be used by zero, one, or multiple data sources. The rerun functionality of a workflow takes as input the data that these data sources have already retrieved and ingested.

...

For example, the above screenshot shows a workflow for processing Binary Documents that is configured to be used by four data sources. The rerun functionality will therefore use the processed data of these four sources as input.

This functionality is useful when you modify your workflow, for example by adding a new step or changing the configuration of an existing step, and want to execute it against the same set of data. The rerun may produce a different set of Squirro Items that is more relevant to your needs.

To rerun a workflow, click its three dots menu and select the Rerun option.

...

When you click on the Rerun option, a popup window is displayed where you can choose which rerun mode you would like to invoke.

...

Rerun Modes

There are two rerun modes available:

Rerun from Raw Data
Rerun from Index

Rerun from Raw Data

The rerun from raw data uses the data in the form in which a Squirro data loader plugin retrieved it from the actual data source and added it to the ingester's queue.

The ingester service processes the data of a source in one or more batches. Once the ingester has processed a batch, it moves the batch to the processed sub-directory of its filesystem queue (default location: /var/lib/squirro/inputstream/processed). When you invoke the Rerun from Raw Data mode of a pipeline workflow, the batches of all the sources that this workflow is configured to process are looked up in the processed directory and re-queued for processing.
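The look-up and re-queue step described above can be pictured with a small simulation. This is only a sketch under stated assumptions, not the ingester's actual implementation: the one-file-per-batch layout, the `<source_id>__<batch_no>.batch` file naming, and the `requeue_batches` helper are all hypothetical and chosen purely for illustration. The example runs inside a temporary directory and does not touch a real queue.

```python
import shutil
import tempfile
from pathlib import Path


def requeue_batches(processed_dir, queue_dir, source_ids):
    """Move the batches of the given sources from processed_dir back to queue_dir.

    Hypothetical layout: one file per batch, named "<source_id>__<batch_no>.batch".
    """
    requeued = []
    # sorted() materializes the listing first, so moving files is safe.
    for batch in sorted(processed_dir.glob("*.batch")):
        source_id = batch.name.split("__", 1)[0]
        if source_id in source_ids:
            shutil.move(str(batch), str(queue_dir / batch.name))
            requeued.append(batch.name)
    return requeued


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the filesystem queue and its processed sub-directory.
    queue = Path(tmp) / "inputstream"
    processed = queue / "processed"
    processed.mkdir(parents=True)
    for name in ("srcA__0001.batch", "srcA__0002.batch", "srcB__0001.batch"):
        (processed / name).write_text("{}")

    # The workflow in this example is configured for source "srcA" only.
    moved = requeue_batches(processed, queue, {"srcA"})
    remaining = sorted(p.name for p in processed.iterdir())
    print(moved)
    print(remaining)
```

In this simulation only the batches of the selected source return to the queue; batches that belong to other sources stay in the processed directory.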

Three configuration options control this rerun mode. They are set in the /etc/squirro/common.ini file.

# whether to keep successfully processed data batches
keep_processed_data = true

# number of days and hours we keep around any processed data batches
# total time is days + hours
days_to_retain_processed_batches = 0
hours_to_retain_processed_batches = 1

If keep_processed_data is false, no batches are moved into the processed sub-directory.

The options days_to_retain_processed_batches and hours_to_retain_processed_batches control how long batches are kept in the processed directory. The retention period is the sum of the two values. When the age of a batch exceeds this threshold, the batch is removed from the processed directory.
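As a rough illustration of how the three options combine, the following sketch parses them with Python's standard `configparser` and computes the retention period as days plus hours. The section name `ingester`, the helper names, and the sample timestamps are assumptions made for this example; the section of common.ini in which the options live is not stated here.

```python
import configparser
from datetime import datetime, timedelta

# Same three options as in the configuration snippet. The section name
# "ingester" is a placeholder assumption, not taken from the documentation.
SAMPLE_INI = """
[ingester]
keep_processed_data = true
days_to_retain_processed_batches = 0
hours_to_retain_processed_batches = 1
"""


def retention_window(ini_text, section="ingester"):
    """Return the retention period as a timedelta, or None if batches are not kept."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.getboolean(section, "keep_processed_data"):
        return None
    days = parser.getint(section, "days_to_retain_processed_batches")
    hours = parser.getint(section, "hours_to_retain_processed_batches")
    # Total time is days + hours.
    return timedelta(days=days, hours=hours)


def is_expired(processed_at, now, window):
    """A batch is unavailable if nothing is kept or its age exceeds the window."""
    return window is None or (now - processed_at) > window


window = retention_window(SAMPLE_INI)
now = datetime(2024, 1, 1, 12, 0)
print(window)                                                   # 1:00:00
print(is_expired(datetime(2024, 1, 1, 11, 30), now, window))    # False
print(is_expired(datetime(2024, 1, 1, 10, 30), now, window))    # True
```

With the values shown, a batch processed 30 minutes ago is still within the one-hour window, while a batch processed 90 minutes ago is past it.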

Therefore, the Rerun from Raw Data mode is available only if the batches of the data sources you want to rerun are still present in the processed directory. When this mode is available, it is the recommended one, because it ensures a clean rerun.

For example, if you have a Pipeline Workflow with a sentence-level ML model and you have ingested some data with it, the model may have generated entities for your items. If you now perform a rerun without this ML model in your pipeline, the entities will be deleted from your items. Similarly, if you perform a rerun with a different ML model in your pipeline, the old entities will be removed and replaced by any entities that the new model generates.

Rerun from Index

The rerun from index uses the data in the form of Squirro Items, as they are found in the Elasticsearch index of your Squirro project.

This rerun mode is always used when you rerun a single pipeline step, but it is also offered as an option when you want to rerun the whole pipeline workflow.

This mode is always available. However, it is considered less stable than Rerun from Raw Data: when the workflow includes a non-idempotent step, the resulting Squirro Item will differ from the one produced by the original ingestion, even if the workflow has not changed. As an example, consider a Pipelet that modifies the title of an item by appending an exclamation mark. If we invoke the rerun functionality of a workflow on this item n times, the title of the resulting Squirro Item will have gained n exclamation marks.

This page can now be found at Pipeline Reruns on the Squirro Docs site.
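The non-idempotency caveat of Rerun from Index can be reproduced with plain Python functions. This is a minimal sketch, assuming an item can be modelled as a dictionary with a `title` key. It does not use the Squirro SDK, and the names `append_exclamation`, `ensure_exclamation`, and `rerun_from_index` are hypothetical.

```python
import copy


def append_exclamation(item):
    """Non-idempotent step: every run changes the title again."""
    item["title"] = item["title"] + "!"
    return item


def ensure_exclamation(item):
    """Idempotent variant: running it again leaves the title unchanged."""
    if not item["title"].endswith("!"):
        item["title"] = item["title"] + "!"
    return item


def rerun_from_index(indexed_item, step, times):
    """Feed the step its own previous output, as a rerun from the index would."""
    item = copy.deepcopy(indexed_item)
    for _ in range(times):
        item = step(item)
    return item


indexed = {"title": "Quarterly report"}
print(rerun_from_index(indexed, append_exclamation, 3)["title"])  # Quarterly report!!!
print(rerun_from_index(indexed, ensure_exclamation, 3)["title"])  # Quarterly report!
```

A step that checks whether its change has already been applied, like the second function, yields the same item no matter how many times the workflow is rerun from the index.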
