Introduction
Pipeline Workflows support reprocessing data that has already been processed once.
Every workflow can be used by zero, one, or more data sources. The rerun functionality of a workflow takes as input the data from these data sources that has already been retrieved and ingested once.
...
For example, in the above screenshot, a workflow for processing Binary Documents is configured to be used by four data sources. The rerun functionality will therefore use the processed data of these four sources as input.
This functionality is useful when you modify your workflow, for example by adding a new step or changing the configuration of an existing step, and want to execute it against the same set of data to retrieve a potentially different set of Squirro Items that may be more relevant to your needs.
To invoke the rerun of a workflow, click its three-dots menu and select the Rerun option.
...
When you click the Rerun option, a popup window is displayed where you can choose which rerun mode to invoke.
...
Rerun Modes
There are two rerun modes available:
Rerun from Raw Data
Rerun from Index
Rerun from Raw Data
The rerun from raw data uses the data in the form in which it was retrieved from the actual data source by a Squirro data loader plugin and added to the ingester's queue.
The ingester service processes the data of a source in one or more batches. When the ingester processes a batch, it moves it to the processed sub-directory of its filesystem queue (default location: /var/lib/squirro/inputstream/processed). When you invoke the Rerun from Raw Data mode of a pipeline workflow, the batches of all the sources that this pipeline is configured to process are looked up in the processed directory and re-queued for processing.
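The move-then-requeue behaviour described above can be pictured with a small sketch. Everything about the on-disk layout here is hypothetical and chosen only for illustration: the `incoming` directory next to `processed`, the per-source sub-directories, the `.batch` file names, and the choice to move rather than copy batches. The real ingester queue may be organised differently.

```python
import shutil
import tempfile
from pathlib import Path


def mark_processed(queue_root: Path, source_id: str, batch_name: str) -> Path:
    """Move a batch from the (hypothetical) incoming area into processed/."""
    src = queue_root / "incoming" / source_id / batch_name
    dst_dir = queue_root / "processed" / source_id
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(dst_dir / batch_name)))


def requeue_processed(queue_root: Path, source_ids: list[str]) -> list[Path]:
    """Re-queue every retained batch that belongs to the given sources."""
    requeued = []
    for source_id in source_ids:
        processed_dir = queue_root / "processed" / source_id
        if not processed_dir.is_dir():
            continue  # nothing retained for this source, so nothing to rerun
        incoming_dir = queue_root / "incoming" / source_id
        incoming_dir.mkdir(parents=True, exist_ok=True)
        for batch in sorted(processed_dir.iterdir()):
            target = incoming_dir / batch.name
            requeued.append(Path(shutil.move(str(batch), str(target))))
    return requeued


def demo() -> list[str]:
    """Process one batch for two sources, then rerun only one of them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for source_id in ("source_a", "source_b"):
            incoming = root / "incoming" / source_id
            incoming.mkdir(parents=True)
            (incoming / "0001.batch").write_text("raw data")
            mark_processed(root, source_id, "0001.batch")
        # Only source_a is configured for the workflow in this demo.
        moved = requeue_processed(root, ["source_a"])
        return [p.relative_to(root).as_posix() for p in moved]
```

In this sketch only the batches of sources attached to the workflow are re-queued, which mirrors the statement that a rerun covers all the sources the pipeline is configured to process.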
Three configuration options control this rerun mode. They can be found in the /etc/squirro/common.ini file.
    # whether to keep or not successfully processed data batches
    keep_processed_data = true
    # number of days and hours we keep around any processed data batches
    # total time is days + hours
    days_to_retain_processed_batches = 0
    hours_to_retain_processed_batches = 1
If keep_processed_data is false, no batches are moved into the processed sub-directory.
The options days_to_retain_processed_batches and hours_to_retain_processed_batches control how long batches are kept in the processed directory. When a batch exceeds this threshold, it is removed from the processed directory.
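As a rough illustration of how the three options combine, the sketch below parses the settings and computes the retention window as days plus hours. The section name `[example]` and the helper names `retention_window` and `is_expired` are assumptions made for this example; they are not taken from the actual common.ini layout or from Squirro's code.

```python
import configparser
from datetime import datetime, timedelta

# Hypothetical section name; the real section in common.ini may differ.
SAMPLE_INI = """
[example]
keep_processed_data = true
days_to_retain_processed_batches = 0
hours_to_retain_processed_batches = 1
"""


def retention_window(ini_text: str, section: str = "example"):
    """Return the retention period, or None if processed batches are not kept."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.getboolean(section, "keep_processed_data"):
        return None
    # Total retention time is days + hours.
    return timedelta(
        days=parser.getint(section, "days_to_retain_processed_batches"),
        hours=parser.getint(section, "hours_to_retain_processed_batches"),
    )


def is_expired(processed_at: datetime, now: datetime, window: timedelta) -> bool:
    """A batch is past the threshold once it is older than the window."""
    return now - processed_at > window
```

With the values shown in the configuration above, the window is one hour, so a batch processed 90 minutes ago would already be past the threshold and no longer available for a rerun from raw data.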
Therefore, for the Rerun from Raw Data mode to be available, the batches of the data sources you want to rerun must be present in the processed directory. When this mode is available, it is the recommended one, as it ensures a clean rerun.
For example, if you have a Pipeline Workflow with a sentence-level ML model, and you have ingested some data with it, you might have generated entities for your items. Now, if you perform a rerun without this ML model in your pipeline, then the entities will be deleted from your items. Similarly, if you perform a rerun with a different ML model in your pipeline, the old entities will be removed, and any entities from your new model will be there.
Rerun from Index
The rerun from index uses the data in the form of Squirro Items, as they are found in the Elasticsearch index of your Squirro project.
This rerun mode is always used when you rerun a single pipeline step, but it is also offered as an option when you want to rerun the whole pipeline workflow.
This mode is always available. However, it is not considered as stable as the other mode: when the workflow includes a non-idempotent step, the resulting Squirro Item will not be the same as the first time it was ingested, even if the workflow has not changed. As an example, consider a Pipelet that modifies the title of an item by appending an exclamation mark. If we invoke the rerun functionality of a workflow on this item n times, the resulting Squirro Item will contain n exclamation marks in its title.
This page can now be found at Pipeline Reruns on the Squirro Docs site.
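The effect of a non-idempotent step under Rerun from Index, as in the title-appending Pipelet example, can be sketched with plain Python functions. These are stand-ins written for illustration, not Squirro SDK Pipelet classes, and the item is modelled as a simple dict with a `title` key.

```python
def append_exclamation(item: dict) -> dict:
    """Non-idempotent step: every application changes the title again."""
    return {**item, "title": item["title"] + "!"}


def ensure_exclamation(item: dict) -> dict:
    """Idempotent variant: appends the mark only if it is missing."""
    title = item["title"]
    return item if title.endswith("!") else {**item, "title": title + "!"}


def apply_n_times(step, item: dict, n: int) -> dict:
    """Feed the output of each run back in as input, as a rerun from index does."""
    for _ in range(n):
        item = step(item)
    return item
```

Applying `append_exclamation` three times to an item titled "Hello" yields "Hello!!!", while `ensure_exclamation` yields "Hello!" no matter how often it runs. Writing steps in the second style is one way to keep reruns from index stable.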