Introduction
Pipeline Workflows support reprocessing data that have already been processed once.
...
Rerun from Raw Data
The rerun from raw data uses the data in the form in which a Squirro data loader plugin retrieved them from the actual data source.
Note, however, that this mode is not always available, because it requires the raw data to be stored. Nor is it guaranteed that all the raw data from the data sources of the pipeline workflow can be used: a configurable data retention policy periodically removes these data from the server.
When it and added into ingester’s queue.
The ingester service processes the data of a source in one or more batches. When the ingester processes a batch, it moves it to the processed sub-directory of its filesystem queue (default location: /var/lib/squirro/inputstream/processed). When you invoke the Rerun from Raw Data mode of a pipeline workflow, the batches of all the sources that this pipeline is configured to process are looked up in the processed directory and re-queued for processing.
Three configuration options control this rerun mode. They can be found in the /etc/squirro/common.ini file.
```ini
# whether to keep or not successfully processed data batches
keep_processed_data = true
# number of days and hours we keep around any processed data batches
# total time is days + hours
days_to_retain_processed_batches = 0
hours_to_retain_processed_batches = 1
```
If keep_processed_data is false, then no batches are moved into the processed sub-directory.
The options days_to_retain_processed_batches and hours_to_retain_processed_batches control how long batches are kept in the processed directory. When a batch exceeds this threshold, it is removed from the processed directory.
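The retention arithmetic can be illustrated with a minimal Python sketch. It only mirrors the "total time is days + hours" rule from the configuration comments. The function names are hypothetical, and how the ingester actually measures the age of a batch is an assumption here, not something this page documents.

```python
from datetime import datetime, timedelta


def retention_window(days: int, hours: int) -> timedelta:
    """Total retention time is days + hours, as stated in the config comments."""
    return timedelta(days=days, hours=hours)


def is_expired(processed_at: datetime, now: datetime, days: int, hours: int) -> bool:
    """Hypothetical check: True once a batch is older than the retention window.

    Assumption: the age of a batch is counted from the moment it was processed.
    """
    return now - processed_at > retention_window(days, hours)
```

With days_to_retain_processed_batches = 0 and hours_to_retain_processed_batches = 1, the values shown in the configuration snippet, a batch processed 30 minutes ago would still be retained under this sketch, while one processed two hours ago would be past the threshold.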
Therefore, for the Rerun from Raw Data mode to be available, the batches of the data sources you want to rerun need to be found in the processed directory. When this mode is available, using it is recommended because it ensures a clean rerun.
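As a rough way to check whether anything is still retained, the following Python sketch counts the entries under the processed directory. It assumes the default location quoted above and assumes that retained batches appear as file-system entries directly below it. The internal layout of that directory is not described on this page, so the helper (a hypothetical name) cannot tell which source a batch belongs to.

```python
from pathlib import Path

# Default location quoted in the text; adjust it if your installation differs.
PROCESSED_DIR = Path("/var/lib/squirro/inputstream/processed")


def count_retained_entries(processed_dir: Path = PROCESSED_DIR) -> int:
    """Count the entries directly under the processed directory.

    Assumption: each retained batch shows up as one or more entries here.
    Returns 0 if the directory does not exist.
    """
    if not processed_dir.is_dir():
        return 0
    return sum(1 for _ in processed_dir.iterdir())
```

A result of 0 suggests that either keep_processed_data is false or the retention window has already elapsed, in which case the Rerun from Raw Data mode is unlikely to be available.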
For example, if you have a Pipeline Workflow with a sentence-level ML model, and you have ingested some data with it, you might have generated entities for your items. If you now perform a rerun without this ML model in your pipeline, the entities will be deleted from your items. Similarly, if you perform a rerun with a different ML model in your pipeline, the old entities will be removed and the entities from your new model will take their place.
Rerun from Index
The rerun from index utilizes the actual Squirro Items which are stored in the storage layer of your Squirro instance.
...