
Introduction

Pipeline workflows support reprocessing data that has already been processed once.

...

Rerun from Raw Data

The rerun from raw data uses the data in the form in which it was retrieved from the actual data source by a Squirro data loader plugin.

However, please note that this mode is not always available, as it requires the raw data to be stored. It is also not guaranteed that all the raw data from the data sources of the pipeline workflow can be used, because a configurable data retention policy periodically removes this data from the server.

When a data loader plugin retrieves data, the raw data is stored on the Squirro server and added into the ingester's queue.

The ingester service processes the data of a source in one or more batches. When the ingester processes a batch, it moves it to the processed sub-directory of its filesystem queue (default location: /var/lib/squirro/inputstream/processed). When you invoke the Rerun from Raw Data mode of a pipeline workflow, the batches of all the sources that this pipeline is configured to process are looked up in the processed directory and re-queued for processing.
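If you want to check whether a raw-data rerun will have anything to work with, one simple approach is to inspect the processed sub-directory directly on the Squirro server. This is only a sketch: the batch file names and exact sub-directory layout may differ between Squirro versions, and the path below is the default queue location mentioned above.

# list the processed batches that are still available for a raw-data rerun
ls -lh /var/lib/squirro/inputstream/processed

# see how much disk space the retained batches occupy
du -sh /var/lib/squirro/inputstream/processed

If the directory is empty, the batches have either been removed by the retention policy described below or were never kept (see keep_processed_data).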

There are three configuration options that control this rerun mode; they can be found in the /etc/squirro/common.ini file.

# whether to keep or not successfully processed data batches
keep_processed_data = true

# number of days and hours we keep around any processed data batches
# total time is days + hours
days_to_retain_processed_batches = 0
hours_to_retain_processed_batches = 1

If keep_processed_data is set to false, then no batches are moved into the processed sub-directory.

The options days_to_retain_processed_batches and hours_to_retain_processed_batches control how long the batches in the processed directory are kept; the total retention time is the sum of the two values. When a batch exceeds this threshold, it is removed from the processed directory.
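As an illustrative sketch (the values below are examples, not recommendations), keeping processed batches for a total of 60 hours, i.e. 2 days plus 12 hours, would look like this in /etc/squirro/common.ini:

# keep successfully processed data batches
keep_processed_data = true

# total retention time = 2 days + 12 hours = 60 hours
days_to_retain_processed_batches = 2
hours_to_retain_processed_batches = 12

Such a change typically only takes effect after the ingester service has been restarted.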

Therefore, for the Rerun from Raw Data mode to be available, the batches of the data sources you want to rerun need to be present in the processed directory. When this mode is available, it is the recommended option, as it ensures a clean rerun.

For example, if you have a pipeline workflow with a sentence-level ML model and you have ingested some data with it, you might have generated entities for your items. If you now perform a rerun without this ML model in your pipeline, the entities will be deleted from your items. Similarly, if you perform a rerun with a different ML model in your pipeline, the old entities will be removed and replaced with the entities produced by the new model.

Rerun from Index

The rerun from index uses the actual Squirro items that are stored in the storage layer of your Squirro instance.

...