The pipelet rerun command can be used to enrich previously loaded items using Pipelets.

Motivation

Pipelets, like other enrichments, are only executed on items that are loaded into a project after the pipelet has been configured. However, it is often desirable to run the pipelet on all previously loaded items without having to reload them into Squirro. On the command line, this can be achieved with the pipelet rerun command.

For an easier way to achieve this directly in the user interface, see Rerun Enrichments.

Command

The basic command to rerun a pipelet is:

Code Block
languagebash
pipelet rerun --cluster CLUSTER --token TOKEN --project-id PROJECT mypipelet.py

This executes the pipelet contained in mypipelet.py on all the items in the given project (see Connecting to Squirro for the cluster, token and project options).

It is possible to limit the rerun to a subset of the project's items by specifying a query (the connection options are omitted here for brevity):

Code Block
languagebash
pipelet rerun --query 'big data' mypipelet.py

To pass in configuration that the pipelet needs, use the --config parameter, which takes a JSON string:

Code Block
languagebash
pipelet rerun --config '{"file":"test.txt"}' mypipelet.py
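
Writing JSON by hand inside shell quotes is easy to get wrong. As a minimal sketch, the argument can be generated with Python's standard library instead. The helper name build_rerun_command is hypothetical and not part of the Squirro toolkit; it only uses the --query and --config options shown above and omits the connection options, as the examples on this page do.

```python
import json
import shlex


def build_rerun_command(pipelet_file, config=None, query=None):
    """Return a shell-safe `pipelet rerun` command line (hypothetical helper)."""
    args = ["pipelet", "rerun"]
    if query is not None:
        args += ["--query", query]
    if config is not None:
        # --config expects a JSON string, so serialise the dict here.
        args += ["--config", json.dumps(config)]
    args.append(pipelet_file)
    # Quote every argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_rerun_command("mypipelet.py", config={"file": "test.txt"})
print(command)
```

The printed line can be pasted into a shell as is. Generating the JSON with json.dumps means nested quotes or special characters in the configuration are escaped consistently.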

Limitations

Pipelet rerunning is implemented using the Update Item API. Because of this, the only changes that can be applied to an item are changes in the keywords. It is currently not possible to update any of the other item fields when rerunning a pipelet.
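
To illustrate what this limitation means in practice, the following sketch models an item as a plain dictionary and extracts only the keyword changes a pipelet made. It is not Squirro's implementation. The names keyword_changes and fake_pipelet and the sample item are made up for this example.

```python
import copy


def keyword_changes(original, processed):
    """Return only the keyword entries that differ after running a pipelet."""
    before = original.get("keywords", {})
    after = processed.get("keywords", {})
    return {name: values for name, values in after.items()
            if before.get(name) != values}


def fake_pipelet(item):
    # Stand-in for a pipelet's consume(): changes a keyword AND the title.
    item = copy.deepcopy(item)
    item["title"] = item["title"].upper()
    item.setdefault("keywords", {})["language"] = ["en"]
    return item


item = {"title": "hello", "keywords": {"source": ["web"]}}
changes = keyword_changes(item, fake_pipelet(item))
print(changes)  # only the keyword change appears; the title change is left out
```

In this model the changed title never reaches the stored item, which corresponds to the behaviour described above: a pipelet that modifies fields other than keywords will appear to have no effect on those fields when rerun.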

Versioning

A common use of pipelet rerun is to change the way some keywords are calculated. To easily update the data, it is recommended to introduce a separate keyword for the pipelet's version. This way, the version can be incremented when the logic is improved, and the rerun command can be applied to all older items.

Take for example this pipelet:

Code Block
languagepy
import re

from squirro.sdk import PipeletV1, require


VERSION = 1


@require('log')
class PricePipelet(PipeletV1):
    """Extract the price of the item from the body.

    Searches for the first number prefixed with $ and uses that as the price.
    """
    def __init__(self, config):
        self.config = config

    def consume(self, item):
        body = item.get('body')
        kw = item.setdefault('keywords', {})
        kw['price_version'] = [VERSION]

        if not body:
            return item

        match = re.search(r'\$(\d+)', body)
        if not match:
            return item

        kw['price'] = [int(match.group(1))]
        return item

This sets a price_version facet to the number 1 (the facet should be declared in the project as being a numeric facet).
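
Before rerunning on a whole project, the extraction logic can be checked locally. The sketch below follows the steps of consume() on plain dictionaries, without importing squirro.sdk, so it runs anywhere. The function name apply_price_keywords and the sample bodies are made up for this example. Storing the price as a one-element list, in the same way as price_version, is an assumption about the keyword format.

```python
import re

VERSION = 1


def apply_price_keywords(item):
    """Follow the steps of PricePipelet.consume on a plain dict (sketch)."""
    kw = item.setdefault("keywords", {})
    kw["price_version"] = [VERSION]
    # A missing or empty body simply yields no match.
    match = re.search(r"\$(\d+)", item.get("body") or "")
    if match:
        # Assumption: keyword values are lists, as with price_version.
        kw["price"] = [int(match.group(1))]
    return item


with_price = apply_price_keywords({"body": "Now only $25 per seat"})
without_price = apply_price_keywords({"body": "Price on request"})
print(with_price["keywords"])
print(without_price["keywords"])
```

Both items receive the price_version keyword, even when no price is found. This means that items without a price are also excluded by a later version-based rerun query once they have been processed.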

Now when the pipelet is updated, the version can be incremented to VERSION = 2. Then rerun can be called as follows:

Code Block
pipelet rerun --cluster CLUSTER --token TOKEN --project-id PROJECT --query '-price_version:2' price_pipelet.py

This runs the pipelet on all items that do not have price_version set to the value 2: either the value has not been set at all, or it is still on a different version.

This page can now be found at Rerunning a Pipelet on the Squirro Docs site.