Motivation
Pipelets, like other enrichments, are only executed on items that are loaded into a project after the pipelet has been configured. However, it is often desirable to run the pipelet on all previously loaded items without having to reload them into Squirro. On the command line, this can be achieved using the pipelet rerun command.
For an easier way to achieve this directly in the user interface, please see Rerun Enrichments.
Command
The basic command to rerun a pipelet is:
```
pipelet rerun --cluster CLUSTER --token TOKEN --project-id PROJECT mypipelet.py
```
This executes the pipelet contained in mypipelet.py on all the items in the given project (see Connecting to Squirro for the cluster, token and project options).
The rerun can be limited to a subset of the project's items by specifying a query (the connection options are omitted from here on for brevity):
```
pipelet rerun --query 'big data' mypipelet.py
```
To pass in configuration that the pipelet needs, use the --config option, which takes a JSON string:
```
pipelet rerun --config '{"file":"test.txt"}' mypipelet.py
```
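Quoting a JSON string by hand on the command line is error-prone. As a minimal sketch, the Python snippet below assembles the same call programmatically. The helper name build_rerun_command is hypothetical, and only the options shown on this page are used.

```python
import json
import shlex


def build_rerun_command(pipelet_file, config=None, query=None):
    """Return the argument list for a `pipelet rerun` call.

    Hypothetical helper: it only assembles the options shown on this page
    (--query and --config); connection options are omitted for brevity.
    """
    args = ["pipelet", "rerun"]
    if query is not None:
        args += ["--query", query]
    if config is not None:
        # json.dumps guarantees a syntactically valid JSON string.
        args += ["--config", json.dumps(config)]
    args.append(pipelet_file)
    return args


command = build_rerun_command("mypipelet.py", config={"file": "test.txt"})
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The returned list can also be handed to subprocess.run(command, check=True), which avoids shell quoting altogether.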
Limitations
Pipelet rerunning is implemented using the Update Item API. Because of this, the only changes that can be applied to an item are changes in the keywords. It is currently not possible to update any of the other item fields when rerunning a pipelet.
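Because only keyword changes survive a rerun, it can help to check locally which other fields a pipelet modifies. The following is a minimal sketch using plain dictionaries in place of real items. The function non_keyword_changes and the stand-in consume function are hypothetical and not part of the Squirro SDK.

```python
import copy


def non_keyword_changes(before, after):
    """Return the sorted names of changed fields other than 'keywords'."""
    fields = set(before) | set(after)
    return sorted(
        name for name in fields
        if name != "keywords" and before.get(name) != after.get(name)
    )


def consume(item):
    # Stand-in for a pipelet's consume(): edits the title and adds a keyword.
    item["title"] = item.get("title", "").strip()
    item.setdefault("keywords", {})["checked"] = ["yes"]
    return item


original = {"title": " Big data ", "body": "Costs $5"}
changed = consume(copy.deepcopy(original))

# Fields listed here would not be updated by a rerun.
lost = non_keyword_changes(original, changed)
print(lost)
```

In this sample the title edit is reported, while the new keyword is not, since keyword changes are the ones a rerun can apply.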
Versioning
A common use of pipelet rerun is to change the way some keywords are calculated. To easily update the data, it is recommended to introduce a separate keyword for the pipelet's version. This way, the version can be incremented when the logic is improved, and the rerun command can be applied to all older items.
Take for example this pipelet:
```python
import re

from squirro.sdk import PipeletV1, require

VERSION = 1


@require('log')
class PricePipelet(PipeletV1):
    """Extract the price of the item from the body.

    Searches for the first number prefixed with $ and uses that as the price.
    """

    def __init__(self, config):
        self.config = config

    def consume(self, item):
        body = item.get('body')
        kw = item.setdefault('keywords', {})
        # Always record which version of the logic processed the item.
        kw['price_version'] = [VERSION]
        if not body:
            return item

        # Raw string so the backslashes reach the regex engine unchanged.
        match = re.search(r'\$(\d+)', body)
        if not match:
            return item

        # Keyword values are lists, as with price_version above.
        kw['price'] = [int(match.group(1))]
        return item
```
This sets a price_version facet to the number 1 (the facet should be declared in the project as being a numeric facet).
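The extraction logic can be tried out without a Squirro installation. The sketch below repeats the matching step of consume in a plain function and applies it to a few sample bodies. The name extract_price and the sample texts are made up for illustration.

```python
import re

# Same pattern as in the pipelet: a dollar sign followed by digits.
PRICE_PATTERN = re.compile(r"\$(\d+)")


def extract_price(body):
    """Return the first $-prefixed whole number in body, or None."""
    if not body:
        return None
    match = PRICE_PATTERN.search(body)
    if not match:
        return None
    return int(match.group(1))


samples = ["Now only $25 per seat", "Free entry", None, "From $10 to $99"]
prices = [extract_price(body) for body in samples]
print(prices)
```

Note that the pattern only captures the digits directly after the dollar sign, so a body containing $19.99 yields 19. Changing that behavior is the kind of logic update that calls for a new version number.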
Now when the pipelet is updated, the version can be incremented to VERSION = 2. Then rerun can be called as follows:
```
pipelet rerun --cluster CLUSTER --token TOKEN --project-id PROJECT --query '-price_version:2' price_pipelet.py
```
This runs the pipelet on all items that do not have price_version set to the value 2: either the value hasn't been set at all, or it's still on a different version.
This page can now be found at Rerunning a Pipelet on the Squirro Docs site.