
Excerpt
Pipelets are plugins to the Squirro pipeline, used to customize the data processing.

Overview

Items that are processed by Squirro go through a pipeline process before they are indexed. In that process a number of built-in enrichments are executed. On top of that, custom enrichment steps can be inserted in the form of pipelets. These pipelets are written in the Python programming language. Pipelets can be uploaded to the Squirro server and then be configured in the user interface (Enrichments tab) or through the API.

This reference documentation covers the basic workflow of working with pipelets and the interface that pipelets need to implement.

Writing Pipelets

Pipelets are written in Python. They need to inherit from the squirro.sdk.PipeletV1 class and implement the consume method. The simplest possible Pipelet looks like this:

Code Block
from squirro.sdk import PipeletV1

class NoopPipelet(PipeletV1):
    def consume(self, item):
        return item

As its name says, it does nothing but return the item unchanged. The item can be modified before it is returned. For example:

Code Block
from squirro.sdk import PipeletV1

class ModifyTitlePipelet(PipeletV1):
    def consume(self, item):
        item['title'] = item.get('title', '') + ' - Hello, World!'
        return item

This pipelet modifies each item it processes, appending the string " - Hello, World!" to the title. If an item has no title, the suffix is appended to an empty string. All of the item's fields can be modified. The available fields are documented in the Item Format reference.
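The consume method can be tried out locally by calling it with a plain dictionary. The following is a minimal sketch, assuming the Squirro SDK is not installed: PipeletV1 here is an empty local stand-in for squirro.sdk.PipeletV1, and the sample item with its id value is made up for illustration. The print call sits outside the pipelet class and is only meant for a local run.

```python
# Local stand-in so the sketch runs without the Squirro SDK (assumption).
# In a real pipelet, use: from squirro.sdk import PipeletV1
class PipeletV1(object):
    pass


class ModifyTitlePipelet(PipeletV1):
    def consume(self, item):
        # Append a suffix to the title; a missing title is treated as empty.
        item['title'] = item.get('title', '') + ' - Hello, World!'
        return item


# Hypothetical sample item; real items carry the fields from the Item Format.
sample = {'id': 'doc-1', 'title': 'Quarterly report'}
result = ModifyTitlePipelet().consume(sample)
print(result['title'])
```

Note that consume changes the dictionary it receives, so sample and result refer to the same object after the call.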

Returning multiple items

The pipelet is always called for each item individually. But in some use cases the pipelet should not just return one item but multiple ones. In those cases use the Python yield statement to return each individual item. For example:

Code Block

from squirro.sdk import PipeletV1
 
class ExtendTitlePipelet(PipeletV1):
    def consume(self, item):
        for i in range(10):
            new_item = dict(item)
            new_item['title'] = '{0} ({1})'.format(item.get('title', ''), i)
            yield new_item
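Because consume is a generator in this case, a caller obtains the items by iterating over the result. Keep in mind that dict(item) makes a shallow copy, so nested values such as a keywords dictionary are shared with the original item. The sketch below copies the keywords dictionary before adding a key to it. It uses a local stand-in for PipeletV1, and the pipelet name, the copy_number keyword and the sample item are hypothetical.

```python
# Local stand-in so the sketch runs without the Squirro SDK (assumption).
class PipeletV1(object):
    pass


class NumberedCopiesPipelet(PipeletV1):  # hypothetical example pipelet
    def consume(self, item):
        for i in range(3):
            new_item = dict(item)  # shallow copy: nested values are shared
            # Copy the keywords dictionary itself before adding a key,
            # so the source item is not altered.
            keywords = dict(item.get('keywords', {}))
            keywords['copy_number'] = [str(i)]
            new_item['keywords'] = keywords
            new_item['title'] = '{0} ({1})'.format(item.get('title', ''), i)
            yield new_item


source = {'title': 'Report', 'keywords': {'topic': ['finance']}}
# Iterating over the generator collects every yielded item.
copies = list(NumberedCopiesPipelet().consume(source))
```

Without the explicit copy of keywords, every yielded item and the source item would share one keywords dictionary, and the last assigned copy_number would show up in all of them.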

Dependencies

Pipelets are limited in what they can do. For example, the print statement is disallowed and you cannot import any external libraries except squirro.sdk. If you need access to external libraries, use the @require decorator. For example, to log some output:

Code Block
from squirro.sdk import PipeletV1, require

@require('log')
class LoggingPipelet(PipeletV1):
    def consume(self, item):
        self.log.debug('Processing item: %r', item['id'])
        return item

As the example shows, the @require decorator takes the name of a dependency. That dependency is then made available to the pipelet as an attribute of the same name (here self.log).
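When experimenting outside of Squirro, the @require decorator is not available, but a pipelet that uses a dependency through an attribute can still be exercised by setting that attribute by hand. The sketch below rests on assumptions: PipeletV1 is a local placeholder class, the logger name is made up, and the logger is assigned manually where Squirro would provide it through @require('log').

```python
import logging


# Local stand-in so the sketch runs without the Squirro SDK (assumption).
class PipeletV1(object):
    pass


class LoggingPipelet(PipeletV1):
    def consume(self, item):
        # item.get returns None instead of raising KeyError if 'id' is absent.
        self.log.debug('Processing item: %r', item.get('id'))
        return item


logging.basicConfig(level=logging.DEBUG)
pipelet = LoggingPipelet()
# Manual injection for local runs only (assumption); inside Squirro the
# attribute is provided by the @require('log') decorator.
pipelet.log = logging.getLogger('pipelet-demo')
processed = pipelet.consume({'id': 'doc-1', 'title': 'Example'})
```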

HTTP requests can be executed by using the requests dependency. The following pipelet shows an example for sentiment detection:

Code Block

from squirro.sdk import PipeletV1, require

@require('requests')
class SentimentPipelet(PipeletV1):
    def consume(self, item):
        text_content = ' '.join([item.get('title', ''),
                                 item.get('body', '')])
        res = self.requests.post('http://example.com/detect',
                                 data={'text': text_content},
                                 headers={'Accept': 'application/json'})
        sentiment = res.json()['sentiment']
        item.setdefault('keywords', {})['sentiment'] = [sentiment]
        return item
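The last step of consume stores the result with setdefault, which creates the keywords dictionary only if the item does not have one yet, so existing keywords are preserved. A small standalone sketch of that behavior, using a hypothetical helper name and made-up items:

```python
def add_sentiment(item, sentiment):
    """Store the sentiment as a keyword, keeping any existing keywords."""
    # setdefault returns the existing 'keywords' dict, or inserts and
    # returns a new empty one when the key is missing.
    item.setdefault('keywords', {})['sentiment'] = [sentiment]
    return item


with_keywords = add_sentiment(
    {'title': 'A', 'keywords': {'topic': ['finance']}}, 'positive')
without_keywords = add_sentiment({'title': 'B'}, 'negative')
```

The keyword value is wrapped in a list, matching the example pipelet, which stores the sentiment as a one-element list.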

Available Dependencies

The following dependencies can be requested:

Dependency	Description
`cache`	Non-persisted cache.
`log`	A `logging.Logger` instance from Python's standard logging framework.
`requests`	Python requests library for executing HTTP requests.
