Data Loader Plugins are implemented by creating a Python class that inherits from the built-in DataSource class.
Methods
connect
connect(self, inc_column=None, max_inc_value=None)
This method creates the connection to the source. It can be used, for example, to connect to a database, open file pointers, or initiate network connections.
The parameters inc_column and max_inc_value are used for incremental loading. They contain the name of the column on which incremental loading is done (specified on the command line with --incremental-column) and the maximum value of that column that was received in the previous load. Well-behaved data loader plugins should implement this behavior, so that only results where inc_column >= max_inc_value are returned. The data loader takes care of all the required bookkeeping.
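For a plugin that does support incremental loading, connect can simply record these parameters so that the data retrieval step can filter on them. A minimal sketch; the attribute names here are illustrative, not part of the API:

```python
def connect(self, inc_column=None, max_inc_value=None):
    # Remember the incremental parameters; the data retrieval code would
    # then only return records where inc_column >= max_inc_value, for
    # example by adding a WHERE clause to the source query.
    self._inc_column = inc_column
    self._max_inc_value = max_inc_value
```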
Data loader plugins that do not support incremental loading should raise an error when this option is specified:
```python
def connect(self, inc_column=None, max_inc_value=None):
    if inc_column:
        raise ValueError('Incremental loading not supported.')
```
Sometimes incremental loading is only sensible and supported on one specific column. In that case, it is recommended to enforce that as well:
```python
def connect(self, inc_column=None, max_inc_value=None):
    if inc_column and inc_column != 'updated_at':
        raise ValueError('Incremental loading is only supported on the updated_at column.')
```
disconnect
disconnect(self)
Disconnect from the source if needed: for a database, close the connection; for a file, close the file handle; and so on. If nothing needs to be disconnected, this can be implemented as a simple pass:
```python
def disconnect(self):
    pass
```
getDataBatch
getDataBatch(self, batch_size)
This method returns the data in batches from the source. Use the yield keyword to iteratively "return" batches to the data loader. Return from the function when all the data has been returned.
Each batch is a list of dictionaries, where each dictionary is one record from the source and each key is the column name. This is the input used by the data loader in the various mapping options, facets and templates.
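For illustration, one yielded batch from a hypothetical user database could look like this (all column names here are made up):

```python
# One batch: a list of dicts, one dict per source record,
# keyed by column name (hypothetical example data).
batch = [
    {"id": "1", "name": "Alice", "updated_at": "2020-01-01"},
    {"id": "2", "name": "Bob", "updated_at": "2020-01-02"},
]
```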
The batch size value is either determined by the data loader automatically, or set using the command line argument --source-batch-size.
A typical implementation of getDataBatch looks like this:
```python
def getDataBatch(self, batch_size):
    start = 0
    while True:
        ret = self.getRecords(start, batch_size)
        if ret:
            yield ret
            start = start + batch_size
        else:
            return
```
getJobId
getJobId(self)
Return a unique identifier for the current job. This should take into account all the arguments. The job ID is used for locking (to prevent multiple runs of the same load) as well as for storing the incremental value.
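One way to satisfy this is to hash all the parsed arguments, so that identical invocations always map to the same job ID. A minimal sketch; the hashing scheme is an assumption, not prescribed by the API:

```python
import hashlib
import json

def getJobId(self):
    # Serialize all parsed arguments deterministically (sorted keys),
    # then hash them: the same arguments always yield the same job ID.
    args = json.dumps(vars(self.args), sort_keys=True)
    return hashlib.sha256(args.encode('utf-8')).hexdigest()
```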
getSchema
getSchema(self)
Returns the header of the data source (list containing the name of the source columns). This is used to decide the valid mapping options and to expand the wildcards inside the facets configuration file.
If this is dynamically decided, it may make sense to return all the keys from the first result from getDataBatch. In the example above (where a custom method getRecords does the actual work), this could be implemented like this:
```python
def getSchema(self):
    items = self.getRecords(0, 1)
    if items:
        # list() so this also returns a plain list on Python 3
        return list(items[0].keys())
    else:
        return []
```
getArguments
getArguments(self)
Return the list of arguments that the plugin accepts.
The result of this parsing is made available to the data loader plugin as the self.args object.
Each list item is a dictionary with the following options:
...
name: mandatory; the name of the argument. The recommended naming convention is all lower case, with words separated by underscores. An option with the name mysource_password can be passed in from the command line as --mysource-password.
...
Examples
```python
def getArguments(self):
    return [
        {
            "name": "file",
            "flag": "f",
            "help": "Excel file to load",
            "required": True,
        },
        {
            "name": "excel_sheet",
            "default": 0,
            "type": "int",
            "help": "Excel sheet name. Default: get first sheet.",
        },
    ]

def connect(self, inc_column=None, max_inc_value=None):
    # Just an example for how to access the options
    self._file = open(self.args.file)
```
Empty Plugin
This is a boilerplate template for a data loader plugin.
loader_plugin.py (Python; the code listing itself was not preserved in this export)
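A minimal sketch of such a boilerplate, implementing the methods described above. In a real plugin the class would inherit from Squirro's DataSource base class; the import is omitted here because the exact path depends on your installation:

```python
class EmptyPlugin:
    """Boilerplate data loader plugin implementing all required methods."""

    def connect(self, inc_column=None, max_inc_value=None):
        # This empty plugin does not support incremental loading.
        if inc_column:
            raise ValueError('Incremental loading not supported.')

    def disconnect(self):
        pass

    def getDataBatch(self, batch_size):
        # A real plugin yields lists of dicts here; this one has no data.
        return
        yield  # unreachable, but makes this method a generator

    def getJobId(self):
        return 'empty-plugin'

    def getSchema(self):
        return []

    def getArguments(self):
        return []
```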
This page can now be found at DataSource Class on the Squirro Docs site.