Writing a custom data loader

This page describes how to build a custom data loader for data formats or inputs that are not supported out of the box.

Prerequisites

Follow the steps outlined here: Data Loader Tutorial#Setup.

Before installing the Squirro Toolbox package, it is strongly recommended that you create a Python virtual environment to isolate its packages.

Introduction

Create a new Python file for each new data loader plugin. The Data loader plugin boilerplate template can be used to get started.

SDK reference

The plugin is implemented as a subclass of the DataSource class. A number of methods must be implemented to provide the intended functionality. These methods are all documented in DataSource Class.
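As a rough illustration of the shape such a plugin takes, here is a minimal sketch. It defines a local stand-in for DataSource so that it runs without the Squirro Toolbox installed; in a real plugin the base class comes from the SDK. The method names used here (connect, disconnect, getDataBatch, getSchema, getJobId, getArguments) and their signatures are assumptions and should be checked against the DataSource Class reference. The InMemorySource class and its sample rows are purely hypothetical.

```python
class DataSource:
    """Local stand-in for the SDK base class (assumption: a real plugin
    would import the SDK's DataSource instead of defining this)."""


class InMemorySource(DataSource):
    """Hypothetical loader that serves rows from a Python list."""

    def __init__(self, rows=None):
        # Sample data for illustration only.
        self.rows = rows if rows is not None else [
            {"id": "1", "title": "First item"},
            {"id": "2", "title": "Second item"},
            {"id": "3", "title": "Third item"},
        ]
        self.connected = False

    def connect(self, inc_column=None, max_inc_value=None):
        # A real plugin would open files or network connections here.
        self.connected = True

    def disconnect(self):
        # Release whatever connect() acquired.
        self.connected = False

    def getDataBatch(self, batch_size):
        # Yield the rows in lists of at most batch_size items.
        for start in range(0, len(self.rows), batch_size):
            yield self.rows[start:start + batch_size]

    def getSchema(self):
        # Report every field name that appears in the data.
        return sorted({key for row in self.rows for key in row})

    def getJobId(self):
        # Identifier that distinguishes this load job from others.
        return "in-memory-example"

    def getArguments(self):
        # This sketch takes no custom command line arguments.
        return []
```

With three sample rows and a batch size of 2, getDataBatch yields one batch of two rows followed by one batch of one row.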

Frontend-compatible loaders

Uploading

To make a data loader plugin available to users in the user interface, upload it to the server using the squirro_asset command line tool.

See squirro_asset Command Line Reference for full details. In short, a data loader plugin can be uploaded as follows:

Windows

squirro_asset dataloader_plugin upload --folder pubmed --token %TOKEN% --cluster %CLUSTER%

Linux

squirro_asset dataloader_plugin upload --folder pubmed --token $TOKEN --cluster $CLUSTER

Preview

Apart from technical implementation differences between command line and frontend data loading, which are not visible to users, the main consideration when writing a UI-compatible loader is the preview mode.

See Data loader plugin preview for details.

Preview mode is a UI feature that lets the user peek at the data before it is ingested into the system. It shows a preview of the first 10 items. For most use cases this does not present difficulties, but in a few cases it might result in data loss.
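This page does not say which cases are affected. One plausible case, assumed here for illustration, is a source that is consumed when read (such as a queue): items fetched for the preview would be removed from the source without ever being ingested. The sketch below shows one way to guard against that. The read_items function and its preview_mode flag are hypothetical names; see Data loader plugin preview for how the SDK actually signals preview mode.

```python
from collections import deque
from itertools import islice

PREVIEW_LIMIT = 10  # the preview shows the first 10 items


def read_items(queue, preview_mode, limit=PREVIEW_LIMIT):
    """Return items from a deque without consuming them during preview."""
    if preview_mode:
        # Copy the first `limit` items and leave the queue untouched,
        # so a later full load still sees them.
        return list(islice(queue, limit))
    # Full load: drain the queue, removing each item as it is read.
    items = []
    while queue:
        items.append(queue.popleft())
    return items
```

The design choice is that preview only reads and never changes the source, so running a preview before the real load does not alter what the real load ingests.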

Caching & Data storage

Data loader plugins often need to cache information or store progress information. Two types of stores are available inside a data loader plugin for these purposes:

key_value_cache
key_value_store

This is covered in Data loader API for Caching and Custom State Management.
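As a sketch of how the two stores might divide the work, the example below keeps progress (the last loaded id) in key_value_store and reuses expensive lookups through key_value_cache. It assumes both behave like Python dictionaries, which should be verified against the API page above; plain dicts can stand in for them when trying this out. The function load_new_items, its fetch_page and lookup_author callbacks, and the field names are all hypothetical.

```python
def load_new_items(fetch_page, key_value_store, key_value_cache, lookup_author):
    """Fetch items newer than the stored cursor and enrich them via the cache.

    Assumption: both stores support dict-style get/set and `in` checks.
    """
    # Progress information: resume after the last id seen in an earlier run.
    cursor = key_value_store.get("last_id", 0)
    items = fetch_page(cursor)

    for item in items:
        author_id = item["author_id"]
        # Cached information: only perform the lookup on a cache miss.
        if author_id not in key_value_cache:
            key_value_cache[author_id] = lookup_author(author_id)
        item["author"] = key_value_cache[author_id]

    if items:
        # Record how far this run got so the next run skips these items.
        key_value_store["last_id"] = items[-1]["id"]
    return items
```

Called twice against the same unchanged source, the second call returns an empty list because the stored cursor already points past the last item.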