
This tutorial goes step by step through using the Data Loader. It starts by uploading a CSV file to Squirro; in the last step a custom source script is written to handle a proprietary data format.


Setup

Installation

To get started, make sure the Squirro Toolbox is installed. The Toolbox page also contains basic information on how to execute the commands contained in the Squirro Toolbox.

Folder Setup

For this tutorial create a new folder on your file system and use the command line to go into that folder. On Windows this is achieved with the following procedure:

  1. Start the "Command Prompt" from the Start menu.
  2. Type "cd FOLDER" where FOLDER is the path to the folder where you are executing the tutorial (see the example below).
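
For example, assuming the tutorial folder is C:\squirro-tutorial (a purely hypothetical path, any folder works), the commands are:

Folder setup
mkdir C:\squirro-tutorial
cd C:\squirro-tutorial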

Download Example Data

Please download the file interaction.csv and place it in the tutorial folder.
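
The file uses the column names referenced throughout this tutorial (the column order shown here is assumed); each subsequent row describes one interaction. This is only a sketch of the structure, not the actual file contents:

interaction.csv (header row)
InteractionID,Date,InteractionType,InteractionSubject,Attendees,NoAttendees,DurationMinutes,Notes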

Set up Connection

The data loader connects with a Squirro instance. For this, three pieces of information are required:

  • cluster
  • token
  • project-id

A full explanation of where to find this information can be found in Connecting to Squirro.

Once you have this information, go to the command prompt window and type the following commands:

Setting variables
set CLUSTER="..."
set TOKEN="..."
set PROJECT_ID="..."

Replace all the "..." placeholders with their actual values.
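
On a Linux or macOS shell the equivalent commands use export instead of set:

Setting variables (bash)
export CLUSTER="..."
export TOKEN="..."
export PROJECT_ID="..."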

CSV Import

The first task is to import the interaction.csv file into Squirro.

Step 1: Empty Invocation

Initially simply execute the squirro_data_load command with the --help argument. This will output a full list of options.

Data load command
squirro_data_load --help

If you get a help output back, this confirms that the Data Loader has been correctly installed.

Full reference documentation, which is a bit easier to read than this output, is available in the Data Loader Reference.

Step 2: Provide Configuration

The data loader command needs information about where to find the Squirro server, how to log in and into which project to import the data. In the section Set up Connection above we created variables for these settings. They are used as follows:

Data load command
squirro_data_load ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN%

This command will fail, because further parameters are required. Those will be provided below.

Using variables here is optional and how exactly to insert them depends on the environment. The example here assumes a Microsoft Windows command prompt or Batch file.
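
For reference, on a Linux or macOS shell the same invocation uses a backslash for line continuation and $-prefixed variables:

Data load command (bash)
squirro_data_load \
    --cluster "$CLUSTER" \
    --project-id "$PROJECT_ID" \
    --token "$TOKEN"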

Alternatively the parameters can also be hard-coded into the command. In that case, the command would look something like this:

Data load command
squirro_data_load ^
    --cluster https://squirro-cluster.example.com ^
    --project-id pB0JyUihQsOXGyLaUiFPDw ^
    --token 2df…6ba 

All further examples on this page use the variable approach to facilitate copy/paste of the examples.

Step 3: Select source type and source file

The next step is to specify what data source is being imported. The first parameter for that is --source-type, which selects the built-in source type to use. Depending on the source type, additional parameters are required. For the "csv" source type that is --source-file, which points to the path where the CSV file can be found.

Data load command
squirro_data_load ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv

This command will still fail, because the mapping is missing. That is the next step.

Step 4: Data mapping

Squirro has a specific item format into which data is imported. The data loader needs to know which input fields correspond to which Squirro item field. That is done mainly through the --map-* arguments, such as --map-id or --map-title.
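
As a rough sketch (simplified, and limited to the fields used in this tutorial), a Squirro item looks like this:

Squirro item (simplified sketch)
{
    "id": "unique identifier of the item",
    "title": "title shown in search results",
    "created_at": "creation date of the item",
    "abstract": "short summary shown in result lists",
    "body": "HTML body of the item",
    "keywords": {"Type": ["Meeting"]}
}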

The following command is the first one that will execute and actually import some items from the CSV file:

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject

Open the Squirro project to see the four interactions that have been indexed into it. They only have a title and no other information yet.

Step 5: Map additional data

By mapping a few additional fields, the Squirro items are almost complete. The following command adds the creation date as well as a body.

This command also introduces the --source-name argument which gives the source a readable name. Before executing this command it's recommended to remove the previous source (called "Upload") from the project.

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)"

Step 6: Configure facets

Squirro items can have keywords. These keywords are used for facet navigation, to easily group data in dashboards or navigate on semi-structured data. The example data contains a column called "InteractionType" which gives each interaction a type. That type should be made available in the Squirro user interface for easy searching and navigation.

For that, create a new file in the tutorial folder called "facets.json". That file provides a mapping of input data to the facets in Squirro.

facets.json
{
    "InteractionType": {
        "name": "Type"
    }
}

This example uses the "InteractionType" column from the input data and creates a keyword called "Type" in Squirro.

To make use of this, the import command is extended with the --facets-file argument:

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json

When executing this, the Type facet is now available in the user interface. For example, the following search query now only returns meetings: "Type:Meeting" (enter without the quotes).

Step 7: Facet data types

By default, facets are plain text. Squirro also supports numeric and date data types for facets, and there is a simple configuration option to accept multi-value facets. The following example adds three new facets to the facets.json file:

facets.json
{
    "InteractionType": {
        "name": "Type"
    },
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}

The import command remains unchanged.

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json

When reloading the data in Squirro, note that Attendees has been imported with multiple values for each item. The duration and the number of attendees have been imported as numeric facets. Numeric facets can be used in dashboards for calculations, such as outputting the average or sum of values over time.

Step 8: Templating

So far the Squirro items contained a simple body that was taken directly from the CSV file. Often it is necessary to show more complex content, for example by combining multiple columns or adding conditions.

Add a new file called template-body.html. This is a Jinja2 template for formatting the body of the Squirro items.

template-body.html
<table>
    <tr>
        <th>Type</th>
        <td>{{ row.InteractionType }}</td>
    </tr>
    <tr>
        <th>Attendees</th>
        <td>{{ item.keywords.Attendees | join(', ') | e}}</td>
    </tr>
    <tr>
        <th>Duration</th>
        <td>{{ row.DurationMinutes }} minutes</td>
    </tr>
</table>

<hr>

<p>{{ row.Notes | e}}</p>  

This template adds a summary table before the interaction notes which shows some of the metadata of the interaction.
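
Inside the template both the original input row (row) and the mapped Squirro item (item) are available, and Jinja2 conditions can be used as well. A minimal, hypothetical variation of the attendees row from the table above that is only rendered when the column actually has a value:

template-body.html (conditional example)
{% if row.Attendees %}
    <tr>
        <th>Attendees</th>
        <td>{{ item.keywords.Attendees | join(', ') | e }}</td>
    </tr>
{% endif %}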

The facets file remains unmodified.

facets.json
{
    "InteractionType": {
        "name": "Type"
    },
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}

The command is changed slightly. First, the --map-body is replaced with --map-abstract. This ensures that the notes are still used for the overview in the search result screen. Then the --body-template-file parameter is used to specify the template that has been created above.

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-abstract Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json ^
    --body-template-file template-body.html 

After loading the data with this command, the individual interactions look a lot more readable in Squirro and give the user the essential information at a glance.

Step 9: Pipelets

In the Squirro pipeline, pipelets are used for enrichments that happen during data import. To re-use the power of this approach locally, the data loader also allows specification of pipelets. The interface is identical, so pipelets can be used locally with the data loader and/or remotely with the Squirro pipeline.

The goal of this example is to convert the interaction duration from minutes to hours.

As a first step, the facet configuration is extended. This now contains a new facet "Duration (Hours)" which uses the float data type.

facets.json
{
    "InteractionType": {
        "name": "Type"
    },
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "DurationHours": {
        "name": "Duration (Hours)",
        "data_type": "float"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}

Next a pipelet configuration file is created which declares all the pipelets that are to be run.

pipelets.json
{
    "DurationConversionPipelet": {
        "file_location": "duration_conversion.py",
        "stage": "before templating"
    }
}

This configuration references a DurationConversionPipelet which is declared in the file duration_conversion.py. This is a code file written in the Python programming language. Its contents are as follows and are quite easy to follow with the inline comments.

duration_conversion.py
"""Pipelet to convert the duration from minutes to hours.

Takes the keyword `Duration` and moves the values into a new `Duration (Hours)`
keyword.
"""
from squirro.sdk import PipeletV1

class DurationConversionPipelet(PipeletV1):
    def consume(self, item):
        """This method is called once for each item to be processed."""

        # Names of the keywords (before and after)
        orig_kw_name = 'Duration'
        new_kw_name = 'Duration (Hours)'

        keywords = item.get('keywords', {})
        if orig_kw_name in keywords:
            # Convert every existing value to hours
            new_values = []
            for value in item.get('keywords', {}).get(orig_kw_name, []):
                new_values.append(round(float(value) / 60, 2))

            # Replace orig with new keyword
            del item['keywords'][orig_kw_name]
            item['keywords'][new_kw_name] = new_values

        # Return the modified item
        return item

The workings of this pipelet are quite simple. The main logic is the conversion round(float(value) / 60, 2). The rest of the code is tasked with correctly handling multiple values in the keyword (all keywords in Squirro can have multiple values attached) and switching the keyword name.
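
As an illustration of the effect (the values are hypothetical), an item's keywords before and after the pipelet runs:

Keyword conversion (illustration)
# Before the pipelet
{'keywords': {'Type': ['Meeting'], 'Duration': [90]}}

# After the pipelet
{'keywords': {'Type': ['Meeting'], 'Duration (Hours)': [1.5]}}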

Everything comes together again with the data load command. The only change needed is the --pipelets-file argument which points to the pipelets configuration file created above.

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json ^
    --pipelets-file pipelets.json 

When executing this command, the items are loaded into Squirro. But now the duration has been changed to a float value representing the duration in hours instead of minutes.

Custom Data Source

The data loader can easily be extended to implement a custom data source. In this example a quick loader is implemented that can handle PubMed data in the Medline format. PubMed is a database of scientific publications for biomedical literature. The Medline format can be retrieved from the site using a simple export.

For this example you can use a list of 106 articles that have been manually extracted. Download the file pubmed.zip and extract it into the tutorial folder. This should create a folder called "pubmed".

A sample file in this folder looks as follows:

26785463.txt (PubMed Medline example)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<pre>
PMID- 26785463
OWN - NLM
STAT- Publisher
DA  - 20160119
LR  - 20160119
IS  - 1095-9203 (Electronic)
IS  - 0036-8075 (Linking)
VI  - 350
IP  - 6265
DP  - 2015 Dec 4
TI  - Teamwork: The tumor cell edition.
PG  - 1174-1175
FAU - Cleary, Allison S
AU  - Cleary AS
AD  - Pennsylvania State University College of Medicine, Hershey PA 17078, USA.
      acleary@hmc.psu.edu.
LA  - ENG
PT  - JOURNAL ARTICLE
TA  - Science
JT  - Science (New York, N.Y.)
JID - 0404511
CRDT- 2016/01/20 06:00
AID - 350/6265/1174 [pii]
AID - 10.1126/science.aad7103 [doi]
PST - ppublish
SO  - Science. 2015 Dec 4;350(6265):1174-1175.
</pre>

It quickly becomes obvious that this is mostly a textual format consisting of key/value pairs.

To import this format, start by specifying the data load command.

Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-script medline.py ^
    --source-path pubmed ^
    --map-id PMID ^
    --map-title TI ^
    --map-created-at DA ^
    --map-body AB ^
    --source-name "PubMed" ^
    --facets-file facets.json

There is one key change in this command: instead of the --source-type argument, it uses --source-script. That script, which is created below, defines how the Medline data is processed.

The mapping is done using the Medline field keys, most of which can be seen in the example above.

The facets file is also quite straightforward and makes sure that some of those fields are indexed as item keywords.

facets.json
{
    "DA": {
        "data_type": "datetime",
        "input_format_string": "%Y%m%d"
    },
    "JT": {
        "name": "Journal"
    },
    "PT": {
        "name": "Publication Type"
    },
    "PST": {
        "name": "Publication Status"
    },
    "OWN": {
        "name": "Owner"
    },
    "STAT": {
        "name": "Status"
    },
    "FAU": {
        "name": "Author",
        "delimiter": "|"
    }
}

The last step is to create the actual data source. That is a bit more involved; the main blocks are commented below. The goal of this data source is to go through all the Medline files on disk (as specified with the --source-path argument) and return one dictionary for each of those files. That dictionary is then processed by the data loader through the mappings, facet configuration, templates, etc. in exactly the same way as if it had come straight from a CSV file or a SQL database.
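
For the sample file shown above, the dictionary returned for that record would look roughly like this (abbreviated; multi-value keys such as IS are joined with a pipe, as implemented below):

Parsed record (illustrative excerpt)
{
    'PMID': '26785463',
    'TI': 'Teamwork: The tumor cell edition.',
    'DA': '20160119',
    'IS': '1095-9203 (Electronic)|0036-8075 (Linking)',
    'FAU': 'Cleary, Allison S',
    # ... remaining keys from the file; keys not present in the file are set to None
}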

medline.py
# -*- coding: utf-8 -*-
"""Data source implementation for PubMed Medline data.

Data is expected to be on disk, hierarchically stored in the `source_path`.
"""
import codecs
import collections
import logging
import os

from squirro.dataloader.data_source import DataSource

log = logging.getLogger(__name__)


KEYS = ['AB', 'CI', 'AD', 'IRAD', 'AID', 'AU', 'AUID', 'FAU', 'BTI', 'CTI',
        'CN', 'CRDT', 'DCOM', 'DA', 'LR', 'DEP', 'DP', 'EN', 'ED', 'FED',
        'EDAT', 'GS', 'GN', 'GR', 'IR', 'FIR', 'ISBN', 'IS', 'IP', 'TA', 'JT',
        'LA', 'LID', 'MID', 'MHDA', 'MH', 'JID', 'RF', 'OAB', 'OABL', 'OCI',
        'OID', 'OT', 'OTO', 'OWN', 'PG', 'PS', 'FPS', 'PL', 'PHST', 'PST',
        'PT', 'PUBM', 'PMC', 'PMCR', 'PMID', 'RN', 'NM', 'SI', 'SO', 'SFM',
        'STAT', 'SB', 'TI', 'TT', 'VI', 'VTI']


class MedLineSource(DataSource):
    def __init__(self):
        self.args = None

    def connect(self, inc_column=None, max_inc_value=None):
        """Create connection with the source."""
        if not os.path.isdir(self.args.source_path):
            raise IOError("Folder {} does not exist".format(self.args.source_path))

    def disconnect(self):
        """Disconnect from the source."""
        pass

    def getDataBatch(self, batch_size):
        """
        Generator - Get data from source on batches.

        :returns a list of dictionaries
        """
        for root, dirs, files in os.walk(self.args.source_path):
            items = []
            for fname in files:
                item = self._parse_file(os.path.join(root, fname))
                if item and (not item.get('TI') or not item.get('OWN')):
                    log.warn('Missing data %r', fname)
                elif item:
                    items.append(item)
            if items:
                yield items

    def getJobId(self):
        """
        Return a unique string for each different select
        :returns a string
        """
        return os.path.basename(self.args.source_path)

    def getSchema(self):
        """
        Return the schema of the data set
        :returns a List containing the names of the columns retrieved from the source
        """
        return KEYS

    def getArguments(self):
        """
        Return source arguments.
        """
        return [
            {
                "name": "source_path",
                "help": "Path of MedLine data folder.",
            }
        ]

    def _parse_file(self, file_name):
        """
        :param file_name: Medline text file
        :return: Dictionary with all the key/value pairs from the file.
                 Multi-value keys are joined with a pipe (`|`).
        """
        ret = collections.defaultdict(list)
        key = None
        value = None

        try:
            with codecs.open(file_name, encoding='utf8') as file:
                for line in file:
                    if 'Error occurred:' in line:
                        log.warn("Encountered error in file: %s", file_name)
                        return None
                    if line[0] == '<':
                        # Ignore the XML lines at the beginning and end.
                        continue
                    elif line[0:4].strip():
                        # This introduces a new key / value
                        if key:
                            ret[key].append(value)
                        key = line[0:4].strip()
                        value = line[6:].strip()
                    elif line.strip():
                        # No new key, this is a continuation of the value from
                        # the last key.
                        value += ' ' + line.strip()
                if key:
                    ret[key].append(value)

        except Exception as err:
            log.error("Problem parsing file: %s with error %r", file_name, err)

        item = {}
        for key, value in ret.items():
            item[key] = '|'.join(value)
        for key in KEYS:
            if key not in item:
                item[key] = None
        return item

Conclusion

This concludes the tutorial - a whirlwind tour through the main features of the Squirro data loader.

For detailed documentation, please refer to the Data Loader Reference.
