...
This tutorial goes step by step through using the Data Loader. It starts by uploading a CSV file to Squirro. In the last step, a custom source script is implemented to support a proprietary data format.
Setup
Installation
To get started, make sure the Toolbox is installed. The Toolbox page also contains basic information on how to execute the commands contained in the Squirro Toolbox.
...
For this tutorial, create a new folder on your file system and use the command line to navigate into that folder. On Windows this is achieved with the following procedure:
Start the "Command Prompt" from the Start menu.
Type "cd FOLDER" where FOLDER is the path to the folder where you are executing the tutorial.
Download Example Data
Please download the file interaction.csv and place it in the tutorial folder.
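If the download is not at hand, a structurally similar file can be generated locally. This is only a sketch: the column names are taken from the field mappings used later in this tutorial, while the row values are invented sample data, not the contents of the real interaction.csv.

```python
import csv

# Columns inferred from the --map-* options and facets.json settings
# used later in this tutorial; the row values are invented samples.
FIELDNAMES = [
    "InteractionID", "InteractionSubject", "Date", "Notes",
    "InteractionType", "Attendees", "DurationMinutes", "NoAttendees",
]

rows = [
    {
        "InteractionID": "1",
        "InteractionSubject": "Quarterly review call",
        "Date": "2023-01-15",
        "Notes": "Discussed roadmap and open issues.",
        "InteractionType": "Call",
        "Attendees": "Alice;Bob",
        "DurationMinutes": "30",
        "NoAttendees": "2",
    },
]

# Write a minimal CSV with a header row, as the loader expects.
with open("interaction.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```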
...
The data loader connects with a Squirro instance. For this, three pieces of information are required:
cluster
token
project-id
Full instructions on where to find this information can be found in Connecting to Squirro.
Once you have this information, go to the command prompt window and type the following commands:
Setting variables
```shell
set CLUSTER="..."
set TOKEN="..."
set PROJECT_ID="..."
```
...
To start, simply execute the squirro_data_load command with the --help argument. This will output a full list of options.
Data load command
```shell
squirro_data_load --help
```
If you get a help output back, this confirms that the Data Loader has been correctly installed.
...
The data loader command needs information about where to find the Squirro server, how to log in, and into which project to import the data. Previously, in the section Set up Connection, we created variables for these settings. These variables are used as follows:
Data load command
```shell
squirro_data_load ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN%
```
...
Alternatively the parameters can also be hard-coded into the command. In that case, the command would look something like this:
Data load command
```shell
squirro_data_load ^
    --cluster https://squirro-cluster.example.com ^
    --project-id pB0JyUihQsOXGyLaUiFPDw ^
    --token 2df…6ba
```
...
The next step is to specify what data source is being imported. The first parameter for that is --source-type, which selects the built-in source type to use. Depending on the source type, additional parameters are required. For the "csv" source type that is --source-file, which points to the path where the CSV file can be found.
Data load command
```shell
squirro_data_load ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv
```
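Conceptually, the csv source type reads the file row by row, with the header line providing the field names that the --map-* options refer to. The following sketch illustrates that behaviour on an invented two-column sample; it is not the loader's actual implementation.

```python
import csv
import io

# Hypothetical sample mirroring the interaction.csv layout; in
# practice the loader reads the file given via --source-file.
sample = io.StringIO(
    "InteractionID,InteractionSubject,Date\n"
    "1,Quarterly review call,2023-01-15\n"
)

# Each CSV row becomes a dict keyed by the header names, which is
# what options such as --map-id and --map-title select from.
items = []
for row in csv.DictReader(sample):
    items.append({
        "id": row["InteractionID"],          # --map-id InteractionID
        "title": row["InteractionSubject"],  # --map-title InteractionSubject
    })

print(items)  # [{'id': '1', 'title': 'Quarterly review call'}]
```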
...
The following command is the first one that will execute and actually import some items from the CSV file:
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject
```
...
This command also introduces the --source-name argument, which gives the source a readable name. Before executing this command it's recommended to remove the previous source (called "Upload") from the project.
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)"
```
...
For that, create a new file in the tutorial folder called "facets.json". That file provides a mapping of input data to the facets in Squirro.
facets.json
```json
{
    "InteractionType": {"name": "Type"}
}
```
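The effect of this mapping can be sketched in a few lines: a CSV column listed in the configuration becomes a facet (keyword) stored under its configured display name. This is an illustration of the mapping semantics, not the loader's actual code, and the sample row is invented.

```python
# facets.json maps input column names to Squirro facet definitions.
facets_config = {
    "InteractionType": {"name": "Type"},
}

# A hypothetical CSV row as the loader would see it.
row = {"InteractionID": "1", "InteractionType": "Call"}

# Only columns listed in the config become keywords, stored under
# their configured display name; other columns are left alone.
keywords = {
    cfg["name"]: [row[column]]
    for column, cfg in facets_config.items()
    if column in row
}

print(keywords)  # {'Type': ['Call']}
```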
...
To make use of this, the import command is extended with the --facets-file argument:
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json
```
...
By default, facets are plain text. Squirro also supports numeric and date types for facets, and a simple configuration option enables multi-value facets. The following example adds three new facets to the facets.json file:
facets.json
```json
{
    "InteractionType": {"name": "Type"},
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}
```
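The two new options can be pictured as follows: a delimiter splits one cell into a multi-value facet, and a data_type of "int" coerces the string values to integers. This is an illustrative sketch with an invented row, not the loader's internal code.

```python
# Subset of the facets.json configuration from above.
facets_config = {
    "Attendees": {"name": "Attendees", "delimiter": ";"},
    "DurationMinutes": {"name": "Duration", "data_type": "int"},
}

# A hypothetical CSV row.
row = {"Attendees": "Alice;Bob", "DurationMinutes": "30"}

keywords = {}
for column, cfg in facets_config.items():
    value = row[column]
    # A delimiter turns one cell into a multi-value facet.
    values = value.split(cfg["delimiter"]) if "delimiter" in cfg else [value]
    # data_type "int" coerces the string values to integers.
    if cfg.get("data_type") == "int":
        values = [int(v) for v in values]
    keywords[cfg["name"]] = values

print(keywords)  # {'Attendees': ['Alice', 'Bob'], 'Duration': [30]}
```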
The import command remains unchanged.
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json
```
...
Add a new file called template-body.html. This is a Jinja2 template for formatting the body of the Squirro items.
template-body.html
```html
<table>
    <tr>
        <th>Type</th>
        <td>{{ row.InteractionType }}</td>
    </tr>
    <tr>
        <th>Attendees</th>
        <td>{{ item.keywords.Attendees | join(', ') | e }}</td>
    </tr>
    <tr>
        <th>Duration</th>
        <td>{{ row.DurationMinutes }} minutes</td>
    </tr>
</table>
<p>{{ row.Notes | e }}</p>
```
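Such a template can be previewed locally before running an import. The sketch below renders a shortened, hypothetical template with the Jinja2 library (a third-party package, installable with pip install Jinja2); the row and item values are invented for illustration.

```python
from jinja2 import Template  # third-party: pip install Jinja2

# A shortened stand-in for template-body.html, using the same
# row/item variables and filters as the real template.
template = Template(
    "<p>{{ row.InteractionType }}: "
    "{{ item.keywords.Attendees | join(', ') | e }}</p>"
)

row = {"InteractionType": "Call"}
item = {"keywords": {"Attendees": ["Alice", "Bob"]}}

html = template.render(row=row, item=item)
print(html)  # <p>Call: Alice, Bob</p>
```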
...
The facets file remains unmodified.
facets.json
```json
{
    "InteractionType": {"name": "Type"},
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}
```
The command is changed slightly. First, --map-body is replaced with --map-abstract. This ensures that the notes are still used for the overview in the search result screen. Then the --body-template-file parameter is used to specify the template that has been created above.
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-abstract Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json ^
    --body-template-file template-body.html
```
...
Step 9: Pipelets
In the Squirro pipeline, pipelets are used for enrichments that happen during data import. To re-use the power of this approach locally, the data loader also allows specification of pipelets. The interface is identical, so pipelets can be used locally with the data loader and/or remotely with the Squirro pipeline.
...
As a first step, the facet configuration is extended. This now contains a new facet "Duration (Hours)" which uses the float data type.
facets.json
```json
{
    "InteractionType": {"name": "Type"},
    "Attendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "DurationMinutes": {
        "name": "Duration",
        "data_type": "int"
    },
    "DurationHours": {
        "name": "Duration (Hours)",
        "data_type": "float"
    },
    "NoAttendees": {
        "name": "Attendee Count",
        "data_type": "int"
    }
}
```
Next a pipelet configuration file is created which declares all the pipelets that are to be run.
pipelets.json
```json
{
    "DurationConversionPipelet": {
        "file_location": "duration_conversion.py",
        "stage": "before templating"
    }
}
```
This configuration references a DurationConversionPipelet, which is declared in the file duration_conversion.py. This is a code file written in the Python programming language. The contents are as follows and are easy to follow with the inline comments.
...
duration_conversion.py
```python
"""Pipelet to convert the duration from minutes to hours.

Takes the keyword `Duration` and moves the values into a new
`Duration (Hours)` keyword.
"""
from squirro.sdk import PipeletV1


class DurationConversionPipelet(PipeletV1):
    def consume(self, item):
        """This method is called once for each item to be processed."""
        # Names of the keywords (before and after)
        orig_kw_name = 'Duration'
        new_kw_name = 'Duration (Hours)'

        keywords = item.get('keywords', {})
        if orig_kw_name in keywords:
            # Convert every existing value to hours
            new_values = []
            for value in item.get('keywords', {}).get(orig_kw_name, []):
                new_values.append(round(float(value) / 60, 2))

            # Replace orig with new keyword
            del item['keywords'][orig_kw_name]
            item['keywords'][new_kw_name] = new_values

        # Return the modified item
        return item
```
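Because squirro.sdk may not be installed on the local machine, the conversion logic can be smoke-tested with a minimal stand-in base class. This sketch assumes the base class only needs to provide the consume() interface, which is what the loader calls for each item; the sample item is invented.

```python
# Minimal stand-in for squirro.sdk.PipeletV1, just enough to run
# the pipelet logic outside Squirro (assumption: the real base
# class offers more, but consume() is the method that matters here).
class PipeletV1:
    def consume(self, item):
        raise NotImplementedError


class DurationConversionPipelet(PipeletV1):
    def consume(self, item):
        # Same conversion as in duration_conversion.py above.
        orig_kw_name = 'Duration'
        new_kw_name = 'Duration (Hours)'

        keywords = item.get('keywords', {})
        if orig_kw_name in keywords:
            new_values = [round(float(v) / 60, 2)
                          for v in keywords[orig_kw_name]]
            del item['keywords'][orig_kw_name]
            item['keywords'][new_kw_name] = new_values
        return item


# A 90-minute interaction becomes 1.5 hours.
item = {'keywords': {'Duration': [90]}}
result = DurationConversionPipelet().consume(item)
print(result['keywords'])  # {'Duration (Hours)': [1.5]}
```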
...
Everything comes together again with the data load command. The only changes needed are the --pipelets-file argument, which points to the pipelets configuration file created above, and the --transform-items argument.
Data load command
```shell
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-created-at Date ^
    --map-body Notes ^
    --source-name "Interactions (CSV)" ^
    --facets-file facets.json ^
    --pipelets-file pipelets.json ^
    --transform-items
```
...
For detailed documentation, please refer to the following documentation pages:

- Data Loader Reference: a full reference of the command line arguments.
- Data Loader Facet Config Reference: a reference of both the facets.json and pipelets.json files.
- Data Loader Examples: additional examples on how to use the data loader.