Data Loader Examples



This page shows a number of examples for using the Data Loader.

For full reference documentation, please see the Data Loader Reference. For a step-by-step tutorial, please refer to the Data Loader Tutorial.


Sample data

The Data Loader tool requires a tabular input format (CSV, Excel, or a database connection). All three data sources are explained below, with an example for each.

Excel

The following sample Excel file is used to highlight the Squirro Data Loader capabilities for loading an Excel file. The file contains a single header row and three data rows.

Example:

interaction.xlsx (sample Excel file)



CSV

The following sample CSV file is used to highlight the Squirro Data Loader capabilities for loading a CSV file. The file contains a single header row and three data rows. The sample file uses the following formatting options:

  • Field delimiter: , (comma)

  • Text delimiter used for fields with multiple values: ; (semicolon)

Example:

InteractionID,InteractionSubject,InteractionType,Date,Notes,InternalAttendees,NoAtendees,EntittledUsers,DateModifiedOn
100AAA,Tesla,Phone Call,2015-4-23T00:00:00,note1,Joe Average;Jane Houseowner;Elon Musk,3,jdoe;jhouseowner;emusk,4/23/2015 16:58
102BBB,PayPal,Meeting,2015-4-24T00:00:00,note2,Joe Average; Elon Musk,2,jdoe;emusk,4/24/2015 16:58
105CCC,Tesla,Meeting,2015-4-25T00:00:00,note3,Joe Average; Jane Houseowner,2,jdoe;jhouseowner,4/25/2015 16:58



Database

The following sample database inserts are used to highlight the capabilities of the Data Loader when loading from a database table. The example below is for a SQL Server database.

The database table creation:

CREATE TABLE Interaction_test (
    InteractionId varchar(10) NOT NULL,
    InteractionSubject varchar(500) NOT NULL,
    InteractionType varchar(100) NOT NULL,
    Date date,
    Notes varchar(max),
    InternalAttendees varchar(500),
    NoAttendees int,
    EntittledUsers varchar(100),
    DateModifiedOn datetime NOT NULL
);

Example (DB INSERTS):

INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees, EntittledUsers, DateModifiedOn)
VALUES ('100AAA', 'Tesla', 'Phone Call', '4/23/2015', 'note1', 'Joe Average;Jane Houseowner;Elon Musk', 3, 'jdoe;jhouseowner;emusk', '4/23/2015 16:58');

INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees, EntittledUsers, DateModifiedOn)
VALUES ('102BBB', 'PayPal', 'Meeting', '4/24/2015', 'note2', 'Joe Average;Elon Musk', 2, 'jdoe;emusk', '4/24/2015 16:58');

INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees, EntittledUsers, DateModifiedOn)
VALUES ('105CCC', 'Tesla', 'Meeting', '4/25/2015', 'note3', 'Joe Average;Jane Houseowner', 2, 'jdoe;jhouseowner', '4/25/2015 16:58');

Data mapping

Some (or all) of the source columns can be mapped to the Squirro item fields using command line arguments. In order for items to be loaded into Squirro, either the title or the body needs to be specified; all other fields are optional. Note that the --map-flag and --map-id parameters are mutually exclusive.

The available item fields are listed below (mandatory fields are marked with *):

  • title (mandatory*) - Item title. Example: "Conversation1"

  • abstract - Item summary. Example: "Phone call"

  • created_at - Item creation date and time (see Date and Time). Example: 02/19/2016

  • id - External item identifier. Maximum length is 256 characters. Example: 100AAA

  • body (mandatory*) - HTML markup with full representation of the item content. Example: Note1

  • url - URL which points to the item origin. Example: example.com

  • file_name - LOB file name

  • file_mime - LOB file MIME type

  • file_data

  • file_compressed

  • flag


*It is mandatory that at least one of the title or body fields is mapped.

Usage

You can find a simple example of basic mapping below. For this example we use the interaction.csv file provided in the Sample Data section. We map some of the fields from the CSV file to the following item fields: title, id, abstract, created_at, and three fields for the item body.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
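A minimal sketch of such a call is shown below. The --token, --cluster and --project-id values are placeholders, the mapping flag names (--map-title, --map-id, --map-abstract, --map-created-at, --map-body) are assumed to follow the standard Data Loader command line options, --map-body is assumed to accept several columns, and the three body columns chosen here are illustrative:

squirro_data_load -v ^
    --token %TOKEN% ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --source-name csv_interactions ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --map-abstract InteractionType ^
    --map-created-at Date ^
    --map-body Notes InternalAttendees EntittledUsers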

Simple usage and flow control

In this section we will describe and show the usage of the arguments influencing the load type.

Data source arguments: source-name, source-script, source-type and source-file (input-file)

There are three load examples below, one for each source type: CSV, Excel and database. The --source-name argument is used to set the name of the source in Squirro. The --source-type argument tells the Data Loader what kind of source is being loaded. The --source-script parameter, if set, specifies the location of a custom DataSource Python script. Note that the arguments --source-type and --source-script are mutually exclusive.

The --source-file argument specifies the path to the file from which the data will be loaded. For a database load, the --input-file argument is used instead, to specify the SQL statement used to retrieve data from the database.

CSV Example
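A sketch of a CSV load (connection values are placeholders; the mapping flags are assumptions based on the standard Data Loader options):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject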



Excel Example
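A sketch of an Excel load, analogous to the CSV call (connection values are placeholders; mapping flags assumed as above):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name excel_sample ^
    --source-type excel ^
    --source-file interaction.xlsx ^
    --map-id InteractionID ^
    --map-title InteractionSubject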



SQL Example

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
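A sketch of a database load. The --db-connection flag and its SQLAlchemy-style connection string are assumptions, and the connection values are placeholders:

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name db_sample ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/dbname" ^
    --input-file interaction.sql ^
    --map-id InteractionId ^
    --map-title InteractionSubject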

interaction.sql (used for SQL example)

Note: the SQL file must contain only one query, and it must not end with a semicolon (;).
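A minimal sketch of interaction.sql, selecting from the sample table defined above:

SELECT
    InteractionId,
    InteractionSubject,
    InteractionType,
    Date,
    Notes,
    InternalAttendees,
    NoAttendees,
    EntittledUsers,
    DateModifiedOn
FROM Interaction_test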

In Squirro, in the Data tab, you can see the result of the loads from the three different sources:



Database reset and incremental-column arguments

For database loads, an incremental load can be performed. In the example below, an incremental load is set up using the --incremental-column argument to specify the date field used as the incremental reference.

First run the initial load, which will be a full load.

Note: The full load does not contain the --incremental-column argument. If an incremental load is executed after a full load, it will perform an initial load and re-index everything, since there is no information about the last loaded row from the full load.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
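A sketch of the initial full load, without the --incremental-column argument (the --db-connection flag, its connection string and the connection values are assumptions):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name db_incremental_sample ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/dbname" ^
    --input-file interaction.sql ^
    --map-id InteractionId ^
    --map-title InteractionSubject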

Another row was inserted in the meantime:

When re-executing incremental load, we should only have one row loaded into Squirro, the row that was just inserted in the database.
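A sketch of the incremental run, using the DateModifiedOn column of the sample table as the incremental reference (same assumptions as above):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name db_incremental_sample ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/dbname" ^
    --input-file interaction.sql ^
    --map-id InteractionId ^
    --map-title InteractionSubject ^
    --incremental-column DateModifiedOn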



When the --reset argument is used together with --incremental-column, the incremental information is ignored and a full load is performed, so all three rows will be loaded:

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
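A sketch of the reset run, which is the incremental command above with the --reset argument added (same assumptions as above):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name db_incremental_sample ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/dbname" ^
    --input-file interaction.sql ^
    --map-id InteractionId ^
    --map-title InteractionSubject ^
    --incremental-column DateModifiedOn ^
    --reset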

ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size

We can set these arguments on the command line and they are passed directly to the ItemUploader module; they are not changed anywhere in the Data Loader. You can find more information on them in the ItemUploader documentation.

Usage for object-id, source-id and source-name

For this example we will use the CSV file again. In order to load the items into an existing source we should use one of the arguments: object-id, source-id or source-name. When using the object-id and source-id arguments, we first need to get these IDs from the cluster, using get_user_objects() and get_object_subscriptions(), if we do not already know them. Alternatively, we can use --source-name, which can be obtained more simply by checking the Data tab in Squirro.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
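A sketch of a load into the existing source created earlier, addressed by its name (placeholders and mapping flags assumed as in the previous examples):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject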

Usage for bulk-index, bulk-index-add-summary-from-body, batch-size

The ItemUploader arguments bulk-index and bulk-index-add-summary-from-body are used for indexing data in bulk. The batch-size argument tells the ItemUploader the number of items to send in one request. Depending on the setup, this should be lower than 500; it defaults to auto.

You can find more information on those in the ItemUploader documentation.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
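A sketch of a bulk-indexed load (placeholders and mapping flags as before; the batch size of 100 is an arbitrary illustrative value):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --bulk-index ^
    --bulk-index-add-summary-from-body ^
    --batch-size 100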

Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size

When the --parallel-uploaders parameter is set to a value greater than 1 (the default is 1), the Data Loader loads data in parallel. In the next usage example, --parallel-uploaders is set to 5; assuming we have 1000 rows to load, each executor will upload 200 rows. If we also set --source-batch-size to 50, the data is first split into batches of 50 rows each (that is, 20 batches) and each batch is then distributed across the 5 executors, so each executor uploads 10 rows at a time until all rows are uploaded.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
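A sketch using 5 parallel uploaders and a source batch size of 50, as in the scenario described above (placeholders and mapping flags as before):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --parallel-uploaders 5 ^
    --source-batch-size 50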

The SQLite metadata database is used for the incremental loads and the locking mechanism. By default, the database file is created in the user home folder with the name sq_data_load.db, but this can be changed with the following parameters:

  • --meta-db-dir sets the directory

  • --meta-db-file sets the name of the SQLite metadata database

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
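A sketch overriding the metadata database location and file name (the directory and file name are arbitrary illustrative values; placeholders as before):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --meta-db-dir C:\tmp\loader_meta ^
    --meta-db-file my_data_load.db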

Facets

In this chapter we will describe the use and functionality of the facets configuration file. The facets/keywords are used for filtering data in Squirro.

This file will subsequently be used for three scenarios:

  • Loading the facets/keywords into Squirro

  • Formatting the data coming from the source (for data types int, float and dates)

  • Mapping the source columns to the Squirro facets

The facet configuration file contains the attributes of the columns mapped as facets in Squirro and uses the hjson format, a more relaxed and easier-to-read variant of JSON. To run the Data Loader script with the facet information, we need to use the --facets-file parameter with the full path to the JSON file.

Facet functionality

Facet creation on Squirro Server

When creating or updating facets in Squirro, the following attributes are taken into account (and subsequently added/updated):

  • name

  • group_name

  • display_name

  • data_type

  • visible

  • searchable

  • format_instr 

The attributes used for formatting of the item:

  • default_value

  • data_type

  • input_format_string

  • delimiter

Attribute used for skipping columns: 

  • import

Attributes used for access control:

  • auth

  • auth_value_mandatory

  • delimiter

Pivotal attributes:

  • pivotal_group

  • delimiter

You can find more information on these attributes in the Data Loader Facet Config Reference.

Below is a basic example of the facets configuration file and the mapping between the source fields and the facet names. We will add attributes to the file as we describe more functionality. The "name" attribute, and not the key of each facet dictionary, is what identifies a facet in Squirro; only if the "name" attribute is missing is the facet identified by the dictionary key. The key of each facet dictionary is in fact the field name from the data source.

config/sample_facets.json
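A minimal sketch of such a file, mapping a few columns from the sample CSV; the facet names, data types and the date format string are illustrative:

{
    "InteractionSubject": {
        "name": "Subject",
        "display_name": "Subject",
        "data_type": "string",
        "visible": true,
        "searchable": true
    },
    "InteractionType": {
        "name": "Type",
        "data_type": "string"
    },
    "Date": {
        "name": "InteractionDate",
        "data_type": "datetime",
        "input_format_string": "%Y-%m-%dT%H:%M:%S"
    }
}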

And the command line call:

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
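A sketch of the corresponding call (placeholders and mapping flags assumed as in the earlier examples):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --facets-file config/sample_facets.json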

You can find the loaded facets in the Data >> Facets tab. You can see below the Facets tab before loading and after:





You can also see the new facets in the Search tab. Notice that the name of the selected facet is 'Subject', which is the "display_name" attribute of the facet, coming from the "InteractionSubject" field in the data source (the key of the facet dictionary). The two distinct values of the subject field are shown: 'Tesla' and 'PayPal'.



Data formatting

You may have noticed the formatting applied to the datetime data type via "input_format_string", but there are more formatting options. The "input_format_string" attribute is applied to the data loaded into Squirro, more precisely to the date/datetime data types, while "format_instr" is applied to facet values displayed in Squirro.

  • input_format_string: the format of the date/datetime strings coming from the source (used for CSV and Excel; date values coming from a database are already of type datetime and do not need a format).

  • format_instr: this is the formatting instruction for the facet value display. This parameter defines how the Squirro UI will display the value. This is also relevant for int and float data types. It’s also documented here.



Note that "input_format_string" uses Python date format codes, while "format_instr" uses generic date formatting.

Other uses

Other uses of the facets config file are:

  • For Access Control support - “auth” attributes - True/False, default False. One or multiple columns from the source can be used for access control. The values of the specified columns are used to determine user access to the Squirro item.

  • For pivotal columns - the "pivotal_group" attribute - if this attribute exists, then the column is also a pivotal column and is related to all other columns having the same group

  • For specifying if a column will be loaded as a facet or not - "import" attribute - True/False, default True.

The use of these attributes will be exemplified in the following sections.

Facet attributes examples of usage

Wildcards

If there is a case where we want to use the same attribute for multiple facets, we can use the wildcard '*' (see example below) in the key facet name to describe them. This configuration will take all the columns that start with the word “Interaction” from the source and send them to Squirro. According to our sample CSV data these columns will be: InteractionId, InteractionSubject, InteractionType.

If used by itself, the wildcard will load all the remaining facets, the ones that are not specified by name/key in the facets config file. Also, the '*' can be used only at the end of the column name (e.g. Column* is OK, C*lumn is NOT).

Another thing to keep in mind is that when using the wildcard on a series of similar fields, the attributes of the field(s) not covered by the expansion of the wildcard (i.e. the ones specified by their full name) are not overwritten by the wildcard attributes. For example, we defined in the config file the field "InteractionId" with “visible” and “searchable” attributes set to “true” and "Interaction*"  with the same attributes set to “false”. After the expansion of the wildcard the "InteractionId" facet will still have “visible” and “searchable” attributes set to “true” and only "InteractionSubject" and "InteractionType" will have the same attributes set to “false”.
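A sketch of such a configuration, following the description above (the facet keys come from the sample CSV data):

{
    "InteractionId": {
        "visible": true,
        "searchable": true
    },
    "Interaction*": {
        "visible": false,
        "searchable": false
    }
}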

Skipping columns from source

If we do not want to load some of the columns from the source as facets or we are using these columns only for data manipulation, we can use the following:

  • “import”: false

  • {Facet_key: false} - shortcut for "import": false

  • Do not add the column at all in the facets config file

Note: the data from these columns will still be loaded into Squirro; only the names of these columns will not be loaded as facets/keywords.

In the example below, neither of the two columns (Notes and InternalAttendees) will be loaded into Squirro as a facet.
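A sketch showing both variants, the explicit "import": false attribute and its shortcut form:

{
    "Notes": {
        "import": false
    },
    "InternalAttendees": false
}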

Access Control

Another use of "import": false is with access-controlled columns, since we do not normally want these columns to show up as keywords:

The "auth": true specifies that this column is used for acces control and “auth_value_mandatory”: true if value is empty the item will not be uploaded.

Adding columns in pipelets

The pipelets are mostly used in the load to alter or generate new data based on the input. The columns generated using the pipelets can be defined in the facets configuration file in the same way as a source column. The only difference is that the formatting (value splitting, date reformat, string to int/float conversions, applying default values) is done at the end of the process.

Note1: The pipelet's consume method is called for each row, meaning that it has access only to that row; it cannot aggregate data from multiple rows.

Note2: For the columns added from pipelets and not configured as facets, default facets of type string will be created on the first load.

Default fields

Another attribute that can be set for a column in the facets file is the default value. If a default value is specified, it replaces all the null values coming from the data source when the item is loaded into Squirro.

Note: Zero values are not replaced with the default value.

Columns containing separators

The “delimiter” attribute can be specified for certain columns that have multiple values. For example, in our CSV file the following two columns have multiple values: InternalAttendees and EntittledUsers. For both these columns the values are separated by a semicolon (“;”). In this case, the facets config file will look as follows:
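A sketch of the corresponding configuration (the facet names are illustrative):

{
    "InternalAttendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "EntittledUsers": {
        "name": "EntitledUsers",
        "delimiter": ";"
    }
}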

Data manipulation

Templating and pivot columns

The body, title and summary/abstract of the Squirro item can be formatted. This is done by applying Jinja2 templates, using template files supplied as arguments. The following arguments can be supplied from the command line for templating:

  • --body-template-file

  • --title-template-file

  • --abstract-template-file

From the Jinja templates, we have access to the source data through three variables:

row (Python dictionary)

Data as extracted from the source, in {key: value} format, where the key is the column name and the value is the value of that column for one row.

Example accessing the value of a column:
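For example, a minimal Jinja2 expression reading a column of the sample CSV (the column name is taken from the sample data):

{{ row['InteractionSubject'] }}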



item (Python dictionary)

State of the Squirro item after applying the mapping and data conversions, before pipelets are run.

Example accessing a facet from the item:
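A sketch, assuming the facet values are stored as lists on the item and the facet is named 'Subject' as in the facets example above:

{{ item.keywords['Subject'][0] }}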



pivot_data (Python dictionary)

Computed pivot data. If two tables are joined in a 1-to-n relation, the n rows are pivoted to generate a single line in the source. The pivot logic of the Data Loader recreates the rows from the joined table so that they can be used in the Jinja template. The source extraction logic can be very complex and multiple joins with pivoted columns can be used. In order to differentiate columns coming from different joins, the tool uses groups: if two columns belong to the same group, they were selected from the same joined table.

Example:

Source tables joined on InteractionID column:



Result is:

Data Loader pivot_data for the first line is:



The pivot columns are described using the facets configuration file. The attributes describing a column as pivotal are:

  • pivotal_group: the name of the group

  • delimiter: usually the pivot columns must be split (we covered this attribute in the section Columns containing separators)

Example of usage in the template html file:



For this example we will use the CSV file again and we will format the body of the item using the --body-template-file argument.

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
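A sketch of the call with a body template (the template file name is illustrative; placeholders and mapping flags assumed as before):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --facets-file config/sample_facets.json ^
    --body-template-file templates/template_body.html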

Note: the name that we use for the facet in the "name" attribute of the facets config file has to be identical to the name used in the item.keywords tag in the template file.

The HTML template file:
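A minimal sketch of such a template, building the body from columns of the sample CSV via the row variable (the layout and the chosen fields are illustrative):

<html>
  <body>
    <p><b>Type:</b> {{ row['InteractionType'] }}</p>
    <p><b>Notes:</b> {{ row['Notes'] }}</p>
    <p><b>Attendees:</b> {{ row['InternalAttendees'] }}</p>
  </body>
</html>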

And the output in Squirro looks like this:

We can also have pivot columns in our data source. This script has the ability to expand these columns and use them to enrich the Squirro item using the templating functionality.

The original CSV source file was slightly modified to accommodate two pivotal columns: Notes and InternalAttendees.

In the facets config file, the two column descriptions look like below: 

config/sample_facets.json
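A sketch of those two column descriptions; the pivotal group name and the facet names are illustrative:

{
    "Notes": {
        "name": "Notes",
        "pivotal_group": "interaction_details",
        "delimiter": ";"
    },
    "InternalAttendees": {
        "name": "Attendees",
        "pivotal_group": "interaction_details",
        "delimiter": ";"
    }
}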

And the template file:

In Squirro, the items with pivoted columns will look like this:



Pipelets

The Data Loader has the ability to run multiple pipelets before or after the templating step. Pipelets are external Python classes that are able to modify the Squirro item. The pipelets are instantiated once for each parallel process, but the code inside is executed once for each item (it is not recommended to open files in the consume() method of the pipelet; do this in __init__() instead).

The pipelets' location, input arguments and execution order are described in a JSON file supplied to the loader script via the --pipelets-file argument. You can find more information on the attributes used in this file in the Data Loader Facet Config Reference. The --pipelets-error-behavior argument specifies the job's behaviour in case a pipelet raises an exception. The default is "error": in case of an error, the load will stop. The other option is "warning", which only logs a warning and continues the load.



Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
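A sketch of a call running pipelets (the error behaviour is set to "warning" here purely for illustration; placeholders and mapping flags assumed as before):

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name csv_sample ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-id InteractionID ^
    --map-title InteractionSubject ^
    --facets-file config/sample_facets.json ^
    --pipelets-file config/sample_pipelets.json ^
    --pipelets-error-behavior warning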

We will use the following pipelets config file:

config/sample_pipelets.json

In the Python pipelet script, the consume() method simply appends the name 'Squirro' to the title of the item:

sample_pipelet.py
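A minimal sketch of such a pipelet, assuming the standard PipeletV1 base class from the Squirro SDK:

from squirro.sdk import PipeletV1


class AddSquirroToTitlePipelet(PipeletV1):
    """Appends the word 'Squirro' to the title of every item."""

    def consume(self, item):
        # 'item' is the Squirro item dictionary built by the Data Loader.
        item['title'] = '%s Squirro' % item.get('title', '')
        return item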

  The result of our simple pipelet is shown below:

 



Complex job setup

You can find below a more complex example of a database load along with the configuration files, the template file, the python module and the command line parameters.

In this example we will incrementally load a few rows from a database table, using two parallel executors. We will be using a template file with a pivotal table for formatting the item body. The id, abstract, created_at and title fields will be mapped from the command line arguments. The facets will be configured in the JSON configuration file sample_facets_pivot.json. We will also create a simple pipelet that adds the keyword "comments" to the item and assigns a text to it in the item body. The SQLite metadata database will also be set from the command line.

The database source:

The command line:

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
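A sketch of such a call, combining the arguments described above. The --db-connection flag, its connection string, the metadata database location and all connection values are assumptions; the file names follow the ones referenced in this section:

squirro_data_load -v ^
    --token %TOKEN% --cluster %CLUSTER% --project-id %PROJECT_ID% ^
    --source-name complex_db_sample ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/dbname" ^
    --input-file interactions.sql ^
    --incremental-column DateModifiedOn ^
    --parallel-uploaders 2 ^
    --map-id InteractionId ^
    --map-title InteractionSubject ^
    --map-abstract InteractionType ^
    --map-created-at Date ^
    --body-template-file template/template_body_interaction.html ^
    --facets-file config/sample_facets_pivot.json ^
    --pipelets-file config/sample_pipelets.json ^
    --meta-db-dir C:\tmp\loader_meta ^
    --meta-db-file sq_data_load_complex.db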

The contents of the interactions.sql:

interactions.sql

The template Jinja2 file is shown below. The column "comments", coming from the pipelet, is added to the item body.

template/template_body_interaction.html

The complete facets configuration JSON file. Notice how the column "comments", coming from the pipelet, was added to this JSON file without being present in the source.

config/sample_facets_pivot.json

 And the pipelets configuration file:

config/sample_pipelets.json

The consume() method for the pipelet:

sample_pipelet.py

The output in Squirro:

And the created facets:



And finally the contents of the log file, log.txt:

log.txt