Data Loader Examples
This page shows a number of examples for using the Data Loader.
For a full reference documentation please see Data Loader Reference. For a step-by-step tutorial, please refer to the Data Loader Tutorial.
Table of Contents
- 1 Sample data
- 2 Data mapping
- 2.1 Usage
- 3 Simple usage and flow control
- 3.1 Data source arguments: source-name, source-path, source-type and source-file (input-file)
- 3.1.1 CSV Example
- 3.1.2 Excel Example
- 3.1.3 SQL Example
- 3.1.4 interaction.sql (used for SQL example)
- 3.2 Database reset and incremental-column arguments
- 3.3 ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size
- 3.4 Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size
- 3.5 Metadata database related arguments
- 4 Facets
- 4.1 Facet functionality
- 4.1.1 Facet creation on Squirro Server
- 4.1.1.1 config/sample_facets.json
- 4.1.2 Data formatting
- 4.2 Facet attributes examples of usage
- 4.2.1 Wildcards
- 4.2.2 Skipping columns from source
- 4.2.3 Access Control
- 4.2.4 Adding columns in pipelets
- 4.2.5 Default fields
- 4.2.6 Columns containing separators
- 5 Data manipulation
- 6 Complex job setup
Sample data
The data loader tool requires a tabular input format (CSV, Excel or database connection). All three data sources are explained below and there is an example for each.
Excel
The following sample Excel file is used to highlight the Squirro Data Loader capabilities for loading an Excel file. The file contains a single header row and three data rows.
Example: interaction.xlsx
CSV
The following sample CSV file is used to highlight the Squirro Data Loader capabilities for loading a CSV file. The file contains a single header row and three data rows. The sample file uses the following formatting options:
Field delimiter: , (comma)
Delimiter used within fields that contain multiple values: ; (semicolon)
Example:
InteractionID,InteractionSubject,InteractionType,Date,Notes,InternalAttendees,NoAtendees,EntittledUsers,DateModifiedOn
100AAA,Tesla,Phone Call,2015-4-23T00:00:00,note1,Joe Average;Jane Houseowner;Elon Musk,3,jdoe;jhouseowner;emusk,4/23/2015 16:58
102BBB,PayPal,Meeting,2015-4-24T00:00:00,note2,Joe Average; Elon Musk,2,jdoe;emusk,4/24/2015 16:58
105CCC,Tesla,Meeting,2015-4-25T00:00:00,note3,Joe Average; Jane Houseowner,2,jdoe;jhouseowner,4/25/2015 16:58
Database
The following sample INSERT statements are used to highlight the capabilities of the Data Loader for loading from a database table. The example below is for an SQL Server database.
The database table creation:
CREATE TABLE Interaction_test
(
InteractionId varchar(10) NOT NULL,
InteractionSubject varchar(500) NOT NULL,
InteractionType varchar(100) NOT NULL,
Date date,
Notes varchar(max),
InternalAttendees varchar(500),
NoAttendees int,
EntittledUsers varchar(100),
DateModifiedOn datetime NOT NULL
);
Example (DB INSERTS):
INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('100AAA', 'Tesla', 'Phone Call', '4/23/2015', 'note1', 'Joe Average;Jane Houseowner;Elon Musk', 3, 'jdoe;jhouseowner;emusk', '4/23/2015 16:58');
INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('102BBB', 'PayPal', 'Meeting', '4/24/2015', 'note2', 'Joe Average;Elon Musk', 2, 'jdoe;emusk', '4/24/2015 16:58');
INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('105CCC', 'Tesla', 'Meeting', '4/25/2015', 'note3', 'Joe Average;Jane Houseowner', 2, 'jdoe;jhouseowner', '4/25/2015 16:58');
Data mapping
Some (or all) of the source columns can be mapped to the Squirro item fields using command line arguments. For items to be loaded into Squirro, either the title or the body needs to be specified; all the other fields are optional. Also, the --map-flag and --map-id parameters are mutually exclusive.
Field Name | Mandatory | Description | Example
---|---|---|---
title | Yes* | Item title. | “Conversation1”
abstract | No | Item summary. | “Phone call”
created_at | No | Item creation date and time (see Date and Time). | 02/19/2016
id | No | External item identifier. Maximum length is 256 characters. | 100AAA
body | Yes* | HTML markup with the full representation of the item content. | Note1
url | No | URL which points to the item origin. |
file_name | No | LOB file name. |
file_mime | No | LOB file MIME type. |
file_data | No | |
file_compressed | No | |
flag | No | |
*It is mandatory that at least one of the title or body fields is mapped.
Usage
Below is a simple example of basic mapping. For this example we use the interaction.csv file provided in the Sample Data section. We map some of the fields from the CSV file to the following item fields: title, id, abstract, created_at and three fields for the item body.
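A minimal sketch of such a call is shown below. The connection arguments (--token, --cluster, --project-id), the placeholder values and the choice of body columns are illustrative; the --map-* options follow the field names from the table above, and passing three columns to --map-body is assumed from the text.
squirro_data_load -v ^
    --cluster %CLUSTER% ^
    --token %TOKEN% ^
    --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-id InteractionID ^
    --map-abstract InteractionType ^
    --map-created-at Date ^
    --map-body Notes InteractionType InternalAttendees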
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Simple usage and flow control
This section describes and shows the usage of the arguments that influence the type of load.
Data source arguments: source-name, source-path, source-type and source-file (input-file)
There are three load examples below, one for each source type: CSV, Excel and database. The --source-name argument sets the name of the source in Squirro. The --source-type argument tells the Data Loader what kind of source is being loaded. If the --source-script parameter is set, it points to the location of the DataSource Python script. Note that the --source-type and --source-script arguments are mutually exclusive.
The --source-file argument specifies the path to the file from which the data is loaded. For database loads, the --input-file argument is used instead to specify the SQL statement used to retrieve the data from the database.
CSV Example
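A sketch of a plain CSV load; the connection arguments and the source name are placeholders.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes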
Excel Example
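The Excel load only differs in the source type and source file; again a sketch with placeholder connection arguments.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions Excel" ^
    --source-type excel ^
    --source-file interaction.xlsx ^
    --map-title InteractionSubject ^
    --map-body Notes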
SQL Example
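For the database load, --input-file points to the SQL statement instead of --source-file. The --db-connection argument and its SQLAlchemy-style connection string shown here are assumptions that depend on your database setup.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions DB" ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/InteractionsDB" ^
    --input-file interaction.sql ^
    --map-title InteractionSubject ^
    --map-body Notes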
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
interaction.sql (used for SQL example)
Note: the SQL file must contain only one query and it shouldn't end with a semicolon (;).
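A plausible content for interaction.sql, selecting from the Interaction_test table defined above (note the absence of a trailing semicolon):
SELECT InteractionId,
       InteractionSubject,
       InteractionType,
       Date,
       Notes,
       InternalAttendees,
       NoAttendees,
       EntittledUsers,
       DateModifiedOn
FROM Interaction_test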
In Squirro, in the Data tab, you can see the result of the loads from the three different sources:
Database reset and incremental-column arguments
For database loads, an incremental load can be executed. In the example below, an incremental load is performed using the --incremental-column argument to specify the date field used as the incremental reference.
First run the initial load, which will be a full load.
Note: The full load doesn't use the --incremental-column argument. If an incremental load is executed right after such a full load, it will perform an initial load and re-index everything, since there is no information on the last row loaded by the full load.
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
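A sketch of the load command using --incremental-column, with the same placeholder connection and database arguments as in the SQL example above. The first execution behaves as a full load; subsequent executions only pick up rows with a newer DateModifiedOn value.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions DB" ^
    --source-type database ^
    --db-connection "mssql+pymssql://user:password@dbserver/InteractionsDB" ^
    --input-file interaction.sql ^
    --incremental-column DateModifiedOn ^
    --map-title InteractionSubject ^
    --map-body Notes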
Another row was inserted in the meantime:
When re-executing the incremental load, only one row should be loaded into Squirro: the row that was just inserted into the database.
When the --reset argument is used (together with --incremental-column), the --incremental-column argument is ignored and a full load is performed, so all three rows are loaded:
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size
These arguments can be set on the command line and are passed directly to the ItemUploader module; they are not modified anywhere in the Data Loader. You can find more information on them in the ItemUploader documentation.
Usage for object-id, source-id and source-name
For this example we will use the CSV file again. To load the items into an existing source, we should use one of the following arguments: object-id, source-id or source-name. When using the object-id and source-id arguments, we first need to get these IDs from the cluster using get_user_objects() and get_object_subscriptions(), unless we already know them. Alternatively, we can use --source-name, which is simpler to obtain by checking the Data tab in Squirro.
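A sketch of loading into an existing source by its name; alternatively --object-id or --source-id could be passed with the IDs retrieved from the cluster. Connection arguments are placeholders.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes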
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Usage for bulk-index, bulk-index-add-summary-from-body, batch-size
The ItemUploader arguments bulk-index and bulk-index-add-summary-from-body are used for indexing data in bulk. The batch-size argument tells the ItemUploader the number of items to send in one request. This should be lower than 500, depending on the setup, and it defaults to auto.
You can find more information on those in the ItemUploader documentation.
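A sketch adding the bulk-indexing arguments to the CSV load. Whether --bulk-index and --bulk-index-add-summary-from-body are plain switches, and the batch size of 100, are assumptions; see the ItemUploader documentation for the exact semantics.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes ^
    --bulk-index ^
    --bulk-index-add-summary-from-body ^
    --batch-size 100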
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size
When the --parallel-uploaders parameter is set (the default is 1), the Data Loader loads data in parallel. In the next usage example, --parallel-uploaders is set to 5; assuming we have 1000 rows to load, each executor will upload 200 rows. If we also set --source-batch-size to 50, the data is first split into batches of 50 rows each (that is, 20 batches), and each batch is then distributed across the 5 executors, so each executor uploads 10 rows at a time until all rows are uploaded.
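A sketch matching the numbers used above (5 parallel uploaders, source batches of 50 rows); connection arguments are placeholders.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes ^
    --parallel-uploaders 5 ^
    --source-batch-size 50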
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Metadata database related arguments
The SQLite metadata database is used for incremental loads and the locking mechanism. By default, the database file is created in the user's home folder with the name sq_data_load.db, but this can be changed with the following parameters:
--meta-db-dir sets the directory
--meta-db-file sets the name of the SQLite metadata database file
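A sketch overriding the metadata database location and file name; the directory and file name shown are arbitrary examples.
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes ^
    --meta-db-dir C:\squirro\metadata ^
    --meta-db-file interactions_load.db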
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Facets
In this chapter we will describe the use and functionality of the facets configuration file. The facets/keywords are used for filtering data in Squirro.
This file will subsequently be used for three scenarios:
Loading the facets/keywords into Squirro
Formatting the data coming from the source (for data types int, float and dates)
Mapping the source columns to the Squirro facets
The facet configuration file contains the attributes of the columns mapped as facets in Squirro and uses the hjson format, a more relaxed and easier-to-read variant of JSON. To run the Data Loader script with the facet information, use the --facets-file parameter with the full path to the JSON file.
Facet functionality
Facet creation on Squirro Server
When creating or updating facets in Squirro, the following attributes are taken into account (and subsequently added/updated):
name
group_name
display_name
data_type
visible
searchable
format_instr
The attributes used for formatting of the item:
default_value
data_type
input_format_string
delimiter
Attribute used for skipping columns:
import
Attributes used for access control:
auth
auth_value_mandatory
delimiter
Pivotal attributes:
pivotal_group
delimiter
You can find more information on these attributes in the Data Loader Facet Config Reference.
Below is a basic example of the facets configuration file and the mapping between the source fields and the facet names. We will add attributes to this file as more functionality is described. A facet is identified in Squirro by its "name" attribute, not by the key of its facet dictionary; only when the "name" attribute is missing is the facet identified by the dictionary key. The key of each facet dictionary is in fact the field name from the data source.
config/sample_facets.json
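A minimal sketch of what this file could look like for the CSV sample; the facet names, display names and the date format string are illustrative, while the dictionary keys are the source column names.
{
    "InteractionSubject": {
        "name": "Subject",
        "display_name": "Subject",
        "data_type": "string",
        "visible": true,
        "searchable": true
    },
    "InteractionType": {
        "name": "Type",
        "data_type": "string"
    },
    "Date": {
        "name": "InteractionDate",
        "data_type": "datetime",
        "input_format_string": "%Y-%m-%dT%H:%M:%S"
    }
}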
And the command line call:
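A sketch of the call with the facets file added (placeholders as before):
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --map-body Notes ^
    --facets-file config\sample_facets.json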
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
You can find the loaded facets in the Data >> Facets tab. You can see below the Facets tab before loading and after:
You can also see the new facets in the Search tab. Notice that the name of the selected facet is ‘Subject’ - the “display_name” attribute of the facet - coming from the "InteractionSubject" field in the data source which is the key of the facet dictionary. And the 2 distinct values of the subject field are shown: ‘Tesla’ and ‘PayPal’.
Data formatting
You may have noticed the formatting of the datetime data type with "input_format_string", but there are more formatting options. The "input_format_string" attribute is applied to the data loaded into Squirro, more precisely to the date/datetime data types, while "format_instr" is applied to facet values displayed in Squirro.
input_format_string: the format of the date/datetime strings coming from the source (used for CSV and Excel; date values coming from a database are already of type datetime and don't need a format).
format_instr: the formatting instruction for displaying the facet value. This parameter defines how the Squirro UI will display the value and is also relevant for the int and float data types. It's also documented here.
Note that "input_format_string" must be a Python date format string, whereas "format_instr" uses generic date formatting.
Other uses
Other uses of the facets config file are:
For access control support - the "auth" attributes - True/False, default False. One or multiple columns from the source can be used for access control. The values of the specified columns are used to determine user access to the Squirro item.
For pivotal columns - the "pivotal_group" attribute - if this attribute exists, then the column is also a pivotal column and is related to all other columns having the same group.
For specifying whether a column will be loaded as a facet or not - the "import" attribute - True/False, default True.
The use of these attributes will be exemplified in the following sections.
Facet attributes examples of usage
Wildcards
If we want to use the same attributes for multiple facets, we can use the wildcard '*' in the facet key name to describe them (see the example below). This configuration will take all the columns from the source that start with the word "Interaction" and send them to Squirro. According to our sample CSV data these columns are: InteractionId, InteractionSubject and InteractionType.
If used by itself, the wildcard will load all the remaining facets, i.e. the ones that are not specified by name/key in the facets config file. Also, the '*' can only be used at the end of the column name (e.g. Column* is OK, Co*lumn is NOT).
Another thing to keep in mind is that, when using the wildcard on a series of similar fields, the attributes of the field(s) not covered by the expansion of the wildcard (i.e. the ones specified by their full name) are not overwritten by the wildcard attributes. For example, we define in the config file the field "InteractionId" with the "visible" and "searchable" attributes set to "true" and "Interaction*" with the same attributes set to "false". After the expansion of the wildcard, the "InteractionId" facet will still have its "visible" and "searchable" attributes set to "true", and only "InteractionSubject" and "InteractionType" will have them set to "false".
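A sketch of the configuration described above:
{
    "InteractionId": {
        "visible": true,
        "searchable": true
    },
    "Interaction*": {
        "visible": false,
        "searchable": false
    }
}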
Skipping columns from source
If we do not want to load some of the columns from the source as facets, or we are using those columns only for data manipulation, we can use one of the following:
"import": false
{Facet_key: false} - shortcut for "import": false
Do not add the column to the facets config file at all
Note: the data from these columns will still be loaded into Squirro; only the names of these columns will not be loaded as facets/keywords.
In the example below, neither of the two columns (Notes and InternalAttendees) will be loaded into Squirro as a facet.
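A sketch showing both ways of skipping a column: the full "import": false form and its shortcut.
{
    "Notes": {
        "import": false
    },
    "InternalAttendees": false
}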
Access Control
Another use of "import": false is with access-controlled columns, since we don't normally want these columns to show up as keywords:
Setting "auth": true specifies that the column is used for access control; with "auth_value_mandatory": true, an item whose value in this column is empty will not be uploaded.
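A sketch of an access-controlled column, using the EntittledUsers column from the sample data; the delimiter matches the semicolon-separated values in that column.
{
    "EntittledUsers": {
        "auth": true,
        "auth_value_mandatory": true,
        "delimiter": ";",
        "import": false
    }
}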
Adding columns in pipelets
The pipelets are mostly used in the load to alter or generate new data based on the input. The columns generated using the pipelets can be defined in the facets configuration file in the same way as a source column. The only difference is that the formatting (value splitting, date reformat, string to int/float conversions, applying default values) is done at the end of the process.
Note 1: The pipelet's consume method is called for each row, meaning that it has access only to that row and cannot aggregate data across multiple rows.
Note 2: For columns that are added from pipelets and not configured as facets, default facets of type string are created on the first load.
Default fields
Another attribute that can be set for a column in the facets file is the default value. If a default value is specified, it replaces all the null values coming from the data source when the item is loaded into Squirro.
Note: Zero values are not replaced with the default value.
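A sketch of a default value on a string column; the chosen default is arbitrary.
{
    "InteractionType": {
        "name": "Type",
        "data_type": "string",
        "default_value": "Unknown"
    }
}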
Columns containing separators
The “delimiter” attribute can be specified for certain columns that have multiple values. For example, in our CSV file the following two columns have multiple values: InternalAttendees and EntittledUsers. For both these columns the values are separated by a semicolon (“;”). In this case, the facets config file will look as follows:
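A sketch of the two multi-value columns with the "delimiter" attribute; the facet name for the attendees column is illustrative.
{
    "InternalAttendees": {
        "name": "Attendees",
        "delimiter": ";"
    },
    "EntittledUsers": {
        "delimiter": ";"
    }
}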
Data manipulation
Templating and pivot columns
The body, title and summary/abstract of the Squirro item can be formatted. This is possible by applying a Jinja2 template and using templates files supplied as arguments. The following arguments can be supplied from the command line for templating:
--body-template-file
--title-template-file
--abstract-template-file
From the Jinja templates, we have access to the source data using 3 variables:
Variable Name | Type | Description
---|---|---
row | Python dictionary | Data as extracted from the source, in the format {key: value}, where the key is the column name and the value is the value of that column for the current row.
item | Python dictionary | State of the Squirro item after applying the mapping and data conversions, but before the pipelets run.
pivot_data | Python dictionary | Computed pivot data. If two tables are joined in a 1-to-n relation, the n rows are pivoted so that they generate only one row in the source. The pivot logic of the Data Loader recreates the rows from the joined table so that they can be used in the Jinja template. The source extraction logic can be very complex, and multiple joins with pivoted columns can be used. To differentiate columns coming from different joins, the tool uses groups: if two columns are in the same group, they were selected from the same joined table. The pivot columns are described in the facets configuration file using the pivotal attributes (pivotal_group and delimiter).
Example of usage in the template HTML file:
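A short sketch of how the three variables can be referenced inside a template. The facet name "Type", the pivot group name "notes" and the exact structure of pivot_data are assumptions for illustration.
<!-- raw source row -->
<p>Subject: {{ row['InteractionSubject'] }}</p>
<!-- mapped item, accessing a facet via item.keywords -->
<p>Type: {{ item.keywords.Type }}</p>
<!-- pivoted rows of an assumed group called "notes" -->
{% for entry in pivot_data['notes'] %}
  <p>{{ entry['Notes'] }} ({{ entry['InternalAttendees'] }})</p>
{% endfor %}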
For this example we will use the CSV file again and we will format the body of the item using the --body-template-file argument.
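A sketch of the call (placeholders as before; the template path is an example):
squirro_data_load ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --source-name "Interactions CSV" ^
    --source-type csv ^
    --source-file interaction.csv ^
    --map-title InteractionSubject ^
    --facets-file config\sample_facets.json ^
    --body-template-file template\template_body_interaction.html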
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Note: the facet name set in the "name" attribute of the facets config file has to be identical to the name referenced via item.keywords in the template file.
The HTML template file:
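A minimal sketch of such a template, built from the columns of the CSV sample. Per the note above, item.keywords.Subject refers to the facet whose "name" attribute is Subject; the layout itself is illustrative.
<html>
  <body>
    <p><b>Subject:</b> {{ item.keywords.Subject }}</p>
    <p><b>Type:</b> {{ row['InteractionType'] }}</p>
    <p><b>Notes:</b> {{ row['Notes'] }}</p>
    <p><b>Attendees:</b> {{ row['InternalAttendees'] }}</p>
  </body>
</html>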
And the output in Squirro looks like this:
We can also have pivot columns in our data source. The Data Loader can expand these columns and use them to enrich the Squirro item using the templating functionality.
The original CSV source file was slightly modified to accommodate two pivotal columns: Notes and InternalAttendees.
In the facets config file, the two column descriptions look as follows:
config/sample_facets.json
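A sketch of the two pivotal column descriptions; the group name is illustrative and the semicolon matches the separator used in the sample data.
{
    "Notes": {
        "name": "Notes",
        "pivotal_group": "interaction_notes",
        "delimiter": ";"
    },
    "InternalAttendees": {
        "name": "Attendees",
        "pivotal_group": "interaction_notes",
        "delimiter": ";"
    }
}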
And the template file:
In Squirro, the items with pivoted columns will look like this:
Pipelets
The Data Loader has the ability to run multiple pipelets before or after the templating step. Pipelets are external Python classes that can modify the Squirro item. The pipelets are instantiated once for each parallel process, but the code inside is executed once for each item (it is not recommended to open files in the consume method of the pipelet; do this in init() instead).
The pipelets' location, input arguments and execution order are described in a JSON file supplied to the loader script via the --pipelets-file argument. You can find more information on the attributes used in this file in the Data Loader Facet Config Reference. The --pipelets-error-behavior argument specifies the job's behaviour in case a pipelet raises an exception. The default is "error", in which case the load stops. The other option is "warning", which only logs a warning and continues the load.
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
We will use the following pipelets config file:
config/sample_pipelets.json
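A sketch of what this file could contain; the attribute names used here (file_location, stage, arguments) are assumptions and should be checked against the Data Loader Facet Config Reference.
{
    "AddSquirroToTitlePipelet": {
        "file_location": "sample_pipelet.py",
        "stage": "before templating",
        "arguments": {}
    }
}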
And in the Python pipelet script, the consume method simply adds the name 'Squirro' to the title of the item:
sample_pipelet.py
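A sketch of the pipelet; the class name is arbitrary, and consume() appends 'Squirro' to the item title as described above.
from squirro.sdk import PipeletV1


class AddSquirroToTitlePipelet(PipeletV1):
    """Appends the name 'Squirro' to the title of every item."""

    def consume(self, item):
        # consume() is called once per item (i.e. once per source row)
        item['title'] = u'%s Squirro' % item.get('title', u'')
        return item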
The result of our simple pipelet is shown below:
Complex job setup
You can find below a more complex example of a database load along with the configuration files, the template file, the python module and the command line parameters.
In this example we will incrementally load a few rows from a database table, using 2 parallel executors. We will use a template file with a pivotal table for formatting the item body. The id, abstract, created-at and title fields will be mapped from the command line arguments. The facets will be configured in the JSON configuration file sample_facets_pivot.json. We will also create a simple pipelet that adds the keyword "comments" to the item and assigns a text to it in the item body. The SQLite metadata database will also be set from the command line.
The database source:
The command line:
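A sketch of the full call, combining the arguments discussed above; the connection arguments, the database connection string and the file paths are placeholders.
squirro_data_load -v ^
    --cluster %CLUSTER% --token %TOKEN% --project-id %PROJECT_ID% ^
    --db-connection "mssql+pymssql://user:password@dbserver/InteractionsDB" ^
    --source-type database ^
    --input-file interactions.sql ^
    --source-name "Interactions DB" ^
    --incremental-column DateModifiedOn ^
    --parallel-uploaders 2 ^
    --map-id InteractionId ^
    --map-title InteractionSubject ^
    --map-abstract Notes ^
    --map-created-at Date ^
    --body-template-file template\template_body_interaction.html ^
    --facets-file config\sample_facets_pivot.json ^
    --pipelets-file config\sample_pipelets.json ^
    --meta-db-dir C:\squirro\metadata ^
    --meta-db-file interactions_load.db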
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
The contents of the interactions.sql:
interactions.sql
The template Jinja2 file is shown below. The column "comments", coming from the pipelet is added to the item body.
template/template_body_interaction.html
The complete facets configuration JSON file. Notice how the column "comments", coming from the pipelet, was added to this JSON file without being present in the source.
config/sample_facets.json
And the pipelets configuration file:
config/sample_pipelets.json
The consume() method for the pipelet:
sample_pipelet.py
The output in Squirro:
And the created facets:
And finally the contents of the log file, log.txt: