...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Excerpt |
---|
This page shows a number of examples for using the Data Loader. |
For a full reference documentation please see Data Loader Reference. For a step-by-step tutorial, please refer to the Data Loader Tutorial.
Table of Contents
Table of Contents | ||||
---|---|---|---|---|
|
Sample data
The data loader tool requires a tabular input format (CSV, Excel or database connection). All three data sources are explained below and there is an example for each.
...
The following sample Excel file is used to highlight the Squirro Data Loader capabilities for loading an Excel file. The file contains a single header row and three data rows.
Example:
Office Excel | ||
---|---|---|
|
...
CSV
The following sample CSV file is used to highlight the Squirro Data Loader capabilities for loading a CSV file. The file contains a single header row and three data rows. The sample file uses the following formatting options:
Field delimiter: , (comma)
Text delimiter used for fields with multiple values: ; (semicolon)
Example:
Code Block | ||
---|---|---|
| ||
InteractionID,InteractionSubject,InteractionType,Date,Notes,InternalAttendees,NoAtendees,EntittledUsers,DateModifiedOn 100AAA,Tesla,Phone Call,2015-4-23T00:00:00,note1,Joe Average;Jane Houseowner;Elon Musk,3,jdoe;jhouseowner;emusk,4/23/2015 16:58 102BBB,PayPal,Meeting,2015-4-24T00:00:00,note2,Joe Average; Elon Musk,2,jdoe;emusk,4/24/2015 16:58 105CCC,Tesla,Meeting,2015-4-25T00:00:00,note3,Joe Average; Jane Houseowner,2,jdoe;jhouseowner,4/25/2015 16:58 |
...
Database
The following database sample inserts are used to highlight the capabilities of the Data Loader regarding the load from a database table. The example below is for an SQL Server database.
...
Code Block | ||
---|---|---|
| ||
INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn) VALUES ('100AAA', 'Tesla', 'Phone Call', '4/23/2015', 'note1', 'Joe Average;Jane Houseowner;Elon Musk', 3, 'jdoe;jhouseowner;emusk', '4/23/2015 16:58'); INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn) VALUES ('102BBB', 'PayPal', 'Meeting', '4/24/2015', 'note2', 'Joe Average;Elon Musk', 2, 'jdoe;emusk', '4/24/2015 16:58'); INSERT INTO Interaction_test (InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn) VALUES ('105CCC', 'Tesla', 'Meeting', '4/25/2015', 'note3', 'Joe Average;Jane Houseowner', 2, 'jdoe;jhouseowner', '4/25/2015 16:58'); |
...
Data mapping
Some (or all) of the source columns can be mapped to the Squirro item fields using the command line arguments. In order for items to be loaded into Squirro, either title or body needs to be specified, all the other fields are optional. Also the --map-flag
and --map-id
parameters are mutually exclusive.
Field Name | Mandatory | Description | Example |
title | Yes* | Item title. | “Conversation1” |
abstract | No | Item summary. | “Phone call” |
created_at | No | Item creation date and time (see Date and Time). | 02/19/2016 |
id | No | External item identifier. Maximum length is 256 characters. | 100AAA |
body | Yes* | HTML markup with full representation of the item content. | Note1 |
url | No | URL which points to the item origin. | |
file_name | No | LOB file name | |
file_mime | No | LOB file mime type | |
file_data | No | ||
file_ compressed | No | ||
flag | No |
*It is mandatory that at least one of the title or body fields is mapped.
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name mapping ^ --source-type csv ^ --map-title InteractionSubject ^ --map-id InteractionID ^ --map-abstract Notes ^ --map-body InteractionType InternalAttendees NoAtendees ^ --map-created-at Date ^ --source-file interaction.csv |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
...
Simple usage and flow control
...
The --source-file
argument specifies the path to the file from which we will load the data. For the database load the --input-file is used instead to specify the sql statement used to retrieve data from the database.
CSV Example
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_source ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv |
Excel Example
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name excel_source ^ --source-type excel ^ --map-title InteractionSubject ^ --source-file interaction.xlsx |
SQL Example
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --db-connection %db_connection% ^ --source-name db_source ^ --source-type database ^ --map-title InteractionSubject ^ --input-file interaction.sql |
...
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
...
...
interaction.sql (used for SQL example)
Code Block | ||
---|---|---|
| ||
SELECT * FROM Interaction_test WHERE InteractionSubject = 'Tesla' |
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --db-connection %db_connection% ^ --project-id %project_id% ^ --source-name db_source ^ --source-type database ^ --incremental-column Date ^ --map-title InteractionSubject ^ --input-file interaction.sql |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Another row was inserted in the meantime:
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --db-connection %db_connection% ^ --project-id %project_id% ^ --source-name db_source ^ --source-type database ^ --incremental-column Date ^ --reset ^ --map-title InteractionSubject ^ --input-file interaction.sql |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --object-id 3rd8UDYpQwS8vvFoT3JVWA ^ --source-id yYPR8DA1SPyhTbEXIH0HYw ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Usage for bulk-index, bulk-index-add-summary-from-body, batch-size
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name bulk_indexing_source ^ --source-type csv ^ --bulk-index ^ --bulk-index-add-summary-from-body ^ --batch-size 70 ^ --map-title InteractionSubject ^ --map-body Notes ^ --source-file interaction.csv |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_source ^ --parallel-uploaders 5 ^ --source-batch-size 50 ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Metadata database related arguments
The SQLlite metadata database is used for the incremental loads and the locking mechanism. By default, the database file is created in the user home folder and has the name sq_data_load.db but this can be changed with the following parameters:
--meta-db-dir
sets the directory--meta-db-file
sets the name of the SQLlite metadata database
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_source ^ --meta-db-dir %meta_dir% ^ --meta-db-file %meta_file% ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Facets
In this chapter we will describe the use and functionality of the facets configuration file. The facets/keywords are used for filtering data in Squirro.
This file will subsequently be used for three scenarios:
Loading the facets/keywords into Squirro
Formatting the data coming from the source (for data types int, float and dates)
Mapping the source columns to the Squirro facets
The facet configuration file contains the attributes of the columns mapped as facets in Squirro and has a hjson format, which is a more relaxed and easier to read json format. To run the Data Loader script with the facet information, we need to use the --facets-file
parameter with the full path to the JSON file.
...
When creating or updating facets in Squirro, the following attributes are taken into account (and subsequently added/updated):
name
group_name
display_name
data_type
visible
searchable
format_instr
The attributes used for formatting of the item:
default_value
data_type
input_format_string
delimiter
Attribute used for skipping columns:
import
Attributes used for access control:
auth
auth_value_mandatory
delimiter
Pivotal attributes:
pivotal_group
delimiter
You can find more information on these attributes in the Data Loader Facet Config Reference.
Below is a basic example for the facets configuration file and the mapping between the source field and the facet names. We will add attributes to the file as we describe more and more functionality. The “name” attribute is the one that identifies a facet in Squirro and not the key of each facet dictionary, unless the “name” attribute is missing and then the facet is identified by the dictionary key. The key of each facet dictionary is in fact the field name from the data source.
...
...
config/sample_facets.json
Code Block | ||
---|---|---|
| ||
{ "InteractionSubject": { "name": "Interaction Subject", "display_name": "Subject", "group_name": "Interactions", "visible": true, "searchable": true }, "InteractionType": { "name": "Type of Interaction", "display_name": "Type", "group_name": "Interactions", "visible": true, "searchable": true }, "Date": { "name": "Date of creation", "data_type": "datetime", "input_format_string": "%Y-%m-%dT%H:%M:%S", "visible": true }, "Notes": { "name": "Notes", "visible": false, "searchable": true }, "InternalAttendees": { "name": "Attendees", "group_name": "Interactions" "visible": true }, "NoAtendees": { "name": "Number of Attendees", "data_type": "int", "visible": true "searchable": false }, } |
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_interactions ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv ^ --facets-file config/sample_facets.json |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
You can find the loaded facets in the Data >> Facets tab. You can see below the Facets tab before loading and after:
...
You may have noticed the formatting on the datetime data type with the "input_format_string" but there is even more formatting. The attribute "input_format_string" is applied to the data loaded into Squirro, more precisely to the date/datetime data types and the "format_instr" is applied on facet values in Squirro.
input_format_string: is the format of the date/datetime strings coming from the source (used for csv and excel, the date values coming from the database are of type datetime and don’t need a format).
format_instr: this is the formatting instruction for the facet value display. This parameter defines how the Squirro UI will display the value. This is also relevant for int and float data types. It’s also documented here.
Code Block | ||
---|---|---|
| ||
"Date": { "name": "Date of creation", "data_type": "datetime", "input_format_string": "%Y-%m-%dT%H:%M:%S", "format_instr": "YYYY-MM-DD", "visible": true } |
Note that the "input_format_string" should have a Python date format and "format_instr" is just generic date formatting.
Other uses
Other uses of the facets config file are:
For Access Control support - “auth” attributes - True/False, default False. One or multiple columns from the source can be used for access control. The values of the specified columns are used to determine user access to the Squirro item.
For pivotal columns -“pivotal_group” attribute - if this attribute exists than the column is also a pivotal column and is related to all other columns having the same group
For specifying if a column will be loaded as a facet or not - "import" attribute - True/False, default True.
The use of these attributes will be exemplified in the following sections.
...
If we do not want to load some of the columns from the source as facets or we are using these columns only for data manipulation, we can use the following:
“import”: false
{Facet_key: false} - shortcut for "import": false
Do not add the column at all in the facets config file
Note: the data from these columns will still be loaded into Squirro, just the names of this columns will not be loaded as facets/keywords.
...
The body, title and summary/abstract of the Squirro item can be formatted. This is possible by applying a Jinja2 template and using templates files supplied as arguments. The following arguments can be supplied from the command line for templating:
--body-template-file
--title-template-file
--abstract-template-file
From the Jinja templates, we have access to the source data using 3 variables:
Variable Name | Type | Description | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
row | Python dictionary | Data as extracted from the source on format {key, value} key is the column name and value is the value of the key on one row Example accessing the value of a column:
| ||||||||||||||||||||||||
item | Python dictionary | State of the Squirro item after applying the mapping, data conversions and before pipelets. Example accessing a facet from the item:
| ||||||||||||||||||||||||
pivot_data | Python dictionary | Computed pivot data: If 2 tables are joined on a relation of 1 to n, the n rows are pivoted to generate only one line on the source. The pivot logic on the Data Loader recreates the rows from the joined table to use them in the Jinja template. The source extraction logic can be very complex and multiple joins with pivoted columns can be used. In order to differentiate columns from different the joins, the tool uses groups. if 2 columns are from the same group, it means that they were selected from the same joined table. Example: Source tables joined on InteractionID column:
Result |
is:
Data Loader pivot_data for the first line is:
The pivot columns are described using the facets configuration file. The attributes describing a column as pivotal are:
Example of usage in the template html file:
|
For this example we will use the CSV file again and we will format the body of the item using the --body-template-file argument.
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_interactions ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv ^ --facets-file config/sample_facets.json ^ --body-template-file template/template_body_interaction.html |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
Note: the name that we use for the facet in the “name” attribute in the facets config file has to be identical with the name from item.keywords tag in the template file
...
In the facets config file, the two column descriptions look like below:
...
config/sample_facets.json
Code Block | ||
---|---|---|
| ||
"Notes": { "name": "Notes", "pivotal_group": "Members", "delimiter": ";" "visible": false, "searchable":true }, "InternalAttendees": { "name": "Attendees", "group_name": "Interactions" "pivotal_group": "Members", "delimiter": ";" "visible": true } |
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --source-name csv_interactions ^ --source-type csv ^ --map-title InteractionSubject ^ --source-file interaction.csv ^ --facets-file config/sample_facets.json ^ --body-template-file template/template_body_interaction.html ^ --pipelets-error-behavior error ^ --pipelets-file config/sample_pipelets.json |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
We will use the following pipelets config file:
...
config/sample_pipelets.json
Code Block | ||
---|---|---|
| ||
{ "Sample": { "file_location": "sample_pipelet.py", "config": {} } } |
And in Python pipelet script, consume method just adds the name ‘Squirro’ to the title of the item :
sample_pipelet.py
Code Block | ||||
---|---|---|---|---|
| ||||
import squirro.sdk VERSION = '0.0.1' @squirro.sdk.require('log') class Sample(squirro.sdk.PipeletV1): def __init__(self, config): self.config = config def consume(self, item): item['title'] = item.get('title', '') + ' - Squirro' return item |
The result of our simple pipelet is shown below:
Complex job setup
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token %token% ^ --cluster %cluster% ^ --project-id %project_id% ^ --db-connection %db_connection% ^ --log-file log.txt ^ --parallel-uploaders 2 ^ --source-name db_interactions ^ --input-file interaction.sql ^ --source-type database ^ --source-batch-size 500 ^ --incremental-column DateModifiedOn ^ --map-title InteractionSubject ^ --map-id InteractionId ^ --map-created-at DateModifiedOn ^ --map-abstract InteractionType ^ --facets-file config/sample_facets.json ^ --body-template-file template/template_body_interaction.html ^ --pipelets-error-behavior error ^ --pipelets-file config/sample_pipelets.json ^ --meta-db-dir meta ^ --meta-db-file meta_interactions.db |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
The contents of the interactions.sql:
interactions.sql
Code Block | ||||
---|---|---|---|---|
| ||||
SELECT * FROM Interaction_test WHERE InteractionSubject = 'Tesla' |
The template Jinja2 file is shown below. The column "comments", coming from the pipelet is added to the item body.
...
...
template/template_body_interaction.html
Code Block | ||
---|---|---|
| ||
<html> <body> <table> <tr> <th> Type of Interaction </th> <td> {{item.keywords["Type of Interaction"][0]}} </td> </tr> <tr> <th> Comments </th> <td> {{item.keywords["comments"][0]}} </td> </tr> </table> <h4>Members</h4> </br> <table class="content"> <tr> <th><b>InternalAttendees</b></th> <th><b>Notes</b></th> </tr> {% if pivot_data%} {% for line in pivot_data["Members"] %} <tr> <td>{{line["InternalAttendees"]}}</td> <td>{{line["Notes"]}}</td> </tr> {% endfor %} {% endif %} </table> </body> </html> |
The complete facets configuration json file. Notice how the column "comments", comming from the pipelet was added in this json file, without being present in the source.
...
...
config/sample_facets.json
Code Block | ||
---|---|---|
| ||
{ "InteractionSubject": { "name": "Interaction Subject", "display_name": "Subject", "group_name": "Interactions", "visible": true, "searchable":true }, "InteractionType": { "name": "Type of Interaction", "display_name": "Type", "group_name": "Interactions", "visible": true, "searchable":true }, "DateModifiedOn": { "name": "Date of modification", "data_type": "datetime", "input_format_string": "%m/%d/%Y %H:%M", "format_instr": "YYYY-MM-DD", "visible": true }, "Notes": { "name": "Notes", "pivotal_group": "Members", "delimiter": ";" "visible": false, "searchable":true }, "InternalAttendees": { "name": "Attendees", "group_name": "Interactions" "pivotal_group": "Members", "delimiter": ";" "visible": true }, "NoAttendees": { "name": "Number of Attendees", "data_type": "int", "visible": true "searchable":false }, "comments": { "name": "Interaction comments", "display_name": "Comment", "data_type": "string", "visible": true, "searchable":true }, "EntittledUsers": { "name": "Users", "auth": true "auth_value_mandatory": true, "import": false }, } |
And the pipelets configuration file:
...
...
config/sample_pipelets.json
Code Block | ||
---|---|---|
| ||
{ "Sample": { "file_location": "sample_pipelet.py", "config": {} } } |
The consume() method for the pipelet:
sample_pipelet.py
Code Block | ||||
---|---|---|---|---|
| ||||
def consume(self, item): # Add new keyword, facet item['keywords']['comments'] = [u'This is a test load.'] return item |
...
And finally the contents of the log file, log.txt:
log.txt
Code Block | ||||
---|---|---|---|---|
| ||||
2016-03-08 15:43:28,198 sq_data_load.py[12609] INFO Starting process (version 0.3.18). 2016-03-08 15:43:28,206 sq_data_load.py[12609] INFO Add module sample_pipelet.py to path 2016-03-08 15:43:28,206 sq_data_load.py[12609] INFO Start load from database 2016-03-08 15:43:28,825 sources.database INFO Execute initial load (full) for incremental processing 2016-03-08 15:43:32,319 sq_data_load.py[12609] INFO Start loading/deleting into/from Squirro 4 rows with 2 parallel executors 2016-03-08 15:43:33,072 sq_data_load.py[12609] WARNING One of the fields "link", "id" or "summary" is missing for "3" items 2016-03-08 15:43:33,075 sq_data_load.py[12609] INFO Intermediate number of rows loaded to Squirro: 3 2016-03-08 15:43:33,197 sq_data_load.py[12609] WARNING One of the fields "link", "id" or "summary" is missing for "1" items 2016-03-08 15:43:33,200 sq_data_load.py[12609] INFO Intermediate number of rows loaded to Squirro: 4 2016-03-08 15:43:33,238 sq_data_load.py[12609] INFO Total rows loaded to Squirro: 4 2016-03-08 15:43:33,240 sq_data_load.py[12609] INFO Last incremental value loaded to Squirro: 2015-04-28T16:58:00 2016-03-08 15:43:33,302 sq_data_load.py[12609] INFO Total run time: 0:00:05.108269 |
...