Page Comparison

...

Excerpt
This page shows a number of examples for using the Data Loader.

For a full reference documentation please see Data Loader Reference. For a step-by-step tutorial, please refer to the Data Loader Tutorial.

Sample data

The data loader tool requires a tabular input format (CSV, Excel or database connection). All three data sources are explained below and there is an example for each.

...

The following sample Excel file is used to highlight the Squirro Data Loader capabilities for loading an Excel file. The file contains a single header row and three data rows.

Example:

Office Excel

name	interaction.xlsx

...

CSV

The following sample CSV file is used to highlight the Squirro Data Loader capabilities for loading a CSV file. The file contains a single header row and three data rows. The sample file uses the following formatting options:

Field delimiter: , (comma)
Text delimiter used for fields with multiple values: ; (semicolon)

Example:

Code Block

language	text

InteractionID,InteractionSubject,InteractionType,Date,Notes,InternalAttendees,NoAtendees,EntittledUsers,DateModifiedOn
100AAA,Tesla,Phone Call,2015-4-23T00:00:00,note1,Joe Average;Jane Houseowner;Elon Musk,3,jdoe;jhouseowner;emusk,4/23/2015 16:58
102BBB,PayPal,Meeting,2015-4-24T00:00:00,note2,Joe Average; Elon Musk,2,jdoe;emusk,4/24/2015 16:58
105CCC,Tesla,Meeting,2015-4-25T00:00:00,note3,Joe Average; Jane Houseowner,2,jdoe;jhouseowner,4/25/2015 16:58

...

Database

The following database sample inserts are used to highlight the capabilities of the Data Loader regarding the load from a database table. The example below is for an SQL Server database.

...

Code Block

language	sql

INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('100AAA', 'Tesla', 'Phone Call', '4/23/2015', 'note1', 'Joe Average;Jane Houseowner;Elon Musk', 3, 'jdoe;jhouseowner;emusk', '4/23/2015 16:58');

INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('102BBB', 'PayPal', 'Meeting', '4/24/2015', 'note2', 'Joe Average;Elon Musk', 2, 'jdoe;emusk', '4/24/2015 16:58');

INSERT INTO Interaction_test
(InteractionId, InteractionSubject, InteractionType, Date, Notes, InternalAttendees, NoAttendees , EntittledUsers, DateModifiedOn)
VALUES
('105CCC', 'Tesla', 'Meeting', '4/25/2015', 'note3',  'Joe Average;Jane Houseowner', 2, 'jdoe;jhouseowner', '4/25/2015 16:58');

...

Data mapping

Some (or all) of the source columns can be mapped to the Squirro item fields using the command line arguments. In order for items to be loaded into Squirro, either title or body needs to be specified, all the other fields are optional. Also the --map-flag and --map-id parameters are mutually exclusive.

Field Name	Mandatory	Description	Example
title	Yes*	Item title.	“Conversation1”
abstract	No	Item summary.	“Phone call”
created_at	No	Item creation date and time (see Date and Time).	02/19/2016
id	No	External item identifier. Maximum length is 256 characters.	100AAA
body	Yes*	HTML markup with full representation of the item content.	Note1
url	No	URL which points to the item origin.	``example.com``
file_name	No	LOB file name
file_mime	No	LOB file mime type
file_data	No
file_ compressed	No
flag	No

*It is mandatory that at least one of the title or body fields is mapped.

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name mapping ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --map-id InteractionID ^
    --map-abstract Notes ^
    --map-body InteractionType InternalAttendees NoAtendees ^
    --map-created-at Date ^
    --source-file interaction.csv

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.Image Removed

...

Simple usage and flow control

...

The --source-file argument specifies the path to the file from which we will load the data. For the database load the --input-file is used instead to specify the sql statement used to retrieve data from the database.

CSV Example

Code Block

language	powershell
title	CSV Example
collapse	true

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_source ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv

Excel Example

Code Block

language	powershell
title	Excel Example
collapse	true

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name excel_source ^
    --source-type excel ^
    --map-title InteractionSubject ^
    --source-file interaction.xlsx

SQL Example

Code Block

collapse

language	powershell
title	SQL Example
	true

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --db-connection %db_connection% ^
    --source-name db_source ^
    --source-type database ^
    --map-title InteractionSubject ^
    --input-file interaction.sql

...

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

...

interaction.sql (used for SQL example)

Code Block

collapselanguage	truesql

SELECT * 
FROM Interaction_test
WHERE InteractionSubject = 'Tesla'

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --db-connection %db_connection% ^
    --project-id %project_id% ^
    --source-name db_source ^
    --source-type database ^
    --incremental-column Date ^
    --map-title InteractionSubject ^
    --input-file interaction.sql

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Another row was inserted in the meantime:

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --db-connection %db_connection% ^
    --project-id %project_id% ^
    --source-name db_source ^
    --source-type database ^
    --incremental-column Date ^
    --reset ^
    --map-title InteractionSubject ^
    --input-file interaction.sql

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --object-id 3rd8UDYpQwS8vvFoT3JVWA ^
    --source-id yYPR8DA1SPyhTbEXIH0HYw ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Usage for bulk-index, bulk-index-add-summary-from-body, batch-size

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name bulk_indexing_source ^
    --source-type csv ^
    --bulk-index ^
    --bulk-index-add-summary-from-body ^
    --batch-size 70 ^
    --map-title InteractionSubject ^
    --map-body Notes ^
    --source-file interaction.csv

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_source ^
    --parallel-uploaders 5 ^
    --source-batch-size 50 ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Metadata database related arguments

The SQLlite metadata database is used for the incremental loads and the locking mechanism. By default, the database file is created in the user home folder and has the name sq_data_load.db but this can be changed with the following parameters:

--meta-db-dir sets the directory
--meta-db-file sets the name of the SQLlite metadata database

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_source ^
    --meta-db-dir %meta_dir% ^
    --meta-db-file %meta_file% ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Facets

In this chapter we will describe the use and functionality of the facets configuration file. The facets/keywords are used for filtering data in Squirro.

This file will subsequently be used for three scenarios:

Loading the facets/keywords into Squirro
Formatting the data coming from the source (for data types int, float and dates)
Mapping the source columns to the Squirro facets

The facet configuration file contains the attributes of the columns mapped as facets in Squirro and has a hjson format, which is a more relaxed and easier to read json format. To run the Data Loader script with the facet information, we need to use the --facets-file parameter with the full path to the JSON file.

...

When creating or updating facets in Squirro, the following attributes are taken into account (and subsequently added/updated):

name
group_name
display_name
data_type
visible
searchable
format_instr

The attributes used for formatting of the item:

default_value
data_type
input_format_string
delimiter

Attribute used for skipping columns:

import

Attributes used for access control:

auth
auth_value_mandatory
delimiter

Pivotal attributes:

pivotal_group
delimiter

You can find more information on these attributes in the Data Loader Facet Config Reference.

Below is a basic example for the facets configuration file and the mapping between the source field and the facet names. We will add attributes to the file as we describe more and more functionality. The “name” attribute is the one that identifies a facet in Squirro and not the key of each facet dictionary, unless the “name” attribute is missing and then the facet is identified by the dictionary key. The key of each facet dictionary is in fact the field name from the data source.

...

config/sample_facets.json

Code Block

language	js

{   
    "InteractionSubject": {
        "name": "Interaction Subject",
        "display_name": "Subject",
        "group_name": "Interactions",
        "visible": true,
        "searchable": true
    },
     "InteractionType": {
        "name": "Type of Interaction",
        "display_name": "Type",
        "group_name": "Interactions",
        "visible": true,
        "searchable": true
    },
    "Date": {
        "name": "Date of creation",
        "data_type": "datetime",
        "input_format_string": "%Y-%m-%dT%H:%M:%S",
        "visible": true
    },
    "Notes": {
        "name": "Notes",
        "visible": false,
        "searchable": true
    },
    "InternalAttendees": {
        "name": "Attendees",
        "group_name": "Interactions"
        "visible": true
    },
    "NoAtendees": {
        "name": "Number of Attendees",
        "data_type": "int",
        "visible": true
        "searchable": false
     },
}

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_interactions ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv ^
    --facets-file config/sample_facets.json

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

You can find the loaded facets in the Data >> Facets tab. You can see below the Facets tab before loading and after:

...

You may have noticed the formatting on the datetime data type with the "input_format_string" but there is even more formatting. The attribute "input_format_string" is applied to the data loaded into Squirro, more precisely to the date/datetime data types and the "format_instr" is applied on facet values in Squirro.

input_format_string: is the format of the date/datetime strings coming from the source (used for csv and excel, the date values coming from the database are of type datetime and don’t need a format).
format_instr: this is the formatting instruction for the facet value display. This parameter defines how the Squirro UI will display the value. This is also relevant for int and float data types. It’s also documented here.

Code Block

language	js

    "Date": {
        "name": "Date of creation",
        "data_type": "datetime",
        "input_format_string": "%Y-%m-%dT%H:%M:%S",
        "format_instr": "YYYY-MM-DD",
        "visible": true
    }

Note that the "input_format_string" should have a Python date format and "format_instr" is just generic date formatting.

Image ModifiedOther uses

Other uses of the facets config file are:

For Access Control support - “auth” attributes - True/False, default False. One or multiple columns from the source can be used for access control. The values of the specified columns are used to determine user access to the Squirro item.
For pivotal columns -“pivotal_group” attribute - if this attribute exists than the column is also a pivotal column and is related to all other columns having the same group
For specifying if a column will be loaded as a facet or not - "import" attribute - True/False, default True.

The use of these attributes will be exemplified in the following sections.

...

If we do not want to load some of the columns from the source as facets or we are using these columns only for data manipulation, we can use the following:

“import”: false
{Facet_key: false} - shortcut for "import": false
Do not add the column at all in the facets config file

Note: the data from these columns will still be loaded into Squirro, just the names of this columns will not be loaded as facets/keywords.

...

The body, title and summary/abstract of the Squirro item can be formatted. This is possible by applying a Jinja2 template and using templates files supplied as arguments. The following arguments can be supplied from the command line for templating:

--body-template-file
--title-template-file
--abstract-template-file

From the Jinja templates, we have access to the source data using 3 variables:

Variable Name

Type

Description

row

Python dictionary

Data as extracted from the source on format {key, value} key is the column name and value is the value of the key on one row

Example accessing the value of a column:

Code Block
{{row["InteractionSubject"]}}

item

Python dictionary

State of the Squirro item after applying the mapping, data conversions and before pipelets.

Example accessing a facet from the item:

Code Block
{{item.keywords["InteractionSubject"][0]}}

pivot_data

Python dictionary

Computed pivot data:

If 2 tables are joined on a relation of 1 to n, the n rows are pivoted to generate only one line on the source.

The pivot logic on the Data Loader recreates the rows from the joined table to use them in the Jinja template.

The source extraction logic can be very complex and multiple joins with pivoted columns can be used. In order to differentiate columns from different the joins,

the tool uses groups. if 2 columns are from the same group, it means that they were selected from the same joined table.

Example:

Source tables joined on InteractionID column:

InteractionID	InteractionSubject
100AAA	Tesla
102BBB	Paypal

InteractionID	Notes	InternalAtendees
100AAA	note1	Joe Average
100AAA	note11	Jane Houseowner
100AAA	note12	Elon Musk
102BBB	note2	Joe Average
102BBB	note21	Elon Musk

^Result

^is:

Code Block

language	text

InteractionID,InteractionSubject,Notes,InternalAttendees
100AAA,Tesla,note1;note11;note12,Joe Average;Jane Houseowner;Elon Musk
102BBB,PayPal,note2;note21,Joe Average;Elon Musk

Data Loader pivot_data for the first line is:

Code Block

language	py

{"group name": 
    [
		{"Notes": "note1", "InternalAttendees": "Joe Average"},
	    {"Notes": "note11", "InternalAttendees": "Jane Houseowner"},
    	{"Notes": "note12", "InternalAttendees": "Elon Musk"}
	]
}

The pivot columns are described using the facets configuration file. The attributes describing a column as pivotal are:

pivotal_group: the name of the group
delimiter: usually the pivot columns must be split (we covered this attribute in the section Columns containing separators)

Example of usage in the template html file:

Code Block

<table>
    <tr>
        <th>Notes</th>
        <th>InternalAttendees</th>
    </tr>
    {% for line in pivot_data["group name"] %} 
        <tr>
            <th><b>{{line["Notes"]}}</b></th>
            <th><b>{{line["InternalAttendees"]}}</b></th>
        </tr>
    {% endfor %}
</table>

For this example we will use the CSV file again and we will format the body of the item using the --body-template-file argument.

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_interactions ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv ^
    --facets-file config/sample_facets.json ^
    --body-template-file template/template_body_interaction.html

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

Note: the name that we use for the facet in the “name” attribute in the facets config file has to be identical with the name from item.keywords tag in the template file

...

In the facets config file, the two column descriptions look like below:

...

config/sample_facets.json

Code Block

language	js

	"Notes": {
        "name": "Notes",
        "pivotal_group": "Members",
        "delimiter": ";"
        "visible": false,
        "searchable":true
    },
    "InternalAttendees": {
        "name": "Attendees",
        "group_name": "Interactions"
        "pivotal_group": "Members",
        "delimiter": ";"
        "visible": true
	}

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --source-name csv_interactions ^
    --source-type csv ^
    --map-title InteractionSubject ^
    --source-file interaction.csv ^
    --facets-file config/sample_facets.json ^
    --body-template-file template/template_body_interaction.html ^
    --pipelets-error-behavior error ^
    --pipelets-file config/sample_pipelets.json

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

We will use the following pipelets config file:

...

config/sample_pipelets.json

Code Block

language	js

{
    "Sample": {
        "file_location": "sample_pipelet.py",
        "config": {}
    }
}

And in Python pipelet script, consume method just adds the name ‘Squirro’ to the title of the item :

sample_pipelet.py

Code Block

language	py	title	sample_pipelet.py

import squirro.sdk
VERSION = '0.0.1'

@squirro.sdk.require('log')

class Sample(squirro.sdk.PipeletV1):

    def __init__(self, config):
        self.config = config

    def consume(self, item):
        item['title'] = item.get('title', '') + ' - Squirro'
        return item

The result of our simple pipelet is shown below:

Image Modified

Complex job setup

...

Code Block

language	powershell

squirro_data_load -v ^
    --token %token% ^
    --cluster %cluster% ^
    --project-id %project_id% ^
    --db-connection %db_connection% ^
    --log-file log.txt ^
    --parallel-uploaders 2 ^
    --source-name db_interactions ^
    --input-file interaction.sql ^
    --source-type database ^
    --source-batch-size 500 ^
    --incremental-column DateModifiedOn ^
    --map-title InteractionSubject ^
    --map-id InteractionId ^
    --map-created-at DateModifiedOn ^
    --map-abstract InteractionType ^
    --facets-file config/sample_facets.json ^
    --body-template-file template/template_body_interaction.html ^
    --pipelets-error-behavior error ^
    --pipelets-file config/sample_pipelets.json ^
    --meta-db-dir meta ^
    --meta-db-file meta_interactions.db

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

The contents of the interactions.sql:

interactions.sql

Code Block

language	sql	title	interactions.sql

SELECT *
FROM Interaction_test
WHERE InteractionSubject = 'Tesla'

The template Jinja2 file is shown below. The column "comments", coming from the pipelet is added to the item body.

...

template/template_body_interaction.html

Code Block

language	xml

<html>
   <body>
      <table>
         <tr>
            <th>
               Type of Interaction
            </th>
            <td>
               {{item.keywords["Type of Interaction"][0]}}
            </td>
         </tr>
         <tr>
            <th>
               Comments
            </th>
            <td>
               {{item.keywords["comments"][0]}}
            </td>
         </tr>
      </table>
     
      <h4>Members</h4>
      </br>
      <table class="content">
         <tr>
            <th><b>InternalAttendees</b></th>
            <th><b>Notes</b></th>
         </tr>
         {% if pivot_data%}
         {% for line in pivot_data["Members"] %} 
         <tr>
            <td>{{line["InternalAttendees"]}}</td>
            <td>{{line["Notes"]}}</td>
         </tr>
         {% endfor %}
         {% endif %}
      </table>
   </body>
</html>

The complete facets configuration json file. Notice how the column "comments", comming from the pipelet was added in this json file, without being present in the source.

...

config/sample_facets.json

Code Block

language	js

{   
    "InteractionSubject": {
        "name": "Interaction Subject",
        "display_name": "Subject",
        "group_name": "Interactions",
        "visible": true,
        "searchable":true
    },
     "InteractionType": {
        "name": "Type of Interaction",
        "display_name": "Type",
        "group_name": "Interactions",
        "visible": true,
        "searchable":true
    },
    "DateModifiedOn": {
        "name": "Date of modification",
        "data_type": "datetime",
        "input_format_string": "%m/%d/%Y %H:%M",
        "format_instr": "YYYY-MM-DD",
        "visible": true
    },
    "Notes": {
        "name": "Notes",
        "pivotal_group": "Members",
        "delimiter": ";"
        "visible": false,
        "searchable":true
    },
    "InternalAttendees": {
        "name": "Attendees",
        "group_name": "Interactions"
        "pivotal_group": "Members",
        "delimiter": ";"
        "visible": true
    },
    "NoAttendees": {
        "name": "Number of Attendees",
        "data_type": "int",
        "visible": true
        "searchable":false

    },
    "comments": {
        "name": "Interaction comments",
        "display_name": "Comment",
        "data_type": "string",
        "visible": true,
        "searchable":true
    },
    "EntittledUsers": {
        "name": "Users",
        "auth": true
        "auth_value_mandatory": true,
        "import": false
    },
}

And the pipelets configuration file:

...

config/sample_pipelets.json

Code Block

language	js

{
    "Sample": {
        "file_location": "sample_pipelet.py",
        "config": {}
    }
}

The consume() method for the pipelet:

sample_pipelet.py

Code Block

language	py	title	sample_pipelet.py

    def consume(self, item):
         # Add new keyword, facet
        item['keywords']['comments'] = [u'This is a test load.']
        return item

...

And finally the contents of the log file, log.txt:

log.txt

Code Block

language	text	title	log.txt

2016-03-08 15:43:28,198 sq_data_load.py[12609] INFO     Starting process (version 0.3.18).
2016-03-08 15:43:28,206 sq_data_load.py[12609] INFO     Add module sample_pipelet.py to path
2016-03-08 15:43:28,206 sq_data_load.py[12609] INFO     Start load from database
2016-03-08 15:43:28,825 sources.database INFO     Execute initial load (full) for incremental processing
2016-03-08 15:43:32,319 sq_data_load.py[12609] INFO     Start loading/deleting into/from Squirro 4 rows with 2 parallel executors
2016-03-08 15:43:33,072 sq_data_load.py[12609] WARNING  One of the fields "link", "id" or "summary" is missing for "3" items
2016-03-08 15:43:33,075 sq_data_load.py[12609] INFO     Intermediate number of rows loaded to Squirro: 3
2016-03-08 15:43:33,197 sq_data_load.py[12609] WARNING  One of the fields "link", "id" or "summary" is missing for "1" items
2016-03-08 15:43:33,200 sq_data_load.py[12609] INFO     Intermediate number of rows loaded to Squirro: 4
2016-03-08 15:43:33,238 sq_data_load.py[12609] INFO     Total rows loaded to Squirro: 4
2016-03-08 15:43:33,240 sq_data_load.py[12609] INFO     Last incremental value loaded to Squirro: 2015-04-28T16:58:00
2016-03-08 15:43:33,302 sq_data_load.py[12609] INFO     Total run time: 0:00:05.108269

...

Versions Compared

Old Version 51

New Version Current

Key

Table of Contents

Sample data

CSV

Database

Data mapping

Simple usage and flow control

CSV Example

Excel Example

SQL Example

interaction.sql (used for SQL example)

ItemUploader arguments: object-id, source-id, source-name, bulk-index, bulk-index-add-summary-from-body, batch-size

Usage for bulk-index, bulk-index-add-summary-from-body, batch-size

Parallel executors and source batch size arguments: parallel-uploaders, source-batch-size

Metadata database related arguments

Facets

config/sample_facets.json

config/sample_facets.json

config/sample_pipelets.json

sample_pipelet.py

Complex job setup

interactions.sql

template/template_body_interaction.html

config/sample_facets.json

config/sample_pipelets.json

sample_pipelet.py

log.txt