This section shows all the available options for the Data Loader and explains their usage.

...

The following list describes all the available arguments. Arguments that must always be supplied are marked as mandatory.

General Options

-h
    Show a help message and exit.

--version
    Output the tool version and exit.

--verbose, -v
    Increase log verbosity.
      • Not specified: the tool outputs all warnings and errors.
      • Specified once: informational messages are also output.
      • Specified twice: debugging messages are shown as well.
      • Specified three times or more: object calls are shown in addition to the debug logging.

--log-file
    Path to a log file on disk where the log output is stored. If this is not specified, the log messages are shown on the console.

--parallel-uploaders NUMBER
    Number of parallel uploaders (default: 1).

--meta-db-dir PATH
    Directory of the SQLite metadata database.

--meta-db-file STRING
    File name of the SQLite metadata database.
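
For example, to run a load with debug-level logging written to a file instead of the console, the verbosity and log file options can be combined with any source configuration. A minimal sketch, reusing the CSV example from the Usage section below (sample.csv and the mapped columns are illustrative):

Code Block
languagepowershell
squirro_data_load -vv ^
    --log-file data_load.log ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --source-name csv_sample ^
    --source-file sample.csv ^
    --source-type csv ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description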

Connection Options (see Connecting to Squirro for finding these values)

--token TOKEN, -t TOKEN (mandatory)
    The authentication token with which to authenticate.
    If the token value starts with a dash, you need to use an equal sign to specify the value, like this: --token="-12345…"

--cluster URL, -c URL
    The Squirro cluster into which to import the data.

--project-id PROJECT_ID (mandatory)
    The project identifier into which to import the data.

Source - Item Mapping Options

--map-title STRING
    Column that is mapped to the "title" field.

--map-abstract STRING
    Column that is mapped to the "summary" field.

--map-created-at STRING
    Column that is mapped to the "created_at" field.

--map-id STRING
    Column that is mapped to the "external_id" field.

--map-body [STRING…]
    Columns that are mapped to the "body" field.

--map-url STRING
    Column that is mapped to the "link" field.

--map-file-name STRING
    Column that is mapped to the file name.

--map-file-mime STRING
    Column that is mapped to the file MIME type.

--map-file-data STRING
    Column that is mapped to the file contents.

--map-file-compressed STRING
    Column that specifies whether the file is compressed with gzip. Accepted values are 'y', 'yes', 't', 'true' and '1', case-insensitive.

--map-flag STRING
    Column that determines whether the received record is an insert/update or a delete. If the value is 'd', the record is deleted; otherwise it is treated as an insert/update.

ItemUploader Options (see the ItemUploader documentation for more information)

--object-id OBJECT_ID
    Object identifier.

--source-id SOURCE_ID
    Source identifier; defaults to the input file name.

--source-name SOURCE_NAME
    Source name; defaults to the input file name.

--bulk-index
    If set, the cluster is instructed to index the data in bulk.

--bulk-index-add-summary-from-body
    If set, the cluster is instructed to add the summary from the body during bulk indexing.

--batch-size NUMBER
    Batch size for uploads (default: auto, which adapts the batch size to the size of the payload).

Item Pre-Processing Options

--body-template-file PATH
    Full path to the Jinja2 HTML template used to render the item body.

--title-template-file PATH
    Full path to the Jinja2 HTML template used to render the item title.

--abstract-template-file PATH
    Full path to the Jinja2 HTML template used to render the item summary.

--pipelets-file PATH
    JSON file containing the pipelets called by the Data Loader, in execution order.
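
A minimal sketch of what such a pipelets file might contain. The exact keys shown here (pipelet name mapped to file location, stage and configuration) are an assumption based on typical pipelet configurations; consult the pipelets documentation for the authoritative schema:

Code Block
languagejson
{
    "DemoPipelet": {
        "file_location": "./demo_pipelet.py",
        "stage": "before templating",
        "config": {"key": "value"}
    }
}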

--pipelets-error-behavior STRING
    Job behavior in case a pipelet raises an exception. Default: error.

--facets-file PATH
    JSON file containing the facets configuration.
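
A minimal sketch of a facets file, mapping a source column to facet properties. The property names shown here are assumptions; consult the facets configuration documentation for the full set of supported options:

Code Block
languagejson
{
    "Category": {
        "name": "category",
        "default_value": "unknown"
    }
}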

Source Options

--source-type STRING
    Type of source to load data from.

--source-script PATH
    Path of a custom DataSource Python script.

--source-batch-size NUMBER
    Batch size for source unloads (default: 1000).

--incremental-column STRING
    Date/datetime column used as the incremental reference. If this is not specified, a full load is performed.

--reset
    Deletes the stored incremental date information for the current SQL query. Useful to restart an incremental load from scratch.

CSV Source Options

When using a CSV file as the data source, only full load is supported, and the data must have a header row to determine the schema.

The command line parameters used for a CSV data source:

--csv-delimiter CHARACTER
    A one-character string used to separate fields. Defaults to ','.

--csv-quotechar CHARACTER
    A one-character string used to quote fields that contain special characters (such as the delimiter or the quote character) or new-line characters. Defaults to '"'.

--source-file PATH (mandatory)
    Path of the CSV data file.

Usage

The following example shows a simple load from a CSV file, mapping the title, ID and body of the Squirro item to columns from the 'sample.csv' file, without using any of the additional files for facets, templating, pipelets, etc. All rows are loaded; fields are separated by the ',' (comma) character, and fields containing special characters are quoted with the double quote character '"'.

Code Block
languagepowershell
squirro_data_load -v ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --source-name csv_sample ^
    --source-file sample.csv ^
    --source-type csv ^
    --csv-delimiter , ^
    --csv-quotechar " ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description 

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

Excel Source Options

When using an Excel file as the data source, only full load is supported, and the data must have a header row in order to determine the schema. If the first row of the data (after applying the boundaries, if specified) is not the header, a KeyError exception is raised and the job stops, since the schema of the data cannot be determined.

The command line parameters used for an Excel data source:

--excel-sheet STRING
    Excel sheet name. Default: the first sheet.

--excel-boundaries NUMBER:NUMBER
    Limit the rows loaded from the Excel file. The format is start_row:rows_discarded_from_end.

--source-file PATH (mandatory)
    Path of the Excel data file.

Usage

The example below shows a simple load from an Excel file, mapping only the title, ID and body of the Squirro item to columns from the 'sample.xlsx' file, without using any of the additional files for facets, templating, pipelets, etc. The Data Loader tool only loads the 'Products' sheet of the file, starting at row 1 and discarding the last 100 rows, which are not loaded.

Code Block
languagepowershell
squirro_data_load -v ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --source-name excel_sample ^
    --source-file sample.xlsx ^
    --source-type excel ^
    --excel-sheet Products ^
    --excel-boundaries 1:100 ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description 

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

Database Options

When loading from a database, both full and incremental load are supported, using a select query supplied as a string or in a file. The script uses SQLAlchemy to connect to any database.

Tested databases:

  • PostgreSQL and all databases using the Postgres driver for connection (Greenplum, Redshift, etc.)
  • Microsoft SQL Server
  • Oracle
  • MySQL
  • SQLite

The command line parameters used for a database source:

--db-connection STRING (mandatory)
    Database connection string.

--input-file PATH
    File containing the SQL code.

--input-query STRING
    SQL query.

Note that the --input-file and --input-query arguments are mutually exclusive.
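
For illustration, a minimal sketch of what such an input file might contain. The table and column names are hypothetical; the column aliases line up with the --map-* arguments used in the example below:

Code Block
languagesql
SELECT id          AS ID,
       title       AS Title,
       description AS Description
FROM interaction;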

Usage

In the following example we perform a simple load from a database, mapping the title, ID and body of the Squirro item to columns from a database table that is queried in the interaction.sql file. The Data Loader tool performs a full load of all the rows returned by that file, since the --incremental-column argument is not set.

Code Block
languagepowershell
squirro_data_load -v ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --db-connection $db_connection_string ^
    --source-name db_sample ^
    --input-file $script_dir/interaction.sql ^
    --source-type database ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description 

Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.
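
To switch the same load to incremental mode, add the --incremental-column argument, pointing at a date/datetime column returned by the query; --reset can be combined with it to discard the stored state and start over. A minimal sketch, assuming the query returns a datetime column named updated_at (hypothetical):

Code Block
languagepowershell
squirro_data_load -v ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --db-connection $db_connection_string ^
    --source-name db_sample ^
    --input-file $script_dir/interaction.sql ^
    --source-type database ^
    --incremental-column updated_at ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description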

...

If data needs to be extracted from sources other than the ones described above, there is the option to write a custom source. To do this, a new Python module must be created that implements the abstract base class DataSource. This way the Data Loader can index data from other sources without modification.

Using the custom source is very simple: supply the full path of the Python module with the --source-script command line argument instead of using --source-type. The loader automatically adds the custom module to the PYTHONPATH and imports it.
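
A minimal invocation sketch, assuming a custom module my_source.py (hypothetical) that implements DataSource and exposes Title, ID and Description fields:

Code Block
languagepowershell
squirro_data_load -v ^
    --token $token ^
    --cluster $cluster ^
    --project-id $project_id ^
    --source-name custom_sample ^
    --source-script my_source.py ^
    --map-title Title ^
    --map-id ID ^
    --map-body Description

As with the examples above, on Mac and Linux use backslash (\) instead of the circumflex (^) for line continuation.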