This section shows all the available options for the Data Loader and explains their usage.

Basic Usage

The purpose of the Data Loader tool is to extract data from different sources (supported databases, CSV and Excel files (xls and xlsx), JSON files, the filesystem, feeds and other Squirro projects) and upload it into Squirro.

The Data Loader is called from the command line with multiple arguments, some of which are mandatory. All command line arguments are covered in the following sections.
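Purely as an illustration, a minimal invocation has the following shape; the connection values and column names are placeholders, and every argument used here is described in the sections below.

Code Block
languagebash
# Minimal sketch of a Data Loader call (placeholder values).
# --token, --cluster and --project-id identify the target Squirro project,
# --source-type and --source-file select the input, and the --map-* options
# map source columns to Squirro item fields.
squirro_data_load -v \
    --token "$token" \
    --cluster "$cluster" \
    --project-id "$project_id" \
    --source-type csv \
    --source-file data.csv \
    --map-title Title \
    --map-id ID \
    --map-body Description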

Arguments

The following sections list all the available arguments, grouped by category. Mandatory arguments are marked as such; all other arguments are optional.

General Options

--help, -h
Show a help message and exit.

--version
Output the tool version and exit.

--verbose, -v
Increase log verbosity.

  • Not specified: the tool outputs all warnings and errors.

  • Specified once: informational messages are also output.

  • Specified twice: debugging messages are also shown.

  • Specified three times or more: object calls are shown in addition to the debug logging.

--log-file
Path to a log file on disk where the log output is to be stored. If this is not specified, the log messages are shown on the console.

--parallel-uploaders NUMBER
Number of parallel uploaders (default is 1).

Note: parallel uploading is currently unsupported in the Data Loader on Microsoft Windows.

--meta-db-dir PATH
Directory of the SQLite metadata database.

--meta-db-file STRING
File name of the SQLite metadata database.

--transform-items
Apply item transformation on the client/local machine instead of on the server.

This flag must be used when loading data into Squirro servers with version <= 3.3.0, or optionally when it is preferred to perform the item transformation on the client/local machine. From Squirro version 3.3.1, the default behaviour is that the item transformation takes place on the server, specifically in the Transform Input step of the Pipeline Workflow.

Connection Options (see Connecting to Squirro for finding these values)

--token TOKEN, -t TOKEN (mandatory)
The Authentication Token with which to authenticate.

If the token value starts with a dash, you need to use an equal sign to specify the value, like this: --token="-12345…"

--cluster URL, -c URL
The Squirro cluster into which to import the data.

--project-id PROJECT_ID (mandatory)
The project identifier into which to import the data.

Testing Options

--limit LIMIT
If set, then only this many items are sent to Squirro.

This can be used to test the options on a small subset of the data to make sure the mapping, facets and pipelets all work correctly.

--dry-run, -n
Do all the processing, except server-side actions (no uploads or facet configuration).

Source - Item Mapping Options

--map-title STRING
Which column is mapped to the "title" field.

--map-abstract STRING
Which column is mapped to the "summary" field.

--map-created-at STRING
Which column is mapped to the "created_at" field.

--map-id STRING
Which column is mapped to the "external_id" field.

--map-body [STRING…]
Which columns are mapped to the "body" field.

--map-body-mime STRING
Which column is mapped to the MIME type of the body. Use --body-mime to set this to a fixed value instead.

The typical values for this are either text/html or text/plain.

--body-mime STRING
Fixed MIME type of the body. Use --map-body-mime to map to a column instead.

The typical values for this are either text/html or text/plain.

--map-url STRING
Which column is mapped to the "link" field.

--map-webshot-url STRING
Which column is mapped to the webshot_url used for thumbnail extraction.

--map-webshot-picture-hint STRING
Which column is mapped to the webshot_picture_hint.

--map-file-name STRING
Which column is mapped to the file name.

--map-file-mime STRING
Which column is mapped to the file MIME type.

--map-file-data STRING
Which column is mapped to the file contents.

--map-file-compressed STRING
Which column specifies whether the file is compressed with gzip. Possible values are 'y', 'yes', 't', 'true' and '1', case insensitive.

--map-flag STRING
Which column determines whether the received record is an insert/update or a delete. If the value is 'd' the record is deleted, otherwise it is considered an insert/update.

--map-language STRING
Which column is mapped to the language.

ItemUploader Options (see the ItemUploader documentation for more information)

--object-id OBJECT_ID
Object identifier.

--source-id SOURCE_ID
Source identifier, defaults to the input file name.

--source-name SOURCE_NAME
Source name, defaults to the input file name.

--enable-filtering
Enable item filtering to support alerts. (Deprecated, use a suitable pipeline workflow instead.)

--enable-near-duplicate-detection
Enable near-duplicate detection to link similar documents in the Squirro user interface. (Deprecated, use a suitable pipeline workflow instead.)

--batch-size NUMBER
Batch size for uploads (default is auto, which adjusts the value based on the size of the payload).

Item pre-processing Options

--body-template-file PATH
Jinja2 HTML template file (full path) used for the item body. A minimal sketch is shown below, after this group of options.

--title-template-file PATH
Jinja2 HTML template file (full path) used for the item title.

--abstract-template-file PATH
Jinja2 HTML template file (full path) used for the item abstract.

--pipelets-file PATH
JSON file containing the pipelets called by the Data Loader, in execution order.

--pipelets-error-behavior STRING
Specify the job behavior in case a pipelet raises an exception.

Valid values are error and warn. The default is error.

--facets-file PATH
JSON file containing the facets configuration. A minimal sketch of such a file is shown after this argument list.
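The snippet below is a minimal, hypothetical sketch of a Jinja2 body template as used by --body-template-file. The column names Title and Description are assumptions about the source data, and the exact variables exposed to the template should be verified against the templating documentation.

Code Block
languagebash
# Hypothetical body template written from the shell; pass the resulting file
# with --body-template-file body.html.j2. Title and Description are assumed
# source column names.
cat > body.html.j2 <<'EOF'
<html>
  <body>
    <h1>{{ Title }}</h1>
    <p>{{ Description }}</p>
  </body>
</html>
EOF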

Source Options

--pipeline-workflow-name
Select the pipeline workflow by name. Only interpreted if --pipeline-workflow-id is not set. (Introduced in Squirro 2.6.0.)

--pipeline-workflow-id
Select the pipeline workflow by ID. If not set, the default workflow is used. (Introduced in Squirro 2.6.0.)

--source-type STRING
Type of source to load data from.

Valid values are excel, csv, database, json, filesystem, squirro and feed.

--source-script PATH
Path of the Data Loader Plugin Python script.

--source-batch-size NUMBER
Batch size for source unloads (default is 1000).

--incremental-column STRING
Which column will be used as the incremental reference, usually a datetime column. If missing, a full load will be done.

--job-id STRING
Used for job locking and for storing the last-known value of incremental columns. If not given, this is calculated based on the source parameters.

--reset
Deletes the stored incremental date information for the current SQL query, so that the next incremental load starts from scratch.
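The --facets-file option above expects a JSON file that maps source columns to facet configurations. The sketch below is an illustration only: the column names (Category, Published) and the attributes shown are assumptions, and the full set of supported attributes is described in the facets documentation.

Code Block
languagebash
# Hypothetical facets.json: each key is a source column name, each value the
# configuration of the corresponding Squirro facet. Attribute names should be
# checked against the facets documentation.
cat > facets.json <<'EOF'
{
    "Category": {
        "name": "category",
        "display_name": "Category",
        "visible": true,
        "searchable": true
    },
    "Published": {
        "name": "published_at",
        "display_name": "Published"
    }
}
EOF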

CSV Source Options

When using a CSV file as a data source, only full load is supported. Data must have a header to determine the schema.

The command line parameters used for a CSV data source:

--csv-delimiter CHARACTER
A one-character string used to separate fields. It defaults to ','.

--csv-quotechar CHARACTER
A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '"'.

--csv-encoding STRING
A string specifying the file encoding to be used, e.g. 'utf8'. If not provided, the loader will try to best-guess the encoding.

--source-file PATH (mandatory)
Path of the CSV data file.

Usage

The following example shows a simple load from a CSV file, mapping the title, id and body of the Squirro item to columns from the 'sample.csv' file, without using any of the additional files for facets, templating, pipelets etc. All rows will be loaded, the comma (',') is used as the field delimiter, and fields containing special characters are quoted with the double quote character ('"').

Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name csv_sample \
    --source-file sample.csv \
    --source-type csv \
    --csv-delimiter , \
    --csv-quotechar '"' \
    --map-title Title \
    --map-id ID \
    --map-body Description 

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.
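For reference, a sample.csv matching the mappings in the example above could look like the sketch below; the rows are made up for illustration.

Code Block
languagebash
# Hypothetical contents of sample.csv. The header row provides the schema
# (ID, Title, Description) that the --map-* options above refer to.
cat > sample.csv <<'EOF'
ID,Title,Description
1,"First product","A short description, with a comma that needs quoting."
2,"Second product","Another description."
EOF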

Excel Source Options

When using an Excel file as the data source, only full load is supported and the data must have a header row in order to determine the schema. If the first row of the data (after applying the boundaries, if needed) is not the header, a KeyError exception will be raised and the job will stop, because the schema of the data cannot be determined.

The command line parameters used for an Excel data source:

--excel-sheet STRING
Excel sheet name. By default, the first sheet is used.

--excel-boundaries NUMBER:NUMBER
Limit the rows loaded from the Excel sheet. The format is start_row:rows_discarded_from_end.

--source-file PATH (mandatory)
Path of the Excel data file.

Usage

The example below shows a simple load from an Excel file, mapping only the title, id and body of the Squirro item to columns from the 'sample.xlsx' file, without using any of the additional files for facets, templating, pipelets etc. The Data Loader tool will only load the 'Products' sheet of the file, starting at row 1 and discarding the last 100 rows.

Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name excel_sample \
    --source-file sample.xlsx \
    --source-type excel \
    --excel-sheet Products \
    --excel-boundaries 1:100 \
    --map-title Title \
    --map-id ID \
    --map-body Description 

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

JSON Source Options

When using a JSON file as the data source, only full load is supported.

The schema of the data is determined from the first item found, which assumes that all items have the same structure. If that is not the case, the loader might fail.

The command line parameters used for a JSON data source:

--item-schema STRING
In case the JSON objects are not available as a top-level structure, use this parameter to un-nest the JSON structure.

--source-file PATH
Path of the JSON data file. One of --source-file or --source-folder must be provided.

--source-folder PATH
Path of a directory containing multiple JSON files. Only available in CLI mode. One of --source-file or --source-folder must be provided.

Usage

The example below shows a load from a nested JSON file (see attachment.json for an example and facets.json to demonstrate how to map facets)

Code Block
languagebash
squirro_data_load -vv \
    --cluster "$CLUSTER" \
    --token "$TOKEN" \
    --project-id "$PROJECT_ID" \
    --source-type "json" \
    --map-title "data.headline" \
    --map-body "data.body" \
    --map-id "data.id" \
    --map-created-at "data.versionCreated" \
    --source-name "JSON WIKI TEST" \
    --item-schema "Items" \
    --facets-file "facets.json" \
    --source-file "$SOURCE_FILE" \
    --transform-items

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.
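The attachment.json and facets.json files referenced above are attachments to the original page. Purely as an illustration, an input file compatible with the --item-schema "Items" setting and the data.* mappings used in this example could be shaped as follows; all values are made up.

Code Block
languagebash
# Hypothetical nested JSON input: the top-level "Items" list is un-nested via
# --item-schema, and the dotted paths (data.headline, data.body, data.id,
# data.versionCreated) match the --map-* options above.
cat > attachment.json <<'EOF'
{
    "Items": [
        {
            "data": {
                "id": "item-1",
                "headline": "Example headline",
                "body": "Example body text.",
                "versionCreated": "2020-01-01T00:00:00"
            }
        }
    ]
}
EOF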

Database Options

When loading from a database, both full and incremental load are supported, using a select query supplied as a string or in a file. The script uses SQLAlchemy to connect to any database.

Tested databases:

  • Postgres and all databases using the postgres driver for connection (Greenplum, Redshift etc)

  • Microsoft SQL

  • Oracle

  • MySQL

  • SQLite

The command line parameters used for a database source:

--db-connection STRING (mandatory)
Database connection string.

--input-file PATH
File containing the SQL code.

--input-query STRING
SQL query.

Note that the --input-file and --input-query arguments are mutually exclusive.

Usage

In the following example we perform a simple load from a database, mapping the title, id and body of the Squirro item to columns returned by the query in the SQL file passed via --input-file. The Data Loader tool performs a full load of all the rows returned by that query, since the argument --incremental-column is not set.

Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --db-connection $db_connection_string \
    --source-name db_sample \
    --input-file $script_dir/interaction.sql \
    --source-type database \
    --map-title Title \
    --map-id ID \
    --map-body Description 

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.
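As an illustration, the --db-connection value follows the SQLAlchemy URL format and the file passed via --input-file simply contains the select query. The values below are placeholders (host names, credentials and the table/column names are assumptions), and the exact dialect prefix depends on the installed database driver.

Code Block
languagebash
# Example SQLAlchemy-style connection strings (placeholders only):
db_connection_string='postgresql://user:password@dbhost:5432/mydb'
# db_connection_string='mssql+pymssql://user:password@dbhost/mydb'
# db_connection_string='sqlite:////path/to/local.db'

# Hypothetical SQL file passed via --input-file; the column aliases match the
# --map-* options used in the load command above.
cat > sample.sql <<'EOF'
SELECT id          AS ID,
       title       AS Title,
       description AS Description,
       updated_at
FROM products
EOF

For incremental loads, a datetime column returned by the query (for example the assumed updated_at column above) can then be referenced with --incremental-column.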

Filesystem Options

The command line parameters used for a filesystem source:

--folder PATH
Filesystem location that will be indexed in Squirro. One of --folder or --zip-file-path must be provided.

--zip-file-path
Absolute path to a zip file containing all the files to be imported into Squirro. One of --folder or --zip-file-path must be provided.

--deletions
If set, then any files that are no longer present on the file system are also removed from Squirro. To use this, the --map-flag option also needs to be used to ensure new/updated and deleted files are handled correctly:

Code Block
--map-flag flag

--include-file PATH
Path to a file containing inclusion rules.

This is a list of patterns that files need to match to be indexed. If provided, then only files that match at least one pattern are indexed.

--exclude-file PATH
Path to a file containing exclusion rules.

This is a list of patterns for files that should not be indexed. Any file that matches at least one such pattern is not indexed, independent of whether it also matches the include rules.

--skip-errors
Ignore any file system errors when processing individual files. This way a single file system read error does not prevent the entire load from succeeding. If the error is temporary, the file will be picked up in the next load.

Performance Optimisations

--convert-file PATH
Path to a file containing conversion file patterns.

Files that match any of these rules will be indexed with full content. See Content Extraction for the file types that Squirro supports full indexing for. By limiting this to a smaller number of extensions, the file system loader only processes content for which indexing in Squirro will be effective.

--file-size-limit
Maximum size in megabytes of files that should be indexed with content. Also see --index-all below.

--index-all
If set, then files over the --file-size-limit are indexed, but without their content. In the default case of this not being set, those files are skipped entirely.

--batch-size-limit
Maximum size of requests sent to Squirro's API for the indexing of files.

--deduplicate
Deduplicate files based on file content. Exact duplicates are only indexed once, with duplicates ignored.

Logging and Debugging

--log-excludes
Log matches for inclusion/exclusion rules.

--progress
Log progress verbosely.

Usage

File system loading is implemented as a Data Loader plugin and is invoked through the usual Data Loader command.

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.
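The --deletions option described above needs the flag mapping to work correctly. The sketch below is a variation of the previous command that combines the documented options for deletion handling and error skipping; FOLDER_TO_INDEX remains a placeholder.

Code Block
languagebash
# Sketch: filesystem load that also removes files deleted on disk.
# --map-flag flag marks each record as insert/update or delete,
# --skip-errors keeps a single read error from failing the whole load.
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --deletions \
    --map-flag flag \
    --skip-errors \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json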

Squirro Options

When loading data from a Squirro source, the following command line parameters can be used:

--source-cluster (mandatory)
Source Squirro cluster URL.

--source-token (mandatory)
Source Squirro token.

--source-project-id (mandatory)
Source Squirro project ID.

--source-query
Squirro query.

--include-facets
If set, then keywords are included.

--facet-delimiter
Character or string used to delimit facets with multiple values.

--include-entities
If set, then entities are included.

--include-webshot
If set, then webshots are included, but only when one of --map-webshot-url or --map-webshot-picture-hint is set.

--progress
If set, detailed per-row progress information is logged.

--deduplicate
If set, items will be deduplicated based on their titles.

--retry
Number of retries to make in case of an error.

Usage

A simple example for loading data from a Squirro source is given below:

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-cluster $SOURCE_CLUSTER \
    --source-token $SOURCE_TOKEN \
    --source-project-id $SOURCE_PROJECT_ID \
    --source-type squirro \
    --source-query "*" \
    --include-facets \
    --include-entities \
    --map-title "title" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --progress \
    --deduplicate \
    --retry 5

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.

Feed Options

When loading data from a feed source, the following command line parameters can be used:

--feed-sources (mandatory)
Space-separated list of URLs (strings).

--query-timeout
Timeout (in seconds) for fetching the feed.

--max-backoff
Maximum number of hours to wait if the feed update frequency is low.

--custom-date-field
For non-standard RSS datetime fields, enter the field name.

--custom-date-format
For non-standard RSS datetime formats, enter the format, e.g. %m/%d/%Y.

--rss-username
Username for RSS Basic Authentication.

--rss-password
Password for RSS Basic Authentication.

Usage

A simple example for loading data from a feed source is given below:

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type feed \
    --source-name feed_sample \
    --feed-sources 'https://www.theregister.co.uk/headlines.atom' 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' \
    --map-title "title" \
    --map-id "id" \
    --map-body "description" \
    --map-created-at "created_at" \
    --batch-size 100 \
    --source-batch-size 100 \
    --facets-file facets.json

Note that the lines have been wrapped with a backslash (\) at the end of each line. On a Windows setup you will need to use a circumflex (^) instead.

User defined sources

If data needs to be extracted from sources other than the ones described above, there is the option to write a custom source. To do this, a new Python module must be created that implements the abstract base class DataSource. In this way the Data Loader can index data from other sources without modifications.

To use the custom source, supply the full path of the Python module with the command line argument --source-script instead of using --source-type. Adding the custom module to the PYTHONPATH and importing it is done automatically by the loader.
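Purely as an illustration, a custom source module (here a hypothetical my_source.py implementing DataSource) would be invoked as shown below; all other arguments keep their usual meaning.

Code Block
languagebash
# Sketch: load data through a hypothetical custom Data Loader plugin.
# --source-script replaces --source-type and points at the Python module.
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-script /path/to/my_source.py \
    --source-name custom_sample \
    --map-title Title \
    --map-id ID \
    --map-body Description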