This section shows all the available options for the Data Loader and explains their usage.

Basic Usage

...


Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name csv_sample \
    --source-file sample.csv \
    --source-type csv \
    --csv-delimiter ',' \
    --csv-quotechar '"' \
    --map-title Title \
    --map-id ID \
    --map-body Description

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

Excel Source Options

When using an Excel file as the data source, only full load is supported, and the data must have a header row so that the schema can be determined. If the first row of the data (after applying the boundaries, if set) is not the header, the schema cannot be determined: a KeyError exception is raised and the job stops.

The command line parameters used for an excel data source:

| Argument | Mandatory | Description |
| --- | --- | --- |
| --excel-sheet STRING | No | Excel sheet name. Default: the first sheet. |
| --excel-boundaries NUMBER:NUMBER | No | Limit the rows loaded from the Excel file. Format: start_row:rows_discarded_from_end. |
| --source-file PATH | Yes | Path of the Excel data file. |

Usage

The example below shows a simple load from an Excel file, mapping only the title, id and body of the Squirro item to columns from the 'sample.xlsx' file, without using any of the additional files for facets, templating, pipelets etc. The Data Loader tool only loads the 'Products' sheet of the file, starting at row 1 and discarding the last 100 rows.

Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --source-name excel_sample \
    --source-file sample.xlsx \
    --source-type excel \
    --excel-sheet Products \
    --excel-boundaries 1:100 \
    --map-title Title \
    --map-id ID \
    --map-body Description 

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

JSON Source Options

When using a JSON file as the data source, only full load is supported.

The schema of the data is determined from the first item found, on the assumption that all items share the same structure. If that is not the case, the loader may fail.

The command line parameters used for a JSON data source:

| Argument | Mandatory | Description |
| --- | --- | --- |
| --item-schema STRING | No | If the JSON objects are not available as a top-level structure, use this parameter to un-nest the JSON structure. |
| --source-file PATH | No, one of --source-file or --source-folder must be provided | Path of the JSON data file. |
| --source-folder PATH | No, one of --source-file or --source-folder must be provided | Path of a directory containing multiple JSON files. Only available in CLI mode. |
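For illustration, the snippet below creates a minimal nested input file of the kind that requires --item-schema. The file name and field layout are hypothetical, chosen only to show items sitting under a top-level "Items" key rather than at the top level:

```shell
# Hypothetical nested JSON input: the items are not a top-level array
# but live under an "Items" key, so --item-schema 'Items' is needed
# to un-nest them before the schema is determined.
cat > sample_nested.json <<'EOF'
{
  "Items": [
    {
      "data": {
        "id": "1",
        "headline": "First headline",
        "body": "Article body",
        "versionCreated": "2020-01-01T00:00:00Z"
      }
    }
  ]
}
EOF
```

With this file, the loader would be invoked with --source-file sample_nested.json --item-schema 'Items', and the mappings would address the nested fields (e.g. --map-title 'data.headline').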

Usage

The example below shows a load from a nested JSON file (see attachment.json for an example and facets.json for a demonstration of how to map facets).

Code Block
languagebash
squirro_data_load -vv \
    --cluster "$CLUSTER" \
    --token "$TOKEN" \
    --project-id "$PROJECT_ID" \
    --source-type 'json' \
    --map-title 'data.headline' \
    --map-body 'data.body' \
    --map-id 'data.id' \
    --map-created-at 'data.versionCreated' \
    --source-name 'JSON WIKI TEST' \
    --facets-file 'facets.json' \
    --source-file "$SOURCE_FILE" \
    --item-schema 'Items' 

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.

Database Options

When loading from a database, both full and incremental loads are supported, using a select query supplied as a string or in a file. The script uses SQLAlchemy to connect to any database.

Tested databases:

  • Postgres and all databases using the Postgres driver for connection (Greenplum, Redshift, etc.)
  • Microsoft SQL
  • Oracle
  • MySQL
  • SQLite
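Since SQLAlchemy is used, --db-connection takes a standard SQLAlchemy database URL. A few illustrative values follow; the hosts, ports, credentials and driver choices are placeholders, and the exact driver prefix depends on which Python database driver is installed:

```shell
# SQLAlchemy-style connection URLs (placeholder credentials and hosts).
db_connection_string='postgresql://user:secret@dbhost:5432/mydb'  # Postgres
# db_connection_string='mysql://user:secret@dbhost:3306/mydb'     # MySQL
# db_connection_string='sqlite:///local.db'                       # SQLite file
echo "$db_connection_string"
```

The chosen value is then passed to the loader via --db-connection "$db_connection_string".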

The command line parameters used for a database source:

| Argument | Mandatory | Description |
| --- | --- | --- |
| --db-connection STRING | Yes | Database connection string. |
| --input-file PATH | No | File containing the SQL code. |
| --input-query STRING | No | SQL query. |

Note that the --input-file and --input-query arguments are mutually exclusive.

Usage

In the following example we perform a simple load from the database, mapping the title, id and body of the Squirro item to columns from a database table queried by the SQL file passed via --input-file. The Data Loader tool performs a full load of all the rows returned by the query, since the --incremental-column argument is not set.

Code Block
languagebash
squirro_data_load -v \
    --token $token \
    --cluster $cluster \
    --project-id $project_id \
    --db-connection $db_connection_string \
    --source-name db_sample \
    --input-file $script_dir/interaction.sql \
    --source-type database \
    --map-title Title \
    --map-id ID \
    --map-body Description 

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $token, $cluster and $project_id have been declared beforehand.

...

| Option | Mandatory | Description |
| --- | --- | --- |
| --folder PATH | No, one of --folder or --zip-file-path must be provided | Filesystem location that will be indexed in Squirro. |
| --zip-file-path | No, one of --folder or --zip-file-path must be provided | Absolute path to a zip file containing all the files to be imported into Squirro. |
| --deletions | No | If set, any files that are no longer present on the file system are also removed from Squirro. To use this, the --map-flag option also needs to be used (--map-flag flag) to ensure new/updated and deleted files are handled correctly. |
| --include-file PATH | No | Path to a file containing inclusion rules: a list of patterns that files must match to be indexed. If provided, only files that match at least one pattern are indexed. |
| --exclude-file PATH | No | Path to a file containing exclusion rules: a list of patterns for files that should not be indexed. Any file that matches at least one such pattern is not indexed, regardless of whether it also matches the include rules. |
| --skip-errors | No | Ignore file system errors when processing individual files, so that a single read error does not prevent the entire load from succeeding. If the error is temporary, the file is picked up in the next load. |

Performance Optimisations

| Option | Mandatory | Description |
| --- | --- | --- |
| --convert-file PATH | No | Path to a file containing conversion file patterns. Files that match any of these rules are indexed with full content. See Content Conversion for the file types that Squirro supports full indexing for. Limiting this to a smaller set of extensions allows the file system loader to only process content for which indexing will be effective. |
| --file-size-limit | No | Maximum size in megabytes of files that should be indexed with content. Also see --index-all below. |
| --index-all | No | If set, files over the --file-size-limit are indexed, but without their content. By default, when this is not set, those files are skipped entirely. |
| --batch-size-limit | No | Maximum size of requests sent to Squirro's API for indexing of files. |
| --deduplicate | No | Deduplicate files based on file content. Exact duplicates are only indexed once; the duplicates are ignored. |

Logging and Debugging

| Option | Mandatory | Description |
| --- | --- | --- |
| --log-excludes | No | Log matches for inclusion/exclusion rules. |
| --progress | No | Log progress verbosely. |
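As an illustration, an exclusion rules file could look like the following. The glob-style pattern syntax shown here is an assumption for illustration; verify the exact rule format against the loader's documentation:

```shell
# Hypothetical exclusion rules file: one pattern per line.
# Files matching any pattern are skipped during indexing.
cat > exclude.txt <<'EOF'
*.tmp
*.log
*/.git/*
EOF
```

The file would then be passed to the loader as --exclude-file exclude.txt; an inclusion rules file for --include-file would use the same one-pattern-per-line layout.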

...

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type filesystem \
    --folder FOLDER_TO_INDEX \
    --map-title "title" \
    --map-file-name "file_name" \
    --map-file-mime "file_mime" \
    --map-file-data "file_data" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --facets-file facets.json

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.

Squirro Options

When loading data from a Squirro source, the following command-line parameters can be used:

...

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-cluster $SOURCE_CLUSTER \
    --source-token $SOURCE_TOKEN \
    --source-project-id $SOURCE_PROJECT_ID \
    --source-type squirro \
    --source-query "*" \
    --include-facets \
    --include-entities \
    --map-title "title" \
    --map-id "id" \
    --map-url "link" \
    --map-created-at "created_at" \
    --progress \
    --deduplicate \
    --retry 5

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

This example assumes that $TOKEN, $CLUSTER and $PROJECT_ID have been declared beforehand.

Feed Options

When loading data from a feed source, the following command-line parameters can be used:

| Argument | Mandatory | Description |
| --- | --- | --- |
| --feed-sources | Yes | Space-separated list of URLs (strings). |
| --query-timeout | No | Timeout (in seconds) for fetching the feed. |
| --max-backoff | No | Maximum number of hours to wait if the feed update frequency is low. |
| --custom-date-field | No | For non-standard RSS datetime fields, the name of the field. |
| --custom-date-format | No | For non-standard RSS datetime formats, the format, e.g. %m/%d/%Y. |
| --rss-username | No | Username for RSS basic authentication. |
| --rss-password | No | Password for RSS basic authentication. |

Usage

A simple example of loading data from a feed source:

Code Block
languagebash
squirro_data_load -v \
    --token $TOKEN \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --source-type feed \
    --source-name feed_sample \
    --feed-sources 'https://www.theregister.co.uk/headlines.atom' 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' \
    --map-title "title" \
    --map-id "id" \
    --map-body "description" \
    --map-created-at "created_at" \
    --batch-size 100 \
    --source-batch-size 100 \
    --facets-file facets.json

Note that the lines have been wrapped with a backslash (\) at the end of each line, as used in bash on Linux and macOS. On a Windows setup you will need to use the caret (^) instead.

User-defined sources

If data needs to be extracted from a source other than the ones described above, there is the option to write a custom source. To do this, create a new Python module that implements the abstract base class DataSource. The Data Loader can then index data from the new source without modifications.

...