Excerpt |
---|
This section shows all the available options for the Data Loader and explains their usage. |
...
Argument | Mandatory | Description | ||
---|---|---|---|---|
General Options | ||||
--help, -h | Show a help message and exit. | |||
--version | Output the tool version and exit. | |||
--verbose, -v | Increase log verbosity.
| |||
--log-file | Path to a log file on disk, where the log output is to be stored. If this is not specified, the log messages are shown on the console. | |||
--parallel-uploaders NUMBER | Number of uploaders (default is 1).
| |||
--meta-db-dir PATH | Directory of the SQLite metadata database. | |||
--meta-db-file STRING | File name of the SQLite metadata database. | |||
Connection Options (see Connecting to Squirro for finding these values) | ||||
--token TOKEN | Yes | The Authentication Token with which to authenticate. If the token value starts with a dash, you need to use an equal sign to specify the value like this: --token="-12345…" | ||
--cluster URL | The Squirro Cluster into which to import the data. | |||
--project-id PROJECT_ID | Yes | The Project identifier into which to import the data. | ||
Testing Options | ||||
--limit LIMIT | If set, then only this many items are sent to Squirro. This can be used to test the options on a small subset of the data to make sure the mapping, facets and pipelets all work correctly. | |||
--dry-run, -n | Do all the processing, except server-side actions (no uploads or facet configuration). | |||
Source - Item Mapping Options | ||||
--map-title STRING | Which column is mapped to the "title" field. | |||
--map-abstract STRING | Which column is mapped to the "summary" field. | |||
--map-created-at STRING | Which column is mapped to the "created_at" field. | |||
--map-id STRING | Which column is mapped to the "external_id" field. | |||
--map-body [STRING…] | Which columns are mapped to the "body" field. | |||
--map-body-mime STRING | Which column is mapped to the MIME type of the body. Use The typical values in this field for this are either | |||
--body-mime STRING | Fixed MIME type of the body. Use The typical values for this are either | |||
--map-url STRING | Which column is mapped to the "link" field. | |||
--map-file-name STRING | Which column is mapped to the file-name. | |||
--map-file-mime STRING | Which column is mapped to file mime type. | |||
--map-file-data STRING | Which column is mapped to file contents. | |||
--map-file-compressed STRING | Which column specifies if the file is compressed with gz. Possible values should be 'y, yes, t, true, 1', case insensitive. | |||
--map-flag STRING | Which column determines if the received record is an insert/update or delete. If the value is 'd' the record is deleted, otherwise is considered an insert/update | |||
ItemUploader Options (see ItemUploader documentation for more information) | ||||
--object-id OBJECT_ID | Object identifier. | |||
--source-id SOURCE_ID | Source identifier, defaults to the input file name. | |||
--source-name SOURCE_NAME | Source name, defaults to the input file name. | |||
--bulk-index | If set, the cluster is instructed to index data in bulk. | |||
--bulk-index-add-summary-from-body | If set, the cluster is instructed to add the summary from the body during bulk indexing. | |||
--enable-filtering | Enable item filtering to support alerts. | |||
--enable-near-duplicate-detection | Enable near duplicate detection to link similar documents in the Squirro user interface. (Added with the Squirro 2.4.6 release.) | |||
--batch-size NUMBER | Batch size for uploads (default is auto - change value based on the size of the payload). | |||
Item pre-processing Options | ||||
--body-template-file PATH | Jinja2 html template file with full path. | |||
--title-template-file PATH | Jinja2 html template file with full path. | |||
--abstract-template-file PATH | Jinja2 html template file with full path. | |||
--pipelets-file PATH | JSON file containing the pipelets called by the db loader in execution order. | |||
--pipelets-error-behavior STRING | Specify job behavior in case a pipelet raises an exception. Valid values are | |||
--facets-file PATH | JSON file containing facets configuration. | |||
Source Options | ||||
--source-type STRING | Type of source to load data from. Valid values are | |||
--source-script PATH | Path of the Data Loader Plugin Python script. | |||
--source-batch-size NUMBER | Batch size for source unloads (default is "1000"). | |||
--incremental-column STRING | Which column will be used as incremental reference - usually for a datetime column. If missing a full load will be done. | |||
--job-id STRING | Used for job locking and storing the last-known value of incremental columns. If not given, this is calculated based on the source parameters. | |||
--reset | Deletes incremental date information for the current sql query. Useful to perform an incremental load with reset. |
...
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token $token ^ --cluster $cluster ^ --project-id $project_id ^ --source-name csv_sample ^ --source-file sample.csv ^ --source-type csv ^ --csv-delimiter , ^ --csv-quotechar " ^ --map-title Title ^ --map-id ID ^ --map-body Description |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
This example assumes that $token, $cluster and $project_id have been declared beforehand.
Excel Source Options
When using an excel file as the data source, only full load is supported and data must have a header in order to determine the schema. If the first row of the data (after applying the boundaries, if needed) is not the header, a KeyError exception will be raised and the job will stop. In this case it’s not possible to determine the schema of the data.
The command line parameters used for an excel data source:
Argument | Mandatory | Description |
---|---|---|
--excel-sheet STRING | Excel sheet name. Default: get first sheet. | |
--excel-boundaries NUMBER: NUMBER | Limit rows loaded from excel. Format is: start_row:rows_discarded_from_end. | |
--source-file PATH | Yes | Path of excel data file. |
Usage
The example below shows a simple load from an excel file, mapping only the title, id and body of the Squirro item to columns from the ‘sample.xlsx’ excel file, without using any of the additional files for facets, templating, piplets etc. The Data Loader tool will only load the ‘Products’ sheet of the file and from this sheet the rows starting at 1 up to the last 100 rows, which will not be loaded.
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token $token ^ --cluster $cluster ^ --project-id $project_id ^ --source-name excel_sample ^ --source-file sample.xlsx ^ --source-type excel ^ --excel-sheet Products ^ --excel-boundaries 1:100 ^ --map-title Title ^ --map-id ID ^ --map-body Description |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
This example assumes that $token, $cluster and $project_id have been declared beforehand.
Database Options
When loading from a database, both full and incremental load are supported, using a select query supplied as a string or in a file. The script uses uses SQLAlchemy to connect to any database.
Tested databases:
- Postgres and all databases using the postgres driver for connection (Greenplum, Redshift etc)
- Microsoft SQL
- Oracle
- MySQL
- SQLite
The command line parameters used for a database source:
Argument | Mandatory | Description |
---|---|---|
--db-connection STRING | Yes | Database connection string. |
--input-file PATH | File containing the SQL code. | |
--input-query STRING | SQL query. |
Note that the --input-file
and --input-query
arguments are mutually exclusive.
Usage
In the following example we are performing a simple load from the database, mapping the title, id and body of the Squirro item to columns from a database table that is interrogated in the sample.sql file. The Data Loader tool makes a full load of all the rows specified in the sample.sql file since the argument --incremental-column
is not set.
Code Block | ||
---|---|---|
| ||
squirro_data_load -v ^ --token $token ^ --cluster $cluster ^ --project-id $project_id ^ --db-connection $db_connection_string ^ --source-name db_sample ^ --input-file $script_dir/interaction.sql ^ --source-type database ^ --map-title Title ^ --map-id ID ^ --map-body Description |
Note that the lines have been wrapped with the circumflex (^) at the end of each line. On Mac and Linux you will need to use backslash (\) instead.
This example assumes that $token, $cluster and $project_id have been declared beforehand.
...