Introduction
A Dataloader Template is a step towards accelerating the process of loading data into Squirro. By using pre-defined custom mappings while loading data, the loading process can be reduced to just a few clicks. When constructing a dataloader plugin, the author can now create a default template, which is elaborated in this section.
Construct a dataloader plugin with a template
A dataloader plugin with a default template can be constructed with the help of the dataloader_plugin.json file placed in the plugin directory.
```json
{
    "title": "Feed Testing",
    "description": "Subscribe to an RSS or Atom feed.",
    "plugin_file": "feed_plugin.py",
    "scheduling_options_file": "scheduling_options.json",
    "dataloader_options_file": "mappings.json",
    "pipeline_workflow_file": "pipeline_workflow.json",
    "category": "web",
    "thumbnail_file": "feed.png",
    "auth_file": "auth.py",
    "override": "feed"
}
```
The following fields in the above JSON contain the default template information and are crucial if the plugin is to be used with a template.
1. The scheduling_options_file field of dataloader_plugin.json requires a JSON file that defines the default scheduling parameters for the plugin. The scheduling_options.json shown below exemplifies such a file.
```json
{
    "schedule": true,
    "first_run": "2020-08-31T11:30:00",
    "repeat": "15m"
}
```
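The repeat value uses a short interval syntax, for example 15m for every fifteen minutes. As an illustrative sketch only (not the actual Squirro scheduler code), such values can be interpreted as an integer followed by a unit suffix:

```python
from datetime import timedelta

# Illustrative only: Squirro's scheduler parses "repeat" internally.
# Assumed format: an integer followed by a unit suffix (m/h/d).
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_repeat(repeat):
    """Turn a repeat string such as "15m" into a timedelta."""
    value, unit = int(repeat[:-1]), repeat[-1]
    if unit not in _UNITS:
        raise ValueError("unknown repeat unit: %r" % unit)
    return timedelta(**{_UNITS[unit]: value})

print(parse_repeat("15m"))  # 0:15:00
```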
2. The pipeline_workflow_file field of dataloader_plugin.json requires a JSON file that sets the steps of the default ingestion pipeline workflow. The pipeline_workflow.json shown below depicts a typical value for this field.
```json
{
    "steps": [
        {
            "config": {
                "policy": "replace"
            },
            "id": "deduplication",
            "name": "Deduplication",
            "type": "deduplication"
        },
        {
            "id": "language-detection",
            "name": "Language Detection",
            "type": "language-detection"
        },
        {
            "id": "cleanup",
            "name": "Content Standardization",
            "type": "cleanup"
        },
        {
            "id": "index",
            "name": "Indexing",
            "type": "index"
        }
    ]
}
```
3. The dataloader_options_file field requires a JSON file that maps the various fields coming from the source to the corresponding Squirro item fields. The mappings.json shown below exemplifies such usage.
```json
{
    "map_id": "id",
    "map_title": "title",
    "map_created_at": "created_at",
    "map_url": "link",
    "map_body": "body",
    "facets_file": "facets.json"
}
```
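The map_* keys bind source record fields to Squirro item fields. A minimal sketch of the kind of transformation these mappings describe (the function and sample field values below are illustrative, not the actual dataloader internals):

```python
def apply_mappings(record, mappings):
    """Build a Squirro-style item dict from a raw source record.

    Illustrative sketch: keys of the form "map_<item_field>" name the
    source field whose value fills <item_field> on the item.
    """
    item = {}
    for key, source_field in mappings.items():
        if not key.startswith("map_"):
            continue  # e.g. "facets_file" is handled separately
        if source_field in record:
            item[key[len("map_"):]] = record[source_field]
    return item

record = {"id": "42", "title": "Hello", "link": "https://example.com"}
mappings = {"map_id": "id", "map_title": "title", "map_url": "link"}
print(apply_mappings(record, mappings))
# {'id': '42', 'title': 'Hello', 'url': 'https://example.com'}
```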
Once the fields described above are set, the dataloader plugin with its template can be created. It can be uploaded using squirro_asset.
An example of the creation of the feed_plugin can be found here: https://github.com/squirro/dataloader-plugins/tree/feed-template-testing
Validate Default Mappings
To validate whether a plugin can be used with a template, the has_default_mappings flag is used. has_default_mappings is True if the dataloader plugin has all three of the fields specified above set.
```python
required_keys = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]
```
NOTE: If the above required_keys are not set at upload time, the dataloader plugin cannot be used as a template.
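The check itself amounts to verifying that all three required keys are present in dataloader_plugin.json. A minimal sketch of such a validation (the function below is illustrative; the actual server-side check may differ):

```python
import json

REQUIRED_KEYS = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]

def has_default_mappings(plugin_config):
    """Return True if the plugin config defines all template fields."""
    return all(plugin_config.get(key) for key in REQUIRED_KEYS)

config = json.loads("""{
    "title": "Feed Testing",
    "dataloader_options_file": "mappings.json",
    "scheduling_options_file": "scheduling_options.json",
    "pipeline_workflow_file": "pipeline_workflow.json"
}""")
print(has_default_mappings(config))                    # True
print(has_default_mappings({"title": "Feed Testing"}))  # False
```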
Squirro Client usage for default fields
To leverage the default template settings with the Squirro client, a config parameter is passed to the client methods. The config can be defined as below.
```python
config = {
    "dataloader_options": {"plugin_name": "feed_plugin"},
    "dataloader_plugin_options": {
        "feed_sources": ["https://www.nzz.ch/recent.rss"],
        "query_timeout": 30,
        "max_backoff": 24,
        "custom_date_field": "",
        "custom_date_format": "",
        "rss_username": "",
        "rss_password": "",
    },
}
```
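In this config, dataloader_options identifies the plugin to load with, while dataloader_plugin_options carries the plugin-specific parameters (here, those of the feed plugin). A small sketch that checks this shape before handing the dict to the client (a purely illustrative helper, not part of the Squirro client API):

```python
def validate_source_config(config):
    """Check the minimal shape expected of a source config dict.

    Illustrative helper: verifies the two top-level sections used by
    the template mechanism are present and well-formed.
    """
    if "plugin_name" not in config.get("dataloader_options", {}):
        raise ValueError("dataloader_options.plugin_name is required")
    if not isinstance(config.get("dataloader_plugin_options", {}), dict):
        raise ValueError("dataloader_plugin_options must be a dict")
    return True

config = {
    "dataloader_options": {"plugin_name": "feed_plugin"},
    "dataloader_plugin_options": {
        "feed_sources": ["https://www.nzz.ch/recent.rss"],
    },
}
print(validate_source_config(config))  # True
```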
After the successful upload of the dataloader plugin, the new source can be created by the client using:
```python
new_source = client.new_source(
    project_id=project_id,
    name=name,
    config=config,
    pipeline_workflow_id=None,
    scheduling_options={'schedule': True, 'repeat': '30m'},
    use_default_options=use_default_dataloader_options,
)
```
To fetch a source which includes such config:
...
This page can now be found at Dataloader Templates on the Squirro Docs site.