Introduction

A Dataloader Template is a step towards accelerating the process of loading data into Squirro. By making use of pre-defined custom mappings while loading data, the loading process can be reduced to just a few clicks. While constructing a dataloader plugin, the author can now create a default template, as elaborated in this section.

Construct a dataloader plugin with a template

A dataloader plugin with a default template can be constructed with the help of the dataloader_plugin.json file placed in the plugin directory.

Code Block
{
    "title": "Feed Testing",
    "description": "Subscribe to an RSS or Atom feed.",
    "plugin_file": "feed_plugin.py",
    "scheduling_options_file": "scheduling_options.json",
    "dataloader_options_file": "mappings.json",
    "pipeline_workflow_file": "pipeline_workflow.json",
    "category": "web",
    "thumbnail_file": "feed.png",
    "auth_file": "auth.py",
    "override": "feed"
}

The following fields in the above JSON contain the default template information and are crucial if the plugin is to be used with a template.

1. The scheduling_options_file field of the dataloader_plugin.json requires a JSON file which defines the default scheduling parameters for the plugin. The scheduling_options.json shown below exemplifies such a file.

Code Block
{
    "schedule": true,
    "first_run": "2020-08-31T11:30:00",
    "repeat": "15m"
}

2. The pipeline_workflow_file field of the dataloader_plugin.json requires a JSON file which sets the steps of the default ingestion pipeline workflow. The pipeline_workflow.json shown below depicts the usage of such a file for this field.

Code Block
{
    "steps": [
        {
            "config": {
                "policy": "replace"
            },
            "id": "deduplication",
            "name": "Deduplication",
            "type": "deduplication"
        },
        {
            "id": "language-detection",
            "name": "Language Detection",
            "type": "language-detection"
        },
        {
            "id": "cleanup",
            "name": "Content Standardization",
            "type": "cleanup"
        },
        {
            "id": "index",
            "name": "Indexing",
            "type": "index"
        }
    ]
}

3. The dataloader_options_file field requires a JSON file which maps the various fields coming from the source to the corresponding Squirro item fields. The mappings.json shown below exemplifies such usage.

Code Block
{
    "map_id": "id",
    "map_title": "title",
    "map_created_at": "created_at",
    "map_url": "link",
    "map_body": "body",
    "facets_file": "facets.json"
}
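
The mappings.json above also references a facets_file. The facets.json sketched below is only a hypothetical illustration of such a file; the source field names and facet settings used here are assumptions and depend entirely on the data source being loaded.

Code Block
{
    "category": {
        "name": "Category",
        "default_value": "Unknown"
    },
    "author": {
        "name": "Author",
        "visible": true,
        "searchable": true
    }
}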

Once the fields described above are set, the dataloader plugin with its template can be created. It can be uploaded using squirro_asset.

An example of the creation of the feed_plugin can be found here: https://github.com/squirro/dataloader-plugins/tree/feed-template-testing

Validate Default Mappings

To validate whether a plugin can be used with a template, has_default_mappings is used. has_default_mappings is True if the dataloader plugin has the three fields specified above set.

Code Block
required_keys = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]

NOTE: If the above required_keys are not set at the time of upload, the dataloader plugin cannot be used as a template.
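
To illustrate this check, the snippet below sketches how such a validation could be performed against a parsed dataloader_plugin.json. The helper function and its use here are hypothetical and serve only as an illustration, not as Squirro's internal implementation.

Code Block
import json

REQUIRED_KEYS = [
    "dataloader_options_file",
    "scheduling_options_file",
    "pipeline_workflow_file",
]


def has_default_mappings(plugin_config_path):
    """Return True if the plugin config defines all template fields (illustrative helper)."""
    with open(plugin_config_path) as f:
        plugin_config = json.load(f)
    return all(key in plugin_config for key in REQUIRED_KEYS)


# Example: check the dataloader_plugin.json shown above.
print(has_default_mappings("dataloader_plugin.json"))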

Squirro Client usage for default fields

To leverage the default template settings with the Squirro client, a config parameter is used with the client methods. The config can be defined as shown below.

Code Block
config = {
    "dataloader_options": {"plugin_name": "feed_plugin"},
    "dataloader_plugin_options": {
        "feed_sources": ["https://www.nzz.ch/recent.rss"],
        "query_timeout": 30,
        "max_backoff": 24,
        "custom_date_field": "",
        "custom_date_format": "",
        "rss_username": "",
        "rss_password": "",
    }
}
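
The client calls that follow assume an authenticated SquirroClient and a project_id. A minimal setup sketch is shown below; the cluster URL, refresh token, and project ID are placeholders to be replaced with your own values.

Code Block
from squirro_client import SquirroClient

# Placeholder cluster URL and refresh token - replace with your own values.
client = SquirroClient(None, None, cluster="https://your-squirro-cluster.example.com")
client.authenticate(refresh_token="<your_refresh_token>")

project_id = "<your_project_id>"
name = "Feed Testing"
use_default_dataloader_options = True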

After the successful upload of the dataloader plugin, a new source can be created by the client using:

Code Block
new_source = client.new_source(
    project_id=project_id,
    name=name,
    config=config,
    pipeline_workflow_id=None,
    scheduling_options={'schedule': True, 'repeat': '30m'},
    use_default_options=use_default_dataloader_options
)
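
The use_default_options argument (set here from use_default_dataloader_options) is what instructs the client to apply the plugin's default template settings described above to the new source.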

To fetch a source which includes such a config:

...
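
As a rough sketch, such a call might look like the following, assuming the client's get_source method accepts project_id, source_id, and an include_config flag; verify the exact signature against your SquirroClient version.

Code Block
# Hypothetical sketch: fetch the source together with its configuration.
source = client.get_source(
    project_id=project_id,
    source_id=new_source["id"],  # assumes new_source returned the created source as a dict
    include_config=True,
)
print(source.get("config"))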

This page can now be found at Dataloader Templates on the Squirro Docs site.