Introduction

One-click connectors are an easy way to fetch data from different sources without any configuration on the user side. Thanks to OAuth2 authentication and pre-defined custom mappings, the whole loading process is intuitive and takes almost just ONE CLICK.

This tutorial goes step by step through building a custom one-click connector. It describes the whole process in detail, from setting up the environment to uploading the custom plugin. Using code from already-built examples, it gives insight into how particular parts work and shows best practices for building one-click connectors.

Prerequisites

To get started, install the Squirro Toolbox. This can be done either using Toolbox or https://squirro.atlassian.net/wiki/spaces/BOX.

...

Example of file tree for OneDrive connector:

| File | Required? | Purpose |
|---|---|---|
| __init__.py | | Marks the directory as a Python package |
| README.md | | Describes plugin installation steps, like OAuth2 app configuration |
| auth.py | | Deals with OAuth2 or other authorization process |
| dataloader_plugin.json | Yes | Contains references to other files and the plugin name |
| facets.json | OCC only | Describes a common selection of facets |
| icon.png | Yes | Plugin icon |
| mappings.json | Yes | Mapping of dataloader items to Squirro fields |
| onedrive_plugin.py | Yes | Core code |
| pipeline_workflow.json | OCC only | onedrive_plugin/pipeline_workflow.json |
| requirements.txt | | Describes Python dependencies |
| scheduling_options.json | OCC only | Defines sane scheduling defaults |

When creating your own plugin, replace onedrive in the file and folder names with a name corresponding to the connector's target service. Use only lowercase alphanumeric characters.
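The naming rule can be checked mechanically. The following is a minimal sketch, not part of the Squirro tooling: the helper names `is_valid_plugin_name` and `plugin_file_name` are hypothetical, and the check assumes that "alphanumeric, lowercase" means ASCII letters a-z and digits only.

```python
import re

# Assumption: the service name part (e.g. "onedrive") may contain only
# lowercase ASCII letters and digits, as the naming rule suggests.
_NAME_RE = re.compile(r"[a-z0-9]+")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if `name` consists solely of lowercase letters and digits."""
    return _NAME_RE.fullmatch(name) is not None


def plugin_file_name(name: str) -> str:
    """Build the core plugin file name, mirroring onedrive_plugin.py."""
    if not is_valid_plugin_name(name):
        raise ValueError(f"invalid connector name: {name!r}")
    return f"{name}_plugin.py"
```

For example, `plugin_file_name("onedrive")` yields `onedrive_plugin.py`, while `"OneDrive"` or `"one-drive"` would be rejected.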

...

This file specifies general information about the plugin, such as its title, description, and category. It also specifies which files are loaded for authorization, scheduling, and so on.

...

mappings.json

Code Block
languagejson
{
    "map_id": "id",
    "map_title": "name",
    "map_created_at": "createdDateTime",
    "map_file_name": "name",
    "map_file_mime": "file.mimeType",
    "map_file_data": "content",
    "map_url": "webUrl",
    "facets_file": "facets.json"
}


This file maps the fields coming from the source to the corresponding Squirro item fields. It can also reference the file that contains the facets.
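To make the idea of a field mapping concrete, here is a minimal sketch of how "map_*" entries could be applied to a source record. It is an illustration only, not how the data loader is implemented: the helpers `resolve_path` and `apply_mapping` are hypothetical, the sample record is invented, and whether a plugin emits nested or already-flattened keys is an assumption made here.

```python
from typing import Any, Dict, Optional


def resolve_path(record: Dict[str, Any], dotted_path: str) -> Optional[Any]:
    """Follow a dotted path such as "file.mimeType" through nested dicts."""
    current: Any = record
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build an item dict from the "map_*" entries of a mappings dict."""
    item: Dict[str, Any] = {}
    for key, source_field in mapping.items():
        if not key.startswith("map_"):
            continue  # e.g. "facets_file" is not a field mapping
        value = resolve_path(record, source_field)
        if value is not None:
            item[key[len("map_"):]] = value
    return item


mapping = {
    "map_id": "id",
    "map_title": "name",
    "map_file_mime": "file.mimeType",
    "map_url": "webUrl",
    "facets_file": "facets.json",
}
# Hypothetical source record, shaped loosely like a OneDrive item.
record = {
    "id": "01ABC",
    "name": "report.pdf",
    "file": {"mimeType": "application/pdf"},
    "webUrl": "https://example.com/report.pdf",
}
item = apply_mapping(record, mapping)
```

Here `item` ends up with the keys `id`, `title`, `file_mime`, and `url`; source fields missing from the record are simply skipped.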

facets.json

Code Block
languagejson
{
    "createdBy.user.displayName": {
        "name": "creator",
        "display_name": "Creator",
        "visible": true,
        "searchable": true,
        "typeahead": true,
        "analyzed": true
    }
}

This file defines the facets used by the plugin. More information about facets: Managing Facets

Pay close attention to how you set up the facets.

  • Each added facet increases the index size and slows down query times.

  • Use only a few facets that are useful for the end user, and keep them consistent across plugins.

  • If a facet is only needed for filtering or simple aggregations, set analyzed to false to reduce the size and performance impact.

  • Check whether similar facets are already used in existing plugins and follow the same naming convention. For example, an Author or Owner facet in a new plugin could be treated as Creator, as in other plugins.
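The guidelines above lend themselves to a small review script. The sketch below is only one possible interpretation: the function `review_facets`, the `max_facets` threshold, the `PREFERRED_NAMES` table, and the heuristic that a facet which is neither searchable nor typeahead probably does not need analysis are all assumptions, not Squirro rules.

```python
from typing import Any, Dict, List

# Assumption: preferred names for concepts that appear under several labels,
# based on the Author/Owner vs. Creator example in the text.
PREFERRED_NAMES = {"author": "creator", "owner": "creator"}


def review_facets(facets: Dict[str, Dict[str, Any]], max_facets: int = 5) -> List[str]:
    """Return human-readable warnings for a facets.json-style dict."""
    warnings: List[str] = []
    if len(facets) > max_facets:
        warnings.append(
            f"{len(facets)} facets defined; consider keeping at most {max_facets}"
        )
    for source_field, spec in facets.items():
        name = spec.get("name", source_field)
        preferred = PREFERRED_NAMES.get(name.lower())
        if preferred is not None:
            warnings.append(f"facet {name!r}: consider the existing name {preferred!r}")
        # Heuristic (assumption): only facets that explicitly set analyzed=true
        # while being neither searchable nor typeahead are flagged.
        used_for_text = spec.get("searchable", False) or spec.get("typeahead", False)
        if spec.get("analyzed", False) and not used_for_text:
            warnings.append(f"facet {name!r}: analyzed could probably be false")
    return warnings


facets = {
    "createdBy.user.displayName": {
        "name": "author",
        "analyzed": True,
        "searchable": False,
        "typeahead": False,
    }
}
for message in review_facets(facets):
    print(message)
```

Running such a check before uploading a plugin gives a quick reminder of the guidelines, but the final decision on each facet remains a judgment call.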

pipeline_workflow.json

Code Block
languagejson
{
    "steps": [
        {
            "config": {
                "policy": "replace"
            },
            "id": "deduplication",
            "name": "Deduplication",
            "type": "deduplication"
        },
        {
            "config": {
                "fetch_link_content": false
            },
            "id": "content-augmentation",
            "name": "Content Augmentation",
            "type": "content-augmentation"
        },
        {
            "config": {},
            "id": "content-conversion",
            "name": "Content Extraction",
            "type": "content-conversion"
        },
        {
            "id": "language-detection",
            "name": "Language Detection",
            "type": "language-detection"
        },
        {
            "id": "cleanup",
            "name": "Content Standardization",
            "type": "cleanup"
        },
        {
            "id": "index",
            "name": "Indexing",
            "type": "index"
        },
        {
            "id": "cache",
            "name": "Cache Cleaning",
            "type": "cache"
        }
    ]
}

...

The file lists all the required packages, with one package dependency per line. More information about data loader dependencies: Data loader plugin dependencies

...

e file:

Code Block
languagepy
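# Fetch the injected service configuration and set up the OAuth2 library with it.
config = get_injected("config")
auth_tools.configure_oauth2_lib(config)

# Read the OAuth2 client credentials from the "dataloader" section;
# both fall back to None when the option is not set.
client_id = config.get("dataloader", "onedrive_client_id", fallback=None)
client_secret = config.get("dataloader", "onedrive_client_secret", fallback=None)

# Warn early if either credential is missing.
if not client_id or not client_secret:
    log.warning("Client keys are missing in %s plugin", target_name)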

...

Code Block
languagepy
def getJobId(self) -> str:
    """Generate a stable ID that changes with the main parameters."""
    m = hashlib.blake2b(digest_size=20)
    for v in (
        __plugin_name__,
        __version__,
        self.arg_index_all,
        self.arg_file_size_limit,
        self.arg_batch_size_limit,
        self.arg_download_media_files,
        self.arg_access_id,
    ):
        m.update(repr(v).encode())
    job_id = base64.urlsafe_b64encode(m.digest()).rstrip(b"=").decode()
    log.debug("Job ID: %r", job_id)
    return job_id

Since end users can configure multiple data sources with this plugin, we need to be able to distinguish them. The key reason is to keep the state and caching information of each instance separate. Therefore, this method defines a unique ID for the actual job. To make the ID unique, as many custom parameters as possible should be used. The common approach is to use all arguments that can be set by the user. For a one-click connector, passing the refresh token or access token (arg_access_id in the above code) is one of the best ways to provide a unique argument for generating a stable ID.
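The effect of including a per-user value in the hash can be demonstrated in isolation. The following self-contained sketch reuses the hashing approach of getJobId outside of a plugin class; the function name `stable_job_id` and all parameter values are hypothetical.

```python
import base64
import hashlib
from typing import Any, Iterable


def stable_job_id(params: Iterable[Any]) -> str:
    """Hash the repr() of each parameter into a short URL-safe ID."""
    m = hashlib.blake2b(digest_size=20)
    for value in params:
        m.update(repr(value).encode())
    return base64.urlsafe_b64encode(m.digest()).rstrip(b"=").decode()


# Hypothetical parameter values for two sources that differ only in the token.
source_a = ("onedrive", "1.0.0", True, 50, 100, False, "token-of-user-a")
source_b = ("onedrive", "1.0.0", True, 50, 100, False, "token-of-user-b")

assert stable_job_id(source_a) == stable_job_id(source_a)  # same input, same ID
assert stable_job_id(source_a) != stable_job_id(source_b)  # token separates them
```

Two sources with otherwise identical settings still get separate IDs, and therefore separate state and cache, as long as one user-specific argument differs.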

...