This tutorial explains how to configure Known Entity Extraction (KEE) using the Known Entity Extraction studio plugin, which is available in the user interface under the AI STUDIO tab of the Setup space.

Overview

Known Entity Extraction is Squirro's technology for enriching unstructured data by linking it to structured information. Examples include identifying company names or products in indexed Squirro data.

Configuration

To set up a KEE, open the Setup space in your Squirro project, and navigate to AI STUDIO → Known Entity Extraction. Press the plus button in the top right corner to configure a new KEE.

You can specify the following configuration options. Only the options marked with an asterisk (*) are mandatory. The KEE config.json name in the table refers to the internal config key as documented in the KEE Config Reference.

Configuration

Description

KEE config.json name

Name *

The name of the KEE enrichment. The name must be unique across the entire server; an existing enrichment with the same name is overwritten.

kee.pipelet

KEE data *

CSV or Excel file containing the structured information. The first row must contain column headers. The columns are referred to as fields in the configuration options below.

The CSV or Excel file must not contain duplicate rows.

sources[<source_name>].source_file

(source_name defaults to "upload_source")
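The two file requirements above (a header row and no duplicate rows) can be checked before uploading. The following is a minimal sketch for the CSV case only and is not part of Squirro: the function name check_kee_csv and the sample data are hypothetical, and it assumes that two rows count as duplicates when every cell matches after trimming whitespace.

```python
import csv
import io


def check_kee_csv(text):
    """Return a list of problems found in CSV text intended as KEE data."""
    rows = list(csv.reader(io.StringIO(text)))
    # The first row must hold the column headers.
    if not rows or not any(cell.strip() for cell in rows[0]):
        return ["missing header row"]
    problems = []
    seen = {}  # normalized row -> line number of its first occurrence
    for line_no, row in enumerate(rows[1:], start=2):
        key = tuple(cell.strip() for cell in row)
        if key in seen:
            problems.append(f"row {line_no} duplicates row {seen[key]}")
        else:
            seen[key] = line_no
    return problems


# Hypothetical sample data: the last row repeats the first data row.
sample = "id,Name,industry\n1,Apple,Technology\n2,Nestle,Food\n1,Apple,Technology\n"
print(check_kee_csv(sample))  # ['row 4 duplicates row 2']
```

An empty result list means the sketch found no problems; it does not guarantee that the upload will be accepted.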

ID field

Field that is used as the unique ID of each record. IDs are auto-generated when this is left empty.

sources[<source_name>].field_id

sources[<source_name>].generate_id

Matching fields

Fields from the input KEE data on which the match is executed. Typically this is the name field, for example the field holding the company name.

sources[<source_name>].field_matching

Keywords to assign

Fields for which you want to assign keywords (facets) and tag matched items. Provide each field on a separate line. Use the arrow (->) notation to give the keyword a different name than the field.

For example:

Name -> company
industry

This keyword configuration assigns the Name field from the source data to the keyword company. The field industry is assigned to the keyword industry.

Note that a keyword is created automatically if it does not yet exist.

strategies[<strategy_name>].keywords

(strategy_name defaults to "basic")
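The arrow notation described for Keywords to assign can be read as a mapping from source field to keyword name. The sketch below only illustrates that reading; parse_keyword_lines is a hypothetical helper, not the parser Squirro uses, and the handling of blank lines and surrounding whitespace is an assumption.

```python
def parse_keyword_lines(text):
    """Map each source field to the keyword name it should be assigned to."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        field, arrow, keyword = line.partition("->")
        field = field.strip()
        # Without an arrow, the keyword takes the name of the field.
        mapping[field] = keyword.strip() if arrow else field
    return mapping


print(parse_keyword_lines("Name -> company\nindustry"))
# {'Name': 'company', 'industry': 'industry'}
```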

Minimum score for matches

The minimum score at which a match is considered. Can be any value between 0 and 1, such as 0.5, 0.9 or 1.0.

strategies[<strategy_name>].min_score

Enable fuzzy matching

Allows small spelling mistakes. At most one letter swap is tolerated, so for example "Apple" and "Appel" match each other.

strategies[<strategy_name>].spellfix
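To make the "one letter swap" rule concrete, here is a small sketch that treats a swap as two adjacent letters exchanging places, which is what the "Apple"/"Appel" example suggests. This interpretation is an assumption; the actual spellfix behaviour may be broader or narrower, and both function names are hypothetical.

```python
def is_single_swap(a, b):
    """True if b equals a with exactly one pair of adjacent letters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions at which the two strings differ.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def fuzzy_equal(a, b):
    """Exact match, or a match that differs by one adjacent letter swap."""
    return a == b or is_single_swap(a, b)


print(fuzzy_equal("Apple", "Appel"))  # True
print(fuzzy_equal("Apple", "Apply"))  # False
```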

Enable company suffix list

Uses a company-specific suffix list to remove common company suffixes when matching company names.

strategies[<strategy_name>].suffix_list
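The effect of a company suffix list can be sketched as follows. The suffixes shown are hypothetical examples chosen for illustration; the list that ships with KEE is not reproduced here, and strip_company_suffix is not a Squirro function.

```python
# Hypothetical suffix list, for illustration only.
COMPANY_SUFFIXES = ("inc", "ltd", "ag", "gmbh", "corp")


def strip_company_suffix(name):
    """Remove trailing company suffixes such as "Inc." from a company name."""
    words = name.split()
    # Always keep at least one word so a name is never emptied.
    while len(words) > 1 and words[-1].rstrip(".").lower() in COMPANY_SUFFIXES:
        words.pop()
    return " ".join(words).rstrip(",")


print(strip_company_suffix("Apple Inc."))   # Apple
print(strip_company_suffix("Apple, Inc."))  # Apple
```

With suffixes removed on both sides, a text mention of "Apple" can match a source row that lists "Apple Inc.".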

Enable ngram database

Enables a default ngram database to improve matching precision for common English terms.

strategies[<strategy_name>].ngram

ngram[default]

(The ngram name is always default)

Config (JSON)

JSON dictionary to customize configuration values. See example below.
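The names in the KEE config.json column appear to describe nested JSON keys, judging by the nested example in the Customisation section. As a rough sketch under that assumption, the snippet below turns a name from the table into a list of dictionary keys and looks it up in a config dictionary. The helpers key_path and lookup are hypothetical, and the config values are placeholders.

```python
import re

# Default names as stated in the table above.
DEFAULT_NAMES = {"source_name": "upload_source", "strategy_name": "basic"}


def key_path(notation):
    """Turn 'strategies[<strategy_name>].min_score' into a list of dict keys."""
    # Replace each <placeholder> with its documented default name.
    filled = re.sub(r"<(\w+)>", lambda m: DEFAULT_NAMES[m.group(1)], notation)
    # Split on dots and square brackets, dropping empty pieces.
    return [part for part in re.split(r"[.\[\]]", filled) if part]


def lookup(config, path):
    """Follow a list of keys through nested dictionaries."""
    node = config
    for key in path:
        node = node[key]
    return node


# Placeholder config; only the nesting matters here.
config = {"kee": {"version": "2"}, "strategies": {"basic": {"min_score": 0.9}}}

print(key_path("strategies[<strategy_name>].min_score"))
# ['strategies', 'basic', 'min_score']
print(lookup(config, key_path("strategies[<strategy_name>].min_score")))
# 0.9
```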

Limitations

The following limitations are known for this UI integration of KEE.

  • Single source only. The advanced version of KEE (on the command line) supports multiple KEE sources in one KEE configuration. This is not supported in the user interface, so only one source at a time can be included. It is easy to work around this by creating multiple KEE configurations, though.

  • Whenever a Known Entity Extraction is edited, the original file has to be uploaded again. The file is not currently persisted in its raw form on the server.

  • The created pipelet cannot be removed. Even when the KEE definition is removed, the pipelet stays around.

  • Only items indexed after the KEE enrichment has been set up are tagged. That is a general limitation of all enrichments and can be worked around by using the Rerun functionality.

Customisation

The Config (JSON) field can be filled with a KEE JSON configuration dictionary. If it is defined, then all the configuration values mentioned above in the Configuration section are overwritten, but otherwise the config is used as is. This allows for advanced customisations.

For example, the following configuration can be used to provide custom versioning and filters, clean keywords before rerunning, and specify the item_fields and keywords/facets to run on.

{
    "kee": {
        "version": "2",
        "version_keyword": "kee_companies"
    },
    "strategies": {
        "basic": {
            "filters": ["camelcase", "lowercase", "initials"],
            "clean_keywords": ["company_name", ...]
        }
    },
    "extraction": {
        "item_fields": [
            "title",
            "body",
            "abstract",
            "summary",
            "keywords.your_facet"
        ]
    }
}
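Before pasting a dictionary into the Config (JSON) field, it can help to check locally that it parses. The sketch below assumes the field expects strict JSON, which this page does not state; under strict JSON, trailing commas and placeholders such as ... are rejected. The function name check_config_json is hypothetical.

```python
import json


def check_config_json(text):
    """Return (config, None) if text is a JSON object, else (None, error)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(config, dict):
        return None, "top-level value must be a JSON object"
    return config, None


good = '{"kee": {"version": "2", "version_keyword": "kee_companies"}}'
bad = '{"kee": {"version": "2"},}'  # trailing comma

print(check_config_json(good)[1])  # None
print(check_config_json(bad)[1])   # an error message with line and column
```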
