Known Entity Extraction (KEE) is configured using a configuration file in the root of the KEE project. That file is called config.json and is written in the JSON format.

Format

JSON

The format used by the config file is JSON, or rather a more human-friendly superset called Hjson. First, an example of a fully valid JSON file which is also valid Hjson:

Code Block
languagejs
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "demo.csv"
        }
    },

    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }
}

Hjson

With Hjson it is possible to write this file in a more human-friendly way. The main improvements are comments and trailing commas (neither of which is allowed in standard JSON). Additionally, most of the quotes can be left out if so desired.

Code Block
languagejs
{
    sources: {
        demo: {
            // The data is in a CSV file in the current directory
            "source_type": "csv",
            "source_file": "demo.csv",
        },
    },

    /* Just repeating the default settings for testing here */
    testing: {
        fixtures_dir: "fixtures/",
        snapshots_dir: "snapshots/",
    }
}

A full reference of Hjson is available on hjson.org.
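
The difference from strict JSON can be checked with any standard JSON parser. The following sketch uses only Python's built-in json module (not an Hjson library) to show that the trailing commas permitted by Hjson are rejected by standard JSON. The helper name is_strict_json is made up for this illustration.

```python
import json


def is_strict_json(text):
    """Return True if the text parses as standard JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


strict = '{"testing": {"fixtures_dir": "fixtures/"}}'
relaxed = '{"testing": {"fixtures_dir": "fixtures/",},}'

print(is_strict_json(strict))   # valid JSON and valid Hjson
print(is_strict_json(relaxed))  # trailing commas: accepted by Hjson only
```

A file that fails this check may still be a perfectly valid KEE configuration, as long as it is valid Hjson.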

Structure

In the examples above the curly brackets at the root of the file open a new dictionary. Within each dictionary there are a number of key / value pairs. For example, the keys of the top level dictionary shown above are sources and testing.

Each entry in the top-level dictionary indicates a different section of the KEE configuration. This reference describes the usage of each of the different sections below.

Reference

KEE

The kee section configures a few basic parameters of the whole process. The full section is optional.

...

The name of the pipelet when uploading the KEE project to the Squirro server.

This should be unique across your KEE projects to avoid any collision.

...

File name, relative to the configuration file directory, where the lookup database is located.

Default: "db/lookup.json"

...

A version indicator for the current KEE configuration. This can be modified whenever the strategy or the data changes significantly, thus warranting a re-tagging of previous items.

The kee rerun command makes use of this information to select older items for re-tagging.

Default: not set

...

Name of the keyword to use for version tagging on items. If the version is set, then it is written into a keyword of this name on the item.

Default: "KEE Version"

...

If true, a few debug options are enabled. For example the lookup database is written in a more human-readable format.

Default: false

Example usage:

Code Block
languagejs
{
    "kee": {
        "database": "db/product_list.json",
        "debug": false
    }
}

Squirro

To add a KEE project to a Squirro project, some of the cluster information needs to be configured in the squirro section.

See Connecting to Squirro for how to get this information.

This section is optional if you are not using the commands that connect to Squirro (get_fixture or upload).

...

The endpoint where Squirro has been installed.

...

Example usage:

Code Block
languagejs
{
    "squirro": {
        "cluster": "https://www.mysquirrocluster.com",
        "token": "MY-ACCESS-TOKEN",
        "project_id": "MY-PROJECT-ID" 
    }
}

Sources

The sources section in the configuration contains a list of all the data sources for building the lookup database.

Each entry is in itself an object with the required configuration for the source. A partial example:

Code Block
languagejs
{
    "sources": {
        "clients": {
            // The keys from the reference below come here
        },
        "employees": {
            // The keys from the reference below come here
        },
        … 
    }
}

Each source takes a configuration with two parts. The first part is the actual data connection, for which the Data Loader options, including Data Loader Plugins, can be used. The second part is the KEE behaviour for that particular source.

...

Which source type to connect to. Valid values:

csv, excel, database.

...

Additional connection options are specified as key/value pairs as well.

Use the same options as for the data loader, with the dashes replaced with underscores. For example if the data loader is invoked with squirro_data_load --source-type csv --source-file test.csv, then the source configuration in KEE is:

Code Block
languagejs
{
    "source_type": "csv",
    "source_file": "test.csv",
}

Check the Data Loader Reference for all the possible options. Plugin-specific options are documented in each plugin.
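
The renaming rule can be sketched in a few lines of Python. This is only an illustration of the dash-to-underscore convention described above. The function loader_args_to_config is hypothetical, is not part of KEE or the data loader, and only handles the simple "--option value" form.

```python
def loader_args_to_config(args):
    """Convert data loader style arguments into KEE source keys.

    Only handles the simple "--option value" form used in the example.
    """
    config = {}
    for option, value in zip(args[0::2], args[1::2]):
        # Drop the leading dashes, then turn inner dashes into underscores.
        key = option.lstrip("-").replace("-", "_")
        config[key] = value
    return config


args = ["--source-type", "csv", "--source-file", "test.csv"]
print(loader_args_to_config(args))
```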

...

Specifies a hierarchy in the data. This hierarchy can be used in the tagging, for example to tag an item with the matching company and all of its parent companies as well.

The format of this configuration is Parent Column -> Child Column.

See the hierarchy section for examples.

...

The default separator for multiple values in the source data is the pipe (|) but this can be changed by specifying the separator after a colon.

Example:

Code Block
languagejs
{
    "sources": {
        "companies": {
            // Other keys omitted for clarity
            …
            "multivalue": [
                "Aliases",  // Aliases are multiple values, separate with pipe |
                "Sectors:,",  // Sectors are multiple values, separate with comma ,
            ]
        }
    }
}
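
How such an entry might be interpreted can be sketched as follows. This is an assumption-laden illustration, not the KEE implementation: the function names are made up, and trimming whitespace around the values is a choice made here for readability.

```python
DEFAULT_SEPARATOR = "|"


def parse_multivalue(spec):
    """Split a multivalue entry into (column, separator)."""
    column, _, separator = spec.partition(":")
    # Without a colon the default pipe separator applies.
    return column, separator or DEFAULT_SEPARATOR


def split_values(row, spec):
    """Return the individual values of a multivalue column."""
    column, separator = parse_multivalue(spec)
    return [v.strip() for v in row[column].split(separator) if v.strip()]


row = {"Aliases": "Googl|Goog", "Sectors": "Search, Advertising"}
print(split_values(row, "Aliases"))
print(split_values(row, "Sectors:,"))
```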

Testing

The optional testing section specifies where the test files and snapshots are located. See KEE Testing for details on the testing process.

Again a partial example for this:

Code Block
languagejs
{
    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }

    … // Other keys omitted
}

The valid configuration keys in this section are:

Key | Data Type | Description
fixtures_dir | String | Folder name, relative to the configuration file directory, where the test fixtures are located. Default: "fixtures"
snapshots_dir | String | Folder name, relative to the configuration file directory, where the snapshots are stored. Default: "snapshots"

Extraction

The extraction section configures how Squirro items are processed during the KEE process.

...

Which item fields to use for Known Entity Extraction. See Item Format for possible values.

Default: ["title", "body"]

Strategies

A strategy in KEE defines how the mapping is executed and which keywords should be added to Squirro items upon a successful match.

The matching is modified depending on a number of factors, including:

  • How precise should the matching be? In certain cases false positives (matches that shouldn't have been made) are a smaller problem than false negatives (matches that were missed). In other cases it's the opposite.
  • The format of the incoming data also matters. If the input data is all uppercase (as often happens in legacy data) the KEE matching has less precision to work with. If the input data is known to be of high quality, then signals like camel case can be taken into account.
  • The data domain. For example company names have many suffixes that are often not spelled out in common language (e.g. Inc., Limited, (Pty) Ltd, etc.).

The strategy is referenced by name in the source configuration and then looked up under that name in the strategies configuration. The following incomplete example references a strategy called "companies" that is correspondingly defined in the strategies section.

Code Block
languagejs
{
    // Most keys omitted for brevity

    "sources": {
        "clients": {
            "source_type": "csv",
            "source_file": "…",
            "strategy": "companies"
        },
    },

    "strategies": {
        "companies": {
            "tokenizer": "default",
            // …
        },
    },
}

The following table describes all the configuration keys with which a strategy can be tailored to specific requirements.

...

For processing the text input, the text is split into individual tokens. The tokenizer and the filters specify how this is done.

Supported tokenizers:

  • default
  • brackets

Please refer to KEE Tokenizers and Filters for details on the tokenizers.

...

Together with the tokenizer, the filters specify how text is matched. The filters influence how much leniency is applied when matching and make sure that different spellings of a word can still be matched.

Available filters are:

  • camelcase
  • initials
  • lowercase
  • singular
  • accents
  • stem

Default: by default only the lowercase filter is applied.

Please refer to KEE Tokenizers and Filters for details on the filters. That section also explains how to create custom filters.

...

How good a score is required for a token to match. 1.0 is a perfect match, 0.0 is no match at all.

Use KEE Testing to find the right balance for each use case.

Turn on verbose logging or tracing (see the --trace argument of kee test) to see the score that tokens receive.

Default: 0.9

...

Allow small spelling mistakes. This allows at most one letter swap, so e.g. "Apple" and "Appel" will match each other.

Default: false
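
What "one letter swap" means can be illustrated with a small check. This sketch assumes the swap refers to two adjacent letters changing places. The actual KEE matching logic is not documented here and may differ, and the function name is made up.

```python
def is_single_swap(a, b):
    """Return True if b equals a with one pair of adjacent letters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


print(is_single_swap("Apple", "Appel"))  # adjacent "l" and "e" swapped
print(is_single_swap("Apple", "Apply"))  # a substitution, not a swap
```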

...

A list of entity names to ignore. If any of the match_field (see sources) column values is contained in this list, the entity is never tagged.

...

How to deal with geographic names in entity names. Possible values:

  • noop (the default): Don't handle geographic names at all.
  • ignore: Detect common geographic names and ignore them for matching. This especially affects trailing geographic names and means that a company designation like "Acme Inc" is matched the same way as e.g. "Acme Inc - Switzerland".

...

The keywords section defines which keywords are added to a Squirro item based on any matching entity. This is a list of keywords that can be added, where each individual entry contains the input file column to write and the keyword name into which to store it.

The target value can make use of simple template substitution to add keyword names based on the data of the matching row. The syntax is a field name surrounded by curly brackets.

Example:

Code Block
languagejs
…
    "keywords": [
        "Name",  // Takes the "Name" column and writes it into a "Name" keyword
        "Id -> Company ID",  // Writes the "Id" column into a "Company ID" keyword
        // Adds the "Id" a second time. If the "Type" of the match is e.g. "SME" then
        // this adds the keyword "SME ID".
        "Id -> {Type} ID",
    ]
…
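
The substitution rules from the example can be sketched in Python. This is an illustration only: expand_keyword is a hypothetical helper, and it assumes that an entry without "->" uses the column name as the keyword name, as the first comment in the example describes.

```python
def expand_keyword(rule, row):
    """Return (keyword name, value) for one keywords entry."""
    source, arrow, target = rule.partition("->")
    source = source.strip()
    # Without "->" the column name doubles as the keyword name.
    target = target.strip() if arrow else source
    # Fields in curly brackets are filled from the matching row.
    return target.format(**row), row[source]


row = {"Id": "42", "Name": "Acme", "Type": "SME"}
for rule in ["Name", "Id -> Company ID", "Id -> {Type} ID"]:
    print(expand_keyword(rule, row))
```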

...

The same setting as keywords but for all the parents.

This recursively processes all parent entities (if any) and sets keywords on the item based on these rules.

For this to work there must be a hierarchy parameter on the data source. See the hierarchy section for examples.

...

A list of keywords that should be removed from the items before applying the KEE tagging.

This is useful when re-running KEE tagging to ensure that old keywords are removed.

Example:

Code Block
languagejs
…
    "clean_keywords": ["Name", "Company ID"],

    "keywords": [
        "Name", "Id -> Company ID",
    ]
…

Suffix list

The suffix list is used by the strategy to ignore common suffixes. Examples for such suffixes:

  • Companies: Names for enterprises legally end with Inc., Pty, AG, a.s., …
  • Stock ticker symbols: These often also include the stock exchange, such as NASDAQ, NYSE, etc.

These suffixes may often be omitted when writing about said entities and thus they can be ignored for matching.

To create a custom suffix list, add it to the suffix_list section and define the various patterns as key/value pairs. The keys are currently ignored by the KEE extraction and can be used to group the tokens into countries or other logical groupings.

An example:

Code Block
languagejs
{
    // Most keys omitted for clarity

    "strategies": {
        "orgs": {
            "suffix_list": "companies",
        }
    },

    "suffix_list": {
        "companies": {
            "GLOBAL": ["Inc", "Limited"],
            "DEU": ["AG", "GmbH"],
            "ZAF": ["(Pty) Ltd", "LIMITED"],
            "INDUSTRY": ["Bank"],
        }
    },
}
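
The effect of a suffix list can be illustrated with a simplified sketch. It is not the KEE implementation: the function name is made up, the comparison is assumed to be case-insensitive, and only a single trailing suffix is removed.

```python
def strip_suffix(name, suffix_list):
    """Remove one trailing suffix from an entity name, if present."""
    # The group keys (GLOBAL, DEU, ...) are ignored, only the values count.
    suffixes = [s for group in suffix_list.values() for s in group]
    # Try longer suffixes first so "(Pty) Ltd" is preferred over shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if name.lower().endswith(" " + suffix.lower()):
            return name[: -len(suffix)].rstrip()
    return name


companies = {
    "GLOBAL": ["Inc", "Limited"],
    "DEU": ["AG", "GmbH"],
    "ZAF": ["(Pty) Ltd", "LIMITED"],
    "INDUSTRY": ["Bank"],
}

print(strip_suffix("Acme Inc", companies))
print(strip_suffix("Acme (Pty) Ltd", companies))
```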

Language handling (ngram)

For improved KEE matching, Squirro can make use of a language model. That allows the matching to handle common words in entity names correctly. Two simple examples will show the possibilities:

  • The Capital Group: this is a company that contains only quite common words. So if a text talks about "the capital" it shouldn't match this company yet. But when the name is fully spelled out, then the entity should match.
  • Carrefour Group: The word "Carrefour" is sufficiently rare in common language usage that just writing "Carrefour" in a text should be enough to match this entity.

The required frequency model is created using the ngram definition. Please contact support to get access to Squirro's pre-compiled language models.

An example configuration for ngram is as follows:

Code Block
languagejs
{
    // Most keys omitted for clarity

    "strategies": {
        "orgs": {
            "ngram": "companies",
        }
    },

    "ngram": {
        "companies": {
            "source": "ngram/",
            "whitelist": ["Apple"],
        }
    },
}

The following is a reference of all of the keys in an individual ngram section.

...

Folder name, relative to the configuration file directory, where the ngram database is located.

Please contact support to get access to Squirro's pre-compiled language models.

...

Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item that is being processed, the model for the default language is used instead.

Default: en
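
The fallback behaviour can be summarised in a short sketch. The function name and the idea of passing the available model languages as a set are assumptions made for this illustration.

```python
def pick_model_language(item_language, available_models, default="en"):
    """Choose which language model to use for an item."""
    if item_language in available_models:
        return item_language
    # No model for the item language: fall back to the default language.
    return default


available = {"en", "de"}
print(pick_model_language("de", available))
print(pick_model_language("fr", available))
```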

...

A list of company names for which ngram correction is not done. This can be useful in some corner cases where a lax match is desired, even though a company name is penalized by the language model.

Example:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "whitelist": ["Apple"],
    }
},

...

A list of prefixes that should be treated as common language terms. This can be used to override the language model to be more strict about certain words. This is sometimes necessary to correct imprecise matching for prefix words from other languages. For example "Svensk" is the Swedish word for "Swedish". So any company that starts with "Svensk" may just be saying "Swedish Acme Corp." and thus this shouldn't yet match just based on "Svensk" in any text.

The following example snippet takes care of this problem:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "common": ["Svensk"],
    }
},

Environment Variables

The settings in the squirro section can also be set through environment variables. That is especially helpful to avoid writing the token into the config file.

This reference cannot go into detail on how to set environment variables. Please consult the documentation of your system, such as Bash or Windows PowerShell.

The environment variables that are respected are:

  • SQ_CLUSTER
  • SQ_TOKEN
  • SQ_PROJECT_ID
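
How these variables could relate to the squirro section is sketched below. Two things are assumptions here and are not confirmed by this reference: the mapping of each variable to the config key of the same name, and the environment taking precedence over the config file. The function name is hypothetical.

```python
import os

# Assumed mapping from environment variable to squirro section key.
ENV_TO_KEY = {
    "SQ_CLUSTER": "cluster",
    "SQ_TOKEN": "token",
    "SQ_PROJECT_ID": "project_id",
}


def resolve_squirro_settings(config, environ=None):
    """Merge the squirro section with the SQ_* environment variables."""
    environ = os.environ if environ is None else environ
    settings = dict(config.get("squirro", {}))
    for variable, key in ENV_TO_KEY.items():
        if variable in environ:
            # Assumption: the environment wins over the config file.
            settings[key] = environ[variable]
    return settings


config = {"squirro": {"cluster": "https://www.mysquirrocluster.com",
                      "project_id": "MY-PROJECT-ID"}}
print(resolve_squirro_settings(config, {"SQ_TOKEN": "MY-ACCESS-TOKEN"}))
```

With this approach the token never has to appear in config.json.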

Custom KEE Pipelet

The KEE pipelet can be extended to allow further control over the KEE processing of the items. The following template can be used:

Code Block
languagepy
import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
    def consume(self, item):
        # Pre-processing of item goes here

        # KEE processing
        item = super(CustomKEEPipelet, self).consume(item)

        # Post-processing of item goes here

        return item

Examples

The following sections give a few examples for how to achieve common use cases.

Hierarchy

Hierarchies are created using the hierarchy setting on a source. Tagging of hierarchies is achieved using the parent_keywords setting in the strategy.

The input data here is a CSV file with the following contents (top 3 lines only):

Code Block
languagetext
titlehierarchy.csv
Id,Name,Aliases,ParentId
1,Apple Inc.
2,Google Inc.,Googl|Goog,3
3,Alphabet Inc.,abc.xyz

The KEE configuration file that makes full use of this data can look as follows:

Code Block
languagejs
titleconfig.json
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",

            "strategy": "demo",
            "multivalue": "Aliases",
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id",
        }
    },

    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Keywords for the matching entity and for its parents
            "keywords": "Name -> Name",
            "parent_keywords": "Name -> Parent Name",
        },
    },
} 
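
The recursive parent lookup that parent_keywords relies on can be pictured with the following sketch. It works on in-memory rows modelled after hierarchy.csv. The function name and the guard against cyclic references are assumptions for illustration, not a description of KEE internals.

```python
def parent_names(rows, entity_id, id_field="Id", parent_field="ParentId"):
    """Collect the names of all ancestors of an entity, nearest first."""
    by_id = {row[id_field]: row for row in rows}
    names = []
    seen = {entity_id}  # protects against cyclic parent references
    current = by_id[entity_id].get(parent_field)
    while current and current in by_id and current not in seen:
        seen.add(current)
        names.append(by_id[current]["Name"])
        current = by_id[current].get(parent_field)
    return names


rows = [
    {"Id": "1", "Name": "Apple Inc."},
    {"Id": "2", "Name": "Google Inc.", "ParentId": "3"},
    {"Id": "3", "Name": "Alphabet Inc."},
]
print(parent_names(rows, "2"))
```

With the configuration above, an item matching "Google Inc." would receive the "Name" keyword for the match itself and a "Parent Name" keyword for each ancestor found this way.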

Custom KEE Pipelet

You might want to exclude part of the item body when running KEE. This can be achieved in a custom KEE pipelet by modifying the item before and after running KEE. This example runs on the first 100 words in the body.
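
The pre- and post-processing steps can be sketched independently of the Squirro SDK. In this illustration run_kee stands in for the call to the parent class's consume method, items are assumed to be plain dictionaries with a "body" key, and words are assumed to be separated by whitespace. The helper names are made up.

```python
def first_words(text, limit=100):
    """Return the first `limit` whitespace-separated words of the text."""
    return " ".join(text.split()[:limit])


def consume_truncated(item, run_kee, limit=100):
    """Run KEE on a shortened body, then restore the full body."""
    original_body = item.get("body", "")
    item["body"] = first_words(original_body, limit)  # pre-processing
    item = run_kee(item)                              # KEE processing
    item["body"] = original_body                      # post-processing
    return item


def fake_kee(item):
    # Stand-in for the real KEE step: records how many words it saw.
    item["words_seen"] = len(item["body"].split())
    return item


item = {"title": "Demo", "body": "word " * 250}
item = consume_truncated(item, fake_kee)
print(item["words_seen"], len(item["body"].split()))
```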

...

languagepy

...

This page can now be found at KEE Configuration on the Squirro Docs site.