...

Key | Data Type | Description
Matching
tokenizer | String

For processing the text input, the text is split into individual tokens. The tokenizer and the filters specify how this is done.

Supported tokenizers:

  • default: Splits the input on common word boundaries, sentences, etc.
  • brackets: Removes any trailing brackets. This may be useful where entity names carry descriptions in brackets or parentheses that should be ignored, e.g. "Acme Inc. (Parts supplier)".
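
The effect of the brackets tokenizer can be approximated as follows. This is a minimal sketch, not KEE's implementation: the helper name stripTrailingBrackets and the regular expression are assumptions made for illustration, and the real tokenizer may treat nested or repeated bracket groups differently.

```javascript
// Illustrative only: approximates the "brackets" tokenizer described above.
// stripTrailingBrackets is a hypothetical helper, not part of KEE.
function stripTrailingBrackets(name) {
  // Drop one trailing "(...)" or "[...]" group plus surrounding whitespace.
  return name.replace(/\s*[(\[][^)\]]*[)\]]\s*$/, "");
}

const strippedName = stripTrailingBrackets("Acme Inc. (Parts supplier)");
console.log(strippedName);
```

Names without a trailing bracket group pass through unchanged in this sketch.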
filters | List

Together with the tokenizer, the filters specify how text is matched. The filters influence how much leniency is applied when matching and make sure that different spellings of a word can still be matched.

Available filters are:

  • camelcase: Returns one token for each camel case component. Camel case is the practice of mixing upper and lower case letters in the same word (e.g. TechCrunch, JPMorgan). With this filter, matching does not distinguish between writing the components together or separately, so "JP Morgan", "JPMorgan" and "JpMorgan" all correctly match the entity "JPMorgan". This filter must be listed before the lowercase filter.
  • initials: Combines one-letter initials. This way "JP Morgan", "J & P Morgan" and "J.P. Morgan" all have the same effect. This filter must be listed before the lowercase filter.
  • lowercase: Converts the text into lowercase, thus making the matching case-insensitive.
  • singular: A very basic singular filter that works by removing trailing "s" letters from longer words. With this filter, "MacDonald" and "MacDonalds" have the same effect.
  • accents: Normalizes accents and umlauts. With this filter, "Crédit Agricole" and "Credit Agricole" match each other.
  • stem: Uses a Porter stemming filter to reduce a word to its stem (e.g. "waiting" → "wait"). This is generally not recommended for data sources with proper names, but may be useful for generic language concepts.

Default: only the lowercase filter is applied.
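
To see why filter order matters, the following sketch chains simplified stand-ins for the camelcase, accents and lowercase filters. It is an illustration under assumptions, not KEE's code: the function names and the splitting rule (a lower-case letter followed by an upper-case letter) are hypothetical simplifications.

```javascript
// Illustrative stand-ins for three filters; not the KEE implementation.

// Split each token where a lower-case letter is followed by an upper-case one.
function camelcase(tokens) {
  return tokens.flatMap((t) => t.split(/(?<=[a-z])(?=[A-Z])/));
}

// Decompose accented characters and drop the combining marks.
function accents(tokens) {
  return tokens.map((t) => t.normalize("NFD").replace(/[\u0300-\u036f]/g, ""));
}

function lowercase(tokens) {
  return tokens.map((t) => t.toLowerCase());
}

// Apply the filters in the order given, like the entries of a filters list.
function applyFilters(tokens, filters) {
  return filters.reduce((acc, filter) => filter(acc), tokens);
}

const camelFirst = applyFilters(["TechCrunch"], [camelcase, lowercase]);
const lowerFirst = applyFilters(["TechCrunch"], [lowercase, camelcase]);
const noAccents = applyFilters(["Crédit", "Agricole"], [accents, lowercase]);

console.log(camelFirst, lowerFirst, noAccents);
```

When lowercase runs first, the case boundaries that camelcase relies on are already gone, so the token stays in one piece. This is the reason camelcase has to be listed before lowercase.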

min_score | Float

The minimum score a token needs in order to match. 1.0 is a perfect match, 0.0 is no match at all.

Use KEE Testing to find the right balance for each use case.

Turn on verbose logging or tracing (see the --trace argument of kee test) to see the score that tokens receive.

Default: 0.9

spellfix | Boolean

Allows small spelling mistakes. At most one letter swap is tolerated, so e.g. "Apple" and "Appel" match each other.

Default: false
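
The "one letter swap" rule can be pictured with the check below. It is a sketch that assumes a swap means exchanging two adjacent letters; the function name withinOneSwap is hypothetical and the actual spellfix logic in KEE may differ.

```javascript
// Illustrative only: true if a and b are equal or differ by exactly one
// swap of two adjacent letters. Not the KEE implementation.
function withinOneSwap(a, b) {
  if (a === b) return true;
  if (a.length !== b.length) return false;

  // Collect the positions at which the two words differ.
  const diff = [];
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff.push(i);
  }

  // Exactly two neighbouring positions whose letters are exchanged.
  return (
    diff.length === 2 &&
    diff[1] === diff[0] + 1 &&
    a[diff[0]] === b[diff[1]] &&
    a[diff[1]] === b[diff[0]]
  );
}

console.log(withinOneSwap("Apple", "Appel"));
console.log(withinOneSwap("Apple", "Apply"));
```

Under this assumption, "Apple" and "Appel" are accepted, while "Apple" and "Apply" are not, because the latter pair differs by a substitution rather than a swap.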

blacklist | List

A list of entity names to ignore. If any of the match_field column values (see sources) is contained in this list, the entity is never tagged.

suffix_list | String | The suffix list that is used to remove common suffixes in the entity names. See the section suffix list below for details.
geo_strategy | String

How to deal with geographic names in entity names. Possible values:

  • noop (the default): Don't handle geographic names at all.
  • ignore: Detect common geographic names and ignore them for matching. This especially affects trailing geographic names and means that a company designation like "Acme Inc" is matched the same way as e.g. "Acme Inc - Switzerland".
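
The ignore strategy can be sketched as dropping trailing geographic tokens before comparing names. Everything in this example is an assumption for illustration: the GEO_NAMES sample set, the helper stripTrailingGeo and the splitting on whitespace and hyphens are hypothetical, and KEE's own detection of geographic names is not shown in this document.

```javascript
// Illustrative only: a tiny, hypothetical sample of geographic names.
const GEO_NAMES = new Set(["switzerland", "germany", "france"]);

// Remove geographic names from the end of an entity name.
function stripTrailingGeo(name) {
  const tokens = name.split(/[\s-]+/).filter(Boolean);
  // Keep at least one token so a purely geographic name is not emptied.
  while (
    tokens.length > 1 &&
    GEO_NAMES.has(tokens[tokens.length - 1].toLowerCase())
  ) {
    tokens.pop();
  }
  return tokens.join(" ");
}

const plainName = stripTrailingGeo("Acme Inc");
const geoName = stripTrailingGeo("Acme Inc - Switzerland");
console.log(plainName === geoName);
```

With noop, no such step is applied, so the two spellings remain different names.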
Keywords
keywords | List

The keywords section defines which keywords are added to a Squirro item based on any matching entity. This is a list of keywords that can be added, where each entry names the input file column to read and the keyword in which to store its value.

The target value can make use of simple template substitution to add keyword names based on the data of the matching row. The syntax is a field name surrounded by curly brackets.

Example:

…
    "keywords": [
        "Name",  // Takes the "Name" column and writes it into a "Name" keyword
        "Id -> Company ID",  // Writes the "Id" column into a "Company ID" keyword
        // Adds the "Id" a second time. If the "Type" of the match is e.g. "SME" then
        // this adds the keyword "SME ID".
        "Id -> {Type} ID",
    ]
…
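
How such an entry could be resolved against a matching row is sketched below. This is not KEE's parser: the function resolveKeyword, the row object and the choice to substitute an empty string for unknown fields are assumptions made for illustration.

```javascript
// Illustrative only: resolve one keywords entry against a matching row.
// "Column" writes to a keyword of the same name,
// "Column -> Target" writes to Target, where {Field} is replaced by row data.
function resolveKeyword(rule, row) {
  const [column, target = column] = rule.split("->").map((s) => s.trim());
  // Assumption: unknown fields in the template become an empty string.
  const name = target.replace(/\{([^}]+)\}/g, (_, field) => row[field] ?? "");
  return { name, value: row[column] };
}

// Hypothetical row from the input file.
const row = { Name: "Acme Inc", Id: "42", Type: "SME" };

const resolved = ["Name", "Id -> Company ID", "Id -> {Type} ID"].map((rule) =>
  resolveKeyword(rule, row)
);
console.log(resolved);
```

For the sample row, the third entry stores the Id value under the keyword "SME ID", matching the comment in the configuration example.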
parent_keywords | List

The same setting as keywords, but applied to all parent entities.

This recursively processes all parent entities (if any) and sets keywords on the item based on these rules.

For this to work there must be a hierarchy parameter on the data source. See the hierarchy section for examples.

clean_keywords | List

A list of keywords that should be removed from the items before applying the KEE tagging.

This is useful when re-running KEE tagging to ensure that old keywords are removed.

Example:

…
    "clean_keywords": ["Name", "Company ID"],

    "keywords": [
        "Name", "Id -> Company ID",
    ]
…
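
The cleaning step can be pictured as follows. The sketch assumes an item whose keywords are stored as an object mapping keyword names to lists of values; the helper cleanKeywords and the sample item are hypothetical and only illustrate the order of operations (clean first, then tag).

```javascript
// Illustrative only: remove the listed keywords from an item before tagging.
// Returns a new item and leaves the original untouched.
function cleanKeywords(item, cleanList) {
  const keywords = { ...(item.keywords || {}) };
  for (const name of cleanList) {
    delete keywords[name];
  }
  return { ...item, keywords };
}

// Hypothetical item that still carries keywords from an earlier KEE run.
const taggedItem = {
  title: "Quarterly report",
  keywords: { Name: ["Old Corp"], "Company ID": ["7"], Topic: ["Finance"] },
};

const cleanedItem = cleanKeywords(taggedItem, ["Name", "Company ID"]);
console.log(Object.keys(cleanedItem.keywords));
```

Keywords that are not listed, such as Topic in this sample, are left in place; only the stale KEE keywords are dropped before new ones are written.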

Suffix list

The suffix list is used by the strategy to ignore common suffixes. Examples of such suffixes:

...