Comment: Update to 2.4.5 data source usage

...

Code Block
languagejs
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "demo.csv"
        }
    },

    "testing": {
        "fixtures_dir": "fixtures/",
        "snapshots_dir": "snapshots/"
    }
}

...

Code Block
languagejs
{
    sources: {
        demo: {
            // The data is in a CSV file in the current directory
            "source_type": "csv",
            "source_file": "demo.csv",
        },
    },

    /* Just repeating the default settings for testing here */
    testing: {
        fixtures_dir: "fixtures/",
        snapshots_dir: "snapshots/",
    }
}

...

The kee section configures a few basic parameters of the whole process. The full section is optional.

Key | Data Type | Description
pipelet | String

The name of the pipelet when uploading the KEE project to the Squirro server.

This should be unique across your KEE projects to avoid any collision.

database | String

File name, relative to the configuration file directory, where the lookup database is located.

Default: "db/lookup.json"

version | String

A version indicator for the current KEE configuration. This can be modified whenever the strategy or the data changes significantly, thus warranting a re-tagging of previous items.

The kee rerun command makes use of this information to select older items for re-tagging.

Default: not set

version_keyword | String

Name of the keyword to use for version tagging on items. If the version is set, then it is written into a keyword of this name on the item.

Default: "KEE Version"

debug | Boolean

If true, a few debug options are enabled. For example the lookup database is written in a more human-readable format.

Default: false

Example usage:

Code Block
languagejs
{
    "kee": {
        "database": "db/product_list.json",
        "debug": false
    }
}
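
When the strategy or the data changes, the version keys can be set alongside the keys shown above. The following fragment is a sketch only: the version value "2" is a made-up example, and version_keyword simply repeats the documented default.

```javascript
{
    "kee": {
        "database": "db/product_list.json",
        // Hypothetical version value; change it when strategy or data change significantly
        "version": "2",
        // Same as the documented default
        "version_keyword": "KEE Version",
        "debug": false
    }
}
```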

...

This section is optional if you are not using the commands that connect to Squirro (get_fixture or upload).

Key | Data Type | Description
cluster | String

The endpoint where Squirro has been installed.

token | String | The authentication token used to log into the system. Treat this token as confidential: if the config file is shared, do not include the token in it. See the environment variables section for an alternative.
project_id | String | Identifier of the project where Known Entity Extraction is used.

Example usage:

Code Block
languagejs
{
    "squirro": {
        "cluster": "http://www.mysquirrocluster.com",
        "token": "MY-ACCESS-TOKEN",
        "project_id": "MY-PROJECT-ID"
    }
}

...

Each source takes the following configuration.

...

The configuration has two parts. The first is the actual data connection, for which the Data Loader options, including Data Loader Plugins, can be used. The second is the KEE behaviour for that particular source.

Key | Data Type | Description
Data Source
source_type | String

Which source type to connect to. Valid values: csv, excel, database.


source_script | String | The Data Loader Plugin to load. This key and source_type are mutually exclusive.
...

Additional connection options are specified as key/value pairs as well.

Use the same options as for the data loader, with the dashes replaced with underscores. For example if the data loader is invoked with squirro_data_load --source-type csv --source-file test.csv, then the source configuration in KEE is:

Code Block
languagejs
{
    "source_type": "csv",
    "source_file": "test.csv",
}
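
The renaming rule above (leading dashes dropped, inner dashes replaced with underscores) can be sketched as a small helper. This is an illustration only, not part of KEE or the Data Loader: the function name loaderArgsToSourceConfig is hypothetical, and treating an option without a value as true is an assumption.

```javascript
// Hypothetical helper: convert Data Loader command-line options
// into the key/value pairs of a KEE source configuration.
function loaderArgsToSourceConfig(args) {
    const config = {};
    for (let i = 0; i < args.length; i++) {
        const arg = args[i];
        if (!arg.startsWith("--")) {
            continue; // skip anything that is not an option name
        }
        // "--source-type" becomes "source_type"
        const key = arg.slice(2).replace(/-/g, "_");
        const next = args[i + 1];
        if (next !== undefined && !next.startsWith("--")) {
            config[key] = next;
            i++; // the value has been consumed
        } else {
            config[key] = true; // assumption: options without a value are flags
        }
    }
    return config;
}

const sourceConfig = loaderArgsToSourceConfig(
    ["--source-type", "csv", "--source-file", "test.csv"]
);
console.log(JSON.stringify(sourceConfig, null, 4));
```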

Check the Data Loader Reference for all the possible options. Plugin-specific options are documented in each plugin.

KEE Configuration
strategy | String | The name of the strategy to use for matching on this data source. This must reference a strategy key defined in the Strategies section of the config file.
field_id | String | The name of the column in the input data that uniquely identifies a row.
generate_id | Boolean | Automatically generate unique identifiers for rows. Can be used if (and only if) field_id is not specified.
field_matching | List | Names of the columns of the input data that contain the object names used for the KEE matching. This is generally the primary name column and often an alias column.
hierarchy | String

Specifies a hierarchy in the data. This hierarchy can be used in the tagging, for example to tag an item with the matching company and all of its parent companies as well.

The format of this configuration is Parent Column -> Child Column.

See the hierarchy section for examples.

multivalue | List
A list of column names that can contain multiple values. This is commonly used for the alias column.

The default separator for multiple values in the source data is the pipe (|) but this can be changed by specifying the separator after a colon.

Example:

Code Block
languagejs
{
    "sources": {
        "companies": {
            // Other keys omitted for clarity
            …
            "multivalue": [
                "Aliases",  // Aliases are multiple values, separate with pipe |
                "Sectors:,",  // Sectors are multiple values, separate with comma ,
            ]
        }
    }
}
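
To make the "Column:separator" notation concrete, the following sketch parses multivalue entries and splits the matching columns of one input row. It is not KEE's implementation: the function names and the sample row are hypothetical, and trimming whitespace around the values is an assumption.

```javascript
// Hypothetical helper: split "Sectors:," into column name and separator.
// Without a colon, the documented default separator (pipe) is used.
function parseMultivalueSpec(spec) {
    const idx = spec.indexOf(":");
    if (idx === -1) {
        return { column: spec, separator: "|" };
    }
    return { column: spec.slice(0, idx), separator: spec.slice(idx + 1) };
}

// Hypothetical helper: turn the multivalue columns of a row into lists.
function expandRow(row, multivalue) {
    const result = { ...row };
    for (const spec of multivalue) {
        const { column, separator } = parseMultivalueSpec(spec);
        if (typeof row[column] === "string") {
            result[column] = row[column]
                .split(separator)
                .map((value) => value.trim()) // assumption: surrounding spaces are ignored
                .filter((value) => value.length > 0);
        }
    }
    return result;
}

// Made-up sample row
const expanded = expandRow(
    { Name: "Acme Inc", Aliases: "Acme|Acme Corp", Sectors: "Parts, Tools" },
    ["Aliases", "Sectors:,"]
);
console.log(expanded);
```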

Testing

The optional testing section specifies where the test files and snapshots are located. See KEE Testing for details on the testing process.

...

The valid configuration keys in this section are:

Key | Data Type | Description
fixtures_dir | String

Folder name, relative to the configuration file directory, where the test fixtures are located.

Default: "fixtures"

snapshots_dir | String

Folder name, relative to the configuration file directory, where the snapshots are stored.

Default: "snapshots"

Extraction

The extraction section configures how Squirro items are processed during Known Entity Extraction.

Key | Data Type | Description
item_fields | List

Which item fields to use for Known Entity Extraction. See Item Format for possible values.

Default: ["title", "body"]
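
As a sketch, restricting the extraction to item titles could look as follows. This assumes that a list with a single entry is accepted; "title" is one of the two default fields listed above.

```javascript
{
    "extraction": {
        // Only match entities in the item title, not in the body
        "item_fields": ["title"]
    }
}
```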

Strategies

A strategy in KEE defines how the mapping is executed and which keywords should be added to Squirro items upon a successful match.

...

Code Block
languagejs
{
    // Most keys omitted for brevity

    "sources": {
        "clients": {
            "source_type": "csv",
            "source_file": "…",
            "strategy": "companies"
        },
    },

    "strategies": {
        "companies": {
            "tokenizer": "default",
            // …
        },
    },
}

The following table describes all the configuration keys with which a strategy can be tailored to specific requirements.

Key | Data Type | Description
Matching
tokenizer | String

For processing the text input, the text is split into individual tokens. The tokenizer and the filters specify how this is done.

Supported tokenizers:

  • default: this splits the input on common word boundaries, sentences, etc.
  • brackets: removes any trailing brackets. This may be useful in a context where the entity names have descriptions in brackets or parentheses that should be ignored, e.g. "Acme Inc. (Parts supplier)".
filters | List

Together with the tokenizer, the filters specify how text is matched. The filters determine how much leniency is applied when matching and make sure that different spellings of a word can still be matched.

Available filters are:

  • camelcase: Return one token for each camel case component. Camel case is the concept of mixing upper and lower case letters in the same word (e.g. TechCrunch, JPMorgan, etc.). With this filter the matching does not distinguish between writing the components together or separately, so that "JP Morgan", "JPMorgan" or "JpMorgan" all correctly match the entity "JPMorgan". To use this filter, it must be listed before the lowercase filter.
  • initials: Combines one-letter initials together. This way writing "JP Morgan", "J & P Morgan" or "J.P. Morgan" all have the same effect. To use this filter, it must be listed before the lowercase filter.
  • lowercase: Converts the text into lowercase, thus making the matching case insensitive.
  • singular: A very basic singular filter that works by removing trailing s-letters from longer words. When this is used, writing "MacDonald" and "MacDonalds" has the same effect.
  • accents: Normalize accents and umlauts. When using this, "Crédit Agricole" and "Credit Agricole" will match each other.
  • stem: Uses a Porter stemmer to reduce a word to its stem (e.g. 'waiting' → 'wait'). This is generally not recommended for data sources with proper names, but may be useful for generic language concepts.

Default: by default only the lowercase filter is applied.
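
The effect of the filter order can be illustrated with a small sketch. The three filters below are simplified stand-ins written for this example, not KEE's actual implementation, and the whitespace tokenizer is likewise an assumption.

```javascript
// Illustrative filters only; the real KEE filters may behave differently.
const FILTERS = {
    // Split at lower-to-upper boundaries and before the last capital of an
    // upper-case run: "TechCrunch" -> "Tech", "Crunch"; "JPMorgan" -> "JP", "Morgan"
    camelcase: (tokens) =>
        tokens.flatMap((t) => t.split(/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/)),
    // Make matching case insensitive
    lowercase: (tokens) => tokens.map((t) => t.toLowerCase()),
    // Decompose accented characters and drop the combining marks
    accents: (tokens) =>
        tokens.map((t) => t.normalize("NFD").replace(/[\u0300-\u036f]/g, "")),
};

function applyFilters(text, filterNames) {
    // Stand-in tokenizer: split on whitespace only
    let tokens = text.split(/\s+/).filter((t) => t.length > 0);
    for (const name of filterNames) {
        tokens = FILTERS[name](tokens);
    }
    return tokens;
}

// camelcase before lowercase: both spellings yield the same tokens
console.log(applyFilters("JPMorgan", ["camelcase", "lowercase"]));  // [ 'jp', 'morgan' ]
console.log(applyFilters("JP Morgan", ["camelcase", "lowercase"])); // [ 'jp', 'morgan' ]
// lowercase first: the case information is gone, so nothing is split
console.log(applyFilters("JPMorgan", ["lowercase", "camelcase"]));  // [ 'jpmorgan' ]
console.log(applyFilters("Crédit Agricole", ["accents", "lowercase"])); // [ 'credit', 'agricole' ]
```

This also shows why camelcase has to be listed before lowercase: once the text is lowercased, there are no case boundaries left to split on.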

min_score | Float

The minimum score required for a token to match. 1.0 is a perfect match, 0.0 is no match at all.

Use KEE Testing to find the right balance for each use case.

Turn on verbose logging or tracing (see the --trace argument of kee test) to see the score that tokens receive.

Default: 0.9

spellfix | Boolean

Allow small spelling mistakes. This allows at most one letter swap, so e.g. "Apple" and "Appel" match each other.

Default: false
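
One way to read "at most one letter swap" is a single transposition of two adjacent letters. The check below sketches that reading; it is an assumption about the behaviour, not KEE's actual algorithm, and the function name is hypothetical.

```javascript
// Hypothetical check: are the two words equal, or equal after
// swapping exactly one pair of adjacent letters?
function differsByOneSwap(a, b) {
    if (a === b) {
        return true;
    }
    if (a.length !== b.length) {
        return false;
    }
    // Collect the positions at which the two words differ
    const diffs = [];
    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) {
            diffs.push(i);
        }
    }
    // Exactly two neighbouring positions with crossed-over letters
    return diffs.length === 2 &&
        diffs[1] === diffs[0] + 1 &&
        a[diffs[0]] === b[diffs[1]] &&
        a[diffs[1]] === b[diffs[0]];
}

console.log(differsByOneSwap("apple", "appel")); // true
console.log(differsByOneSwap("apple", "apply")); // false
```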

blacklist | List

A list of entity names to ignore. If any of the field_matching (see sources) column values is contained in this list, the entity is never tagged.

suffix_list | String | The suffix list used to remove common suffixes from entity names. See the Suffix list section below for details.
geo_strategy | String

How to deal with geographic names in entity names. Possible values:

  • noop (the default): Don't handle geographic names at all.
  • ignore: Detect common geographic names and ignore them for matching. This especially affects trailing geographic names and means that a company designation like "Acme Inc" is matched the same way as e.g. "Acme Inc - Switzerland".
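
Put together, the matching keys of a strategy might be combined as in the following fragment. It is a sketch under assumptions: the strategy name "companies" is taken from the earlier example, the blacklist entry is made up, and the chosen values are illustrative rather than recommended.

```javascript
{
    "strategies": {
        "companies": {
            "tokenizer": "default",
            // camelcase and initials are listed before lowercase, as required
            "filters": ["camelcase", "initials", "lowercase", "accents"],
            "min_score": 0.9,
            "spellfix": true,
            // Hypothetical entry: never tag an entity with this name
            "blacklist": ["Company"],
            "geo_strategy": "ignore"
        }
    }
}
```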
Keywords
keywords | List

The keywords section defines which keywords are added to a Squirro item based on any matching entity. This is a list of keywords that can be added, where each individual entry contains the input file column to write and the keyword name into which to store it.

The target value can make use of simple template substitution to add keyword names based on the data of the matching row. The syntax is a field name surrounded by curly brackets.

Example:

Code Block
languagejs
…
    "keywords": [
        "Name",  // Takes the "Name" column and writes it into a "Name" keyword
        "Id -> Company ID",  // Writes the "Id" column into a "Company ID" keyword
        // Adds the "Id" a second time. If the "Type" of the match is e.g. "SME" then
        // this adds the keyword "SME ID".
        "Id -> {Type} ID",
    ]
…
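
The rule syntax can be sketched in a few lines. The helpers below are hypothetical and only illustrate how a "Column -> Keyword" entry with a {Field} placeholder could be resolved against a matching row; leaving unknown placeholders untouched is an assumption.

```javascript
// Hypothetical helper: "Id -> Company ID" => { column: "Id", target: "Company ID" }
// A bare column name is used as the keyword name as well.
function parseKeywordRule(rule) {
    const parts = rule.split("->").map((part) => part.trim());
    const column = parts[0];
    const target = parts.length > 1 ? parts[1] : column;
    return { column, target };
}

// Replace {Field} placeholders with the value of that field in the row
function renderTarget(template, row) {
    return template.replace(/\{([^}]+)\}/g, (match, field) =>
        field in row ? String(row[field]) : match // assumption: unknown fields stay as-is
    );
}

function keywordsForMatch(row, rules) {
    const keywords = {};
    for (const rule of rules) {
        const { column, target } = parseKeywordRule(rule);
        if (!(column in row)) {
            continue;
        }
        const name = renderTarget(target, row);
        keywords[name] = keywords[name] || [];
        keywords[name].push(row[column]);
    }
    return keywords;
}

// Made-up matching row
const keywords = keywordsForMatch(
    { Name: "Acme Inc", Id: "42", Type: "SME" },
    ["Name", "Id -> Company ID", "Id -> {Type} ID"]
);
console.log(keywords);
// { Name: [ 'Acme Inc' ], 'Company ID': [ '42' ], 'SME ID': [ '42' ] }
```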
parent_keywords | List

The same setting as keywords but for all the parents.

This recursively processes all parent entities (if any) and sets keywords on the item based on these rules.

For this to work there must be a hierarchy parameter on the data source. See the hierarchy section for examples.
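
The recursive processing of parents can be pictured as a walk up the hierarchy. The sketch below assumes rows with Id and ParentId columns and a rule that writes each parent's Name into a "Parent Name" keyword; the data and the function name are made up for illustration.

```javascript
// Made-up rows: each child points to its parent through ParentId
const rows = [
    { Id: "1", ParentId: "", Name: "Acme Group" },
    { Id: "2", ParentId: "1", Name: "Acme Europe" },
    { Id: "3", ParentId: "2", Name: "Acme Switzerland" },
];
const rowsById = new Map(rows.map((row) => [row.Id, row]));

// Hypothetical helper: return all ancestors of a row, nearest first
function collectParents(row) {
    const parents = [];
    const seen = new Set([row.Id]);
    let current = row;
    while (current.ParentId && rowsById.has(current.ParentId)) {
        const parent = rowsById.get(current.ParentId);
        if (seen.has(parent.Id)) {
            break; // guard against cycles in the data
        }
        seen.add(parent.Id);
        parents.push(parent);
        current = parent;
    }
    return parents;
}

// A match on "Acme Switzerland" collects the names of both ancestors
const parentNames = collectParents(rowsById.get("3")).map((parent) => parent.Name);
console.log({ "Parent Name": parentNames });
// { 'Parent Name': [ 'Acme Europe', 'Acme Group' ] }
```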

clean_keywords | List

A list of keywords that should be removed from the items before applying the KEE tagging.

This is useful when re-running KEE tagging to ensure that old keywords are removed.

Example:

Code Block
languagejs
…
    "clean_keywords": ["Name", "Company ID"],

    "keywords": [
        "Name", "Id -> Company ID",
    ]
…

Suffix list

The suffix list is used by the strategy to ignore common suffixes. Examples of such suffixes:

...

The following is a reference of all of the keys in an individual ngram section.

Key | Data Type | Description
source | String

Folder name, relative to the configuration file directory, where the ngram database is located.

Please contact support to get access to Squirro's pre-compiled language models.

default_language | String

Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item that is being processed, then the default language is read.

Default: en

whitelist | List

A list of company names for which ngram correction is not done. This is useful in corner cases where a lax match is desired even though the company name is penalized by the language model.

Example:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "whitelist": ["Apple"],
    }
},
common | List

A list of prefixes that should be treated as common language terms. This can be used to override the language model and be stricter about certain words, which is sometimes necessary to correct imprecise matching of prefix words from other languages. For example, "Svensk" is the Swedish word for "Swedish", so a company whose name starts with "Svensk" may just be saying "Swedish Acme Corp.". Such a company should therefore not match based on "Svensk" alone.

The following example snippet takes care of this problem:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "common": ["Svensk"],
    }
},

Environment Variables

The settings in the squirro section can also be set through environment variables. That is especially helpful to avoid writing the token into the config file.

...

Code Block
languagejs
config.json
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",

            "strategy": "demo",
            "multivalue": "Aliases",
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id",
        }
    },

    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Depending on Type we assign different keywords
            "keywords": "Name -> Name",
            "parent_keywords": "Name -> Parent Name",
        },
    },
}