...

The required frequency model is created using the ngram definition. Please contact support to get access to Squirro's pre-compiled language models.

...

Key | Data Type | Description
source | String

Folder name, relative to the configuration file directory, where the ngram database is located.

Please contact support to get access to Squirro's pre-compiled language models.

default_language | String

Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item being processed, the model for the default language is used instead.

Default: en

whitelist | List

A list of company names for which ngram correction is not applied. This is useful in corner cases where a lax match is desired even though the language model penalizes the company name.

Example:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "whitelist": ["Apple"],
    }
},


common | List

A list of prefixes that should be treated as common language terms. Use it to override the language model and be stricter about certain words, which is sometimes necessary to avoid imprecise matching on prefix words from other languages. For example, "Svensk" is the Swedish word for "Swedish", so a company name that starts with "Svensk" may simply mean "Swedish Acme Corp." The prefix "Svensk" alone should therefore not produce a match in arbitrary text.

The following example snippet takes care of this problem:

Code Block
languagejs
// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "common": ["Svensk"],
    }
},
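
The default_language fallback described above can be sketched in a few lines of Python. This is only an illustration of the lookup rule, not KEE's actual implementation: the function name and the idea of passing the available model languages as a set are assumptions made for the example.

```python
def resolve_model_language(available_languages, item_language,
                           default_language="en"):
    """Pick the language whose ngram model is used for an item.

    available_languages: set of language codes for which a model exists
    (hypothetical input; how KEE discovers the models is not shown here).
    """
    if item_language in available_languages:
        # A model exists for the item's own language.
        return item_language
    # No model for this language: fall back to the configured default.
    return default_language


# A German item uses the German model, a Swedish item falls back to "en".
models = {"en", "de"}
german_choice = resolve_model_language(models, "de")
swedish_choice = resolve_model_language(models, "sv")
```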


...

  • SQ_CLUSTER
  • SQ_TOKEN
  • SQ_PROJECT_ID

Custom KEE Pipelet

The KEE pipelet can be extended to allow further control over the KEE processing of the items. The following template can be used:

Code Block
languagepy
import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
	def consume(self, item):
		# Pre-processing of item goes here

		# KEE processing
		item = super(CustomKEEPipelet, self).consume(item)

		# Post-processing of item goes here

		return item

Examples

The following sections give a few examples of how to achieve common use cases.

...

Code Block
languagejs
titleconfig.json
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",

            "strategy": "demo",
            "multivalue": "Aliases",
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id",
        }
    },

    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Depending on Type we assign different keywords
            "keywords": "Name -> Name",
            "parent_keywords": "Name -> Parent Name",
        },
    },
}
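
To illustrate what the hierarchy setting "ParentId->Id" expresses, the following sketch resolves each row's parent name from a CSV with the columns used in the configuration above. It is independent of KEE: the sample rows and the function name parent_names are made up for the example, and KEE's internal handling may differ.

```python
import csv
import io

# Hypothetical sample data with the columns referenced in config.json.
SAMPLE_CSV = """Id,Name,Aliases,ParentId
1,Acme Holding,Acme,
2,Acme Labs,Acme Research,1
"""


def parent_names(csv_text):
    """Map each row's Name to the Name of its parent row (or None)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Index rows by Id so that a ParentId can be looked up directly.
    by_id = {row["Id"]: row for row in rows}
    result = {}
    for row in rows:
        parent = by_id.get(row["ParentId"])
        result[row["Name"]] = parent["Name"] if parent else None
    return result


resolved = parent_names(SAMPLE_CSV)
```

In this sketch a top-level row has an empty ParentId and therefore no parent name, while a child row picks up the name that a setting such as "parent_keywords": "Name -> Parent Name" would refer to.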

Custom KEE Pipelet

You might want to exclude part of the item body when running KEE. This can be achieved in a custom KEE pipelet by modifying the item before running KEE and restoring it afterwards. This example runs KEE on only the first 100 words of the body.

Code Block
languagepy
import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
	def consume(self, item):
		# Keep the full body so it can be restored after KEE has run
		body = item['body']

		# Run KEE on the first 100 space-separated words only
		body_short = ' '.join(body.split(' ')[0:100])
		item['body'] = body_short
		item = super(CustomKEEPipelet, self).consume(item)

		# Restore the original body
		item['body'] = body
		return item
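
The truncate-and-restore pattern can be tried without the Squirro SDK. In the following sketch, FakeKeePipelet is a hypothetical stand-in for the real base class that only records how many words it was given. The try/finally block is an optional addition, not part of the example above, that restores the body even if processing raises an exception.

```python
class FakeKeePipelet(object):
    """Hypothetical stand-in for squirro.sdk.kee.pipelet.KeePipelet."""

    def consume(self, item):
        # Record how many words the "KEE processing" actually saw.
        item['seen_words'] = len(item['body'].split(' '))
        return item


class CustomKEEPipelet(FakeKeePipelet):
    MAX_WORDS = 100  # number of leading words passed to KEE

    def consume(self, item):
        body = item['body']
        item['body'] = ' '.join(body.split(' ')[:self.MAX_WORDS])
        try:
            item = super(CustomKEEPipelet, self).consume(item)
        finally:
            # Always put the full body back, even on errors.
            item['body'] = body
        return item


sample_item = {'body': ' '.join(['word'] * 250)}
processed = CustomKEEPipelet().consume(sample_item)
```

After processing, the stand-in has seen only the first 100 words while the item still carries its full 250-word body.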