
...

The required frequency model is created using the ngram definition. Please contact support to get access to Squirro's pre-compiled language models.

...

Key

Data Type

Description

source

String

Folder name, relative to the configuration file directory, where the ngram database is located.

Please contact support to get access to Squirro's pre-compiled language models.

default_language

String

Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item being processed, the model for the default language is used instead.

Default: en

whitelist

List

A list of company names for which ngram correction is skipped. This is useful in corner cases where a lax match is desired even though the language model penalizes the company name.

Example:

Code Block

language	js

// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "whitelist": ["Apple"],
    }
},

common

List

A list of prefixes that should be treated as common language terms. This can be used to override the language model and be stricter about certain words, which is sometimes necessary to correct imprecise matching of prefix words from other languages. For example, "Svensk" is the Swedish word for "Swedish", so a company name that starts with "Svensk" may simply mean "Swedish Acme Corp.". Such a company should therefore not match based on the word "Svensk" alone.

The following example snippet takes care of this problem:

Code Block

language	js

// …
"ngram": {
    "companies": {
        "source": "ngram/",
        "common": ["Svensk"],
    }
},
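The behavior of default_language, whitelist and common described in the table can be sketched in Python. This is only an illustration of the documented behavior, not KEE's implementation. The helper names pick_model_language, skips_ngram_correction and has_common_prefix are hypothetical, and so are the assumptions that whitelist entries are compared exactly and that prefix matching is case-sensitive.

```python
def pick_model_language(item_language, available_languages, default_language="en"):
    """Return the language whose model is used for an item (hypothetical helper)."""
    if item_language in available_languages:
        return item_language
    # No model for the item's language: fall back to default_language
    return default_language


def skips_ngram_correction(company_name, whitelist):
    """True if the company is whitelisted, so ngram correction is not applied."""
    # Assumption: exact string comparison against the whitelist entries
    return company_name in whitelist


def has_common_prefix(company_name, common):
    """True if the name starts with a prefix configured as a common term."""
    # Assumption: case-sensitive prefix comparison
    return any(company_name.startswith(prefix) for prefix in common)


print(pick_model_language("sv", {"en", "de"}))
print(skips_ngram_correction("Apple", ["Apple"]))
print(has_common_prefix("Svensk Acme Corp.", ["Svensk"]))
```

In this sketch, an item in Swedish falls back to the "en" model because only "en" and "de" models are assumed to be available.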

...

SQ_CLUSTER
SQ_TOKEN
SQ_PROJECT_ID

Custom KEE Pipelet

The KEE pipelet can be extended to allow further control over the KEE processing of the items. The following template can be used:

Code Block

language	py

import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
	def consume(self, item):
		# Pre-processing of item goes here

		# KEE processing
		item = super(CustomKEEPipelet, self).consume(item)

		# Post-processing of item goes here

		return item

Examples

The following sections give a few examples of how to implement common use cases.

...

Code Block

language	js
title	config.json

{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",

            "strategy": "demo",
            "multivalue": "Aliases",
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id",
        }
    },

    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Depending on Type we assign different keywords
            "keywords": "Name -> Name",
            "parent_keywords": "Name -> Parent Name",
        },
    },
}
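To make the "ParentId->Id" hierarchy setting more concrete, the following Python sketch resolves each row's parent by looking up its ParentId among the Id values of the other rows. It is not how KEE implements the lookup. The sample rows and the resolve_parents helper are hypothetical, and only the column names are taken from the configuration above.

```python
import csv
import io

# Hypothetical contents of hierarchy.csv; column names follow the config above
SAMPLE_CSV = """Id,Name,Aliases,ParentId
1,Acme Holding,,
2,Acme Corp,,1
"""


def resolve_parents(csv_text):
    """Map each row's Name to the Name of its parent (None for top-level rows)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Index the rows by Id so that a ParentId can be looked up directly
    by_id = {row["Id"]: row for row in rows}
    parents = {}
    for row in rows:
        parent = by_id.get(row["ParentId"])
        parents[row["Name"]] = parent["Name"] if parent else None
    return parents


print(resolve_parents(SAMPLE_CSV))
```

With these sample rows, "Acme Corp" resolves to the parent "Acme Holding". That parent name is the kind of value the "parent_keywords" mapping would expose as "Parent Name".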

Custom KEE Pipelet

You might want to exclude part of the item body when running KEE. This can be achieved in a custom KEE pipelet by modifying the item before running KEE and restoring it afterwards. The following example runs KEE on only the first 100 words of the body.

Code Block

language	py

import squirro.sdk.kee.pipelet

class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
	def consume(self, item):
		# Keep the full body so it can be restored after KEE has run
		body = item['body']

		# Run KEE on the first 100 space-separated words only
		body_short = ' '.join(body.split(' ')[0:100])
		item['body'] = body_short
		item = super(CustomKEEPipelet, self).consume(item)

		# Restore the original body
		item['body'] = body
		return item
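One possible refinement, sketched here, restores the original body in a finally block so that the item is left intact even if KEE processing raises an exception. To keep the sketch runnable without the Squirro SDK, StubKeePipelet is a hypothetical stand-in for squirro.sdk.kee.pipelet.KeePipelet that only records the body it receives. The MAX_WORDS constant and the class name TruncatingKEEPipelet are likewise assumptions.

```python
MAX_WORDS = 100  # assumption: same limit as in the example above


class StubKeePipelet(object):
    """Hypothetical stand-in for squirro.sdk.kee.pipelet.KeePipelet."""

    def consume(self, item):
        # Record what "KEE" saw so the effect of the truncation is visible
        self.seen_body = item['body']
        return item


class TruncatingKEEPipelet(StubKeePipelet):
    def consume(self, item):
        body = item['body']
        item['body'] = ' '.join(body.split(' ')[:MAX_WORDS])
        try:
            item = super(TruncatingKEEPipelet, self).consume(item)
        finally:
            # Always put the full body back, even if processing failed
            item['body'] = body
        return item


pipelet = TruncatingKEEPipelet()
item = pipelet.consume({'body': ' '.join(['word'] * 250)})
print(len(pipelet.seen_body.split(' ')), len(item['body'].split(' ')))
```

In a real deployment the class would derive from squirro.sdk.kee.pipelet.KeePipelet instead of the stub. The truncation and restore logic stays the same.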
