...
The required frequency model is created using the ngram definition. Please contact support to get access to Squirro's pre-compiled language models.
...
Key | Data Type | Description
---|---|---
source | String | Folder name, relative to the configuration file directory, where the ngram database is located. Please contact support to get access to Squirro's pre-compiled language models.
default_language | String | Sets the default language for language model lookups. When the ngram folder does not contain a model for the language of the Squirro item being processed, the model for the default language is used instead. Default: en
whitelist | List | A list of company names for which ngram correction is skipped. This is useful in corner cases where a lax match is desired even though the language model penalizes the company name. Example:
common | List | A list of prefixes that should be treated as common language terms. This can be used to override the language model and make it stricter about certain words. This is sometimes necessary to correct imprecise matching for prefix words from other languages. For example, "Svensk" is the Swedish word for "Swedish". A company whose name starts with "Svensk" may just be saying "Swedish Acme Corp.", so it should not match based on "Svensk" alone appearing in a text. The following example snippet takes care of this problem:
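To illustrate the intent of the `common` setting, here is a minimal Python sketch. It is not Squirro's implementation: the function name `is_prefix_only_match` and the settings dictionary are hypothetical, and only the key names `whitelist` and `common` and the "Svensk" value come from the table above.

```python
# Hypothetical settings; only the key names and "Svensk" come from the documentation.
ngram_settings = {
    "whitelist": [],
    "common": ["Svensk"],
}


def is_prefix_only_match(matched_text, entity_name, common_prefixes):
    """Return True when the matched text is nothing more than a common
    prefix of the entity name (assumed semantics, for illustration only)."""
    matched = matched_text.strip().lower()
    commons = {prefix.lower() for prefix in common_prefixes}
    return matched in commons and entity_name.lower().startswith(matched)


# A hit on "Svensk" alone should not count as a match for the company.
weak_hit = is_prefix_only_match(
    "Svensk", "Svensk Acme Corp.", ngram_settings["common"]
)
# A hit that covers more than the common prefix is not filtered out.
strong_hit = is_prefix_only_match(
    "Svensk Acme", "Svensk Acme Corp.", ngram_settings["common"]
)
```

Under these assumptions `weak_hit` is true and `strong_hit` is false, which mirrors the behavior described for prefixes listed under `common`.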
...
SQ_CLUSTER
SQ_TOKEN
SQ_PROJECT_ID
Custom KEE Pipelet
The KEE pipelet can be extended to allow further control over the KEE processing of the items. The following template can be used:
```python
import squirro.sdk.kee.pipelet


class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
    def consume(self, item):
        # Pre-processing of item goes here

        # KEE processing
        item = super(CustomKEEPipelet, self).consume(item)

        # Post-processing of item goes here
        return item
```
Examples
The following sections give a few examples of how to achieve common use cases.
...
```json
{
    "sources": {
        "demo": {
            "source_type": "csv",
            "source_file": "hierarchy.csv",
            "strategy": "demo",
            "multivalue": "Aliases",
            "field_id": "Id",
            "field_matching": ["Name", "Aliases"],

            // The data is hierarchical, with the children declaring their
            // parent (ParentId field points to a valid Id from another row).
            "hierarchy": "ParentId->Id"
        }
    },
    "strategies": {
        "demo": {
            // Score at which the hit is a good one
            "min_score": 0.6,

            // Depending on Type we assign different keywords
            "keywords": "Name -> Name",
            "parent_keywords": "Name -> Parent Name"
        }
    }
}
```
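The `hierarchy` setting `ParentId->Id` says that each row's `ParentId` points at the `Id` of another row. The following Python sketch shows what that relationship means for the data. It does not reproduce how KEE resolves the hierarchy internally: the sample rows, `load_rows`, and `resolve_parents` are hypothetical, and only the column names come from the configuration above.

```python
import csv
import io

# Hypothetical sample data; the real hierarchy.csv is not shown here.
SAMPLE_CSV = """Id,Name,Aliases,ParentId
1,Acme Holding,Acme Group,
2,Acme Switzerland,Acme CH,1
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def resolve_parents(rows, child_field="ParentId", parent_field="Id"):
    """Map each entity name to its parent's name, or None for top-level rows."""
    by_id = {row[parent_field]: row for row in rows}
    resolved = {}
    for row in rows:
        parent = by_id.get(row[child_field])
        resolved[row["Name"]] = parent["Name"] if parent else None
    return resolved


rows = load_rows(SAMPLE_CSV)
parents = resolve_parents(rows)
```

With this sample data, "Acme Switzerland" resolves to the parent "Acme Holding", which is the value a rule such as `"parent_keywords": "Name -> Parent Name"` would presumably draw on.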
Custom KEE Pipelet
You might want to exclude part of the item body when running KEE. This can be achieved in a custom KEE pipelet by modifying the item before running KEE and restoring it afterwards. This example runs KEE on only the first 100 words of the body.
```python
import squirro.sdk.kee.pipelet


class CustomKEEPipelet(squirro.sdk.kee.pipelet.KeePipelet):
    def consume(self, item):
        # Keep the full body so it can be restored after KEE has run
        body = item['body']

        # Run KEE on the first 100 space-separated words only
        body_short = ' '.join(body.split(' ')[0:100])
        item['body'] = body_short
        item = super(CustomKEEPipelet, self).consume(item)

        # Restore the original, untruncated body
        item['body'] = body
        return item
```
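The truncate-and-restore pattern can be tried without a Squirro installation. The sketch below is a variation on the example, not the SDK itself: `StubKeePipelet` is a hypothetical stand-in for `squirro.sdk.kee.pipelet.KeePipelet`, and the `seen_by_kee` key exists only to make the effect visible. It also restores the body in a `finally` block, so the original text survives even if the KEE step raises.

```python
class StubKeePipelet:
    """Hypothetical stand-in for squirro.sdk.kee.pipelet.KeePipelet."""

    def consume(self, item):
        # Record the text that the KEE step would have processed.
        item["seen_by_kee"] = item["body"]
        return item


class TruncatingPipelet(StubKeePipelet):
    MAX_WORDS = 100

    def consume(self, item):
        body = item.get("body", "")
        # split() without arguments also collapses repeated whitespace.
        item["body"] = " ".join(body.split()[: self.MAX_WORDS])
        try:
            item = super().consume(item)
        finally:
            # Always put the full body back, even when KEE fails.
            item["body"] = body
        return item


sample = {"body": " ".join("w%d" % i for i in range(250))}
result = TruncatingPipelet().consume(sample)
```

Here the stubbed KEE step sees 100 words while `result["body"]` still holds all 250.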