Smart Filters can be trained from existing text or can be edited manually. This section covers the various ways of creating and editing Smart Filters.

Searching

Noise Level

Concept

When matching documents to a Smart Filter, each document is compared to the Smart Filter. The more closely the document matches the concept, the higher its score.

The noise level determines which documents are returned based on their score. The lower the noise level, the more precise the match has to be. At noise level 1.0 (the highest noise level), all results that match at least one entity are returned. At lower levels, the matching results are ordered by relevance and low-relevance matches are eliminated from the result set. This elimination is not linear: noise level 0.1 is much stricter than 0.2, for example.

Finding the right noise level

Which noise level is right for a given Smart Filter and use case depends on the requirements of precision vs. recall. Is it more important that the returned results are all relevant or is it preferable to have everything potentially relevant included?

As the default sorting is by relevance, it will often not be readily apparent how a different noise level changes the result set. To understand the relationship between the result set and the noise level in a given Smart Filter, sort the results by date. Then it will be much easier to judge the result relevance.

Setting the noise level

The noise level can be set in three ways:

Use the noise level slider in the Smart Filter drop-down
Use the same slider in the Smart Filter edit view header
Change it directly in the query. The query syntax is "smartfilter:SMARTFILTER_NAME:NOISE_LEVEL".
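
The query syntax above can be assembled programmatically. The following is a minimal sketch: the helper name smartfilter_term and the Smart Filter name annual_reports are hypothetical, and the accepted range of noise levels is an assumption based only on the statement that 1.0 is the highest level.

```python
def smartfilter_term(name: str, noise_level: float) -> str:
    """Build a query term of the form smartfilter:NAME:NOISE_LEVEL."""
    # Assumption: valid noise levels lie in (0, 1.0]. The documentation only
    # states that 1.0 is the highest level.
    if not 0.0 < noise_level <= 1.0:
        raise ValueError("noise level is assumed to lie in (0, 1.0]")
    return f"smartfilter:{name}:{noise_level}"


# Hypothetical Smart Filter name, used for illustration only.
print(smartfilter_term("annual_reports", 0.2))  # smartfilter:annual_reports:0.2
```

The resulting string is pasted into the query like any other search term.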

Multiple Smart Filters

Multiple Smart Filters can be combined in a single query, in which case they are joined with an AND operator: only results that match all of the filters are returned. Other boolean operators are not supported for Smart Filters, so OR and NOT will not work.

To work around this, search tagging can be used with Smart Filters. For this, assign keywords by creating a search tagging rule for each Smart Filter. Those keywords then support boolean search as described in the Query Syntax.

Explain

Explain mode can be enabled in the Smart Filter drop-down.

With explain mode enabled, each matching document includes some details on why it matched the Smart Filter.

Limitations

A Smart Filter query returns at most the 10,000 most recent results. This limit exists for performance reasons, because the noise level calculation has to be done for every matching document.

If a project needs to reliably return more than 10,000 results, combine Smart Filters with search tagging rules.

Training

Editing Workflow

When editing a Smart Filter in the user interface, changes are not written to the system until the Save button is pressed. This makes it possible to experiment with adding training, excluding tokens, etc. without fear of messing up the Smart Filter for the users.

Automated Training

Improving Relevance

When training Smart Filters from documents, terms that appear rarely (or not at all) in the language model gain a lot of importance. This often happens with names (people, companies, places) or misspelled words. There are a few strategies to improve the situation when that happens:

Increase the min_feature_count configuration setting (see fingerprint.ini). This way a term needs to appear more often in the training documents to be considered.
The default value of min_feature_count is 1, which means that a term can be included in the Smart Filter even if it appears only once across all the training documents. Such terms are typically the ones missing from the language model, which is why they are calculated to have a high weight.
Exclude irrelevant terms. Note that this may simply promote the next worst term, so reducing the number of entities should also be considered (see Max. Number of Entities below).
Remove irrelevant content from the training documents, for example headers or footers. This can often be achieved with enrichments, e.g. a pipelet.
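
The effect of min_feature_count can be illustrated with a small sketch. This is not the fingerprint service's implementation: the function candidate_terms, the whitespace tokenization, and the sample documents are all assumptions made for illustration. It only shows how a minimum count across all training documents filters out one-off terms such as typos.

```python
from collections import Counter


def candidate_terms(training_docs, min_feature_count=1):
    """Return terms whose total count across all documents reaches the minimum."""
    counts = Counter()
    for doc in training_docs:
        # Naive lowercase/whitespace tokenization, for illustration only.
        counts.update(doc.lower().split())
    return {term for term, n in counts.items() if n >= min_feature_count}


docs = [
    "annual results published by acmee",  # "acmee" is a one-off typo
    "annual results exceed forecast",
]
print(sorted(candidate_terms(docs, min_feature_count=1)))  # includes "acmee"
print(sorted(candidate_terms(docs, min_feature_count=2)))  # ['annual', 'results']
```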

Negative Training

When training a Smart Filter, items can be added either as positive or as negative items. The negative content trains the Smart Filter to detect content that should be excluded.

The considerations under Improving Relevance apply even more when using Negative Training; otherwise you quickly end up with nonsensical concepts.

Max. Number of Entities

By default 30 terms are extracted from the training content. This can be modified in the advanced screen.

When a Smart Filter doesn't have a lot of relevant terms, excluding terms simply promotes the next worst term. In those cases (when the concept is smaller than the default), it makes sense to reduce the number of terms.

In other cases where the top 30 terms are all highly relevant, it makes sense to increase the number and see if there are more relevant terms to be displayed.
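
The interaction between excluding terms and the maximum number of entities can be sketched as follows. The function top_terms, the term weights, and the sample terms are hypothetical; the sketch only illustrates why excluding a term promotes the next term in line, while lowering the limit does not.

```python
def top_terms(weighted_terms, max_entities=30, excluded=()):
    """Keep the highest-weighted terms that are not excluded."""
    kept = [(t, w) for t, w in weighted_terms.items() if t not in excluded]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [t for t, _ in kept[:max_entities]]


# Made-up weights: two relevant terms, one name, one piece of footer noise.
weights = {"merger": 0.9, "acquisition": 0.8, "jsmith": 0.7, "footer": 0.2}

print(top_terms(weights, max_entities=3))                       # ['merger', 'acquisition', 'jsmith']
print(top_terms(weights, max_entities=3, excluded={"jsmith"}))  # ['merger', 'acquisition', 'footer']
print(top_terms(weights, max_entities=2))                       # ['merger', 'acquisition']
```

In this toy example, excluding "jsmith" only lets "footer" move up, whereas reducing the limit to the number of genuinely relevant terms leaves a clean concept.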

Manual Smart Filters

The taxonomy for Manual Smart Filters uses a CSV format. The syntax for each entry is:

query, weight, language, label

query: the searched term. This is the only mandatory value.
weight: the importance of the entry amongst all the other terms. The default is 1.
language: the language for this entry. If left empty, the default language is used, which can be defined in the taxonomy screen as well.
label: the title of this entry, as displayed to the user by the system. If left empty, the query is displayed.
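
The defaults described above can be made explicit with a small parser. This is a sketch using Python's standard csv module, not Squirro code: the function parse_entries and the default language "en" are assumptions (the actual default language is whatever is configured in the taxonomy screen).

```python
import csv
import io


def parse_entries(text, default_language="en"):
    """Parse 'query, weight, language, label' rows, filling in defaults."""
    entries = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row or not row[0].strip():
            continue  # skip blank lines; query is the only mandatory value
        # Pad missing trailing columns with empty strings.
        query, weight, language, label = (row + ["", "", ""])[:4]
        entries.append({
            "query": query,
            "weight": float(weight) if weight.strip() else 1.0,  # default weight is 1
            "language": language.strip() or default_language,
            "label": label or query,  # an empty label falls back to the query
        })
    return entries


sample = '"annual return",2,en,"Annual returns"\n"dividend"\n'
for entry in parse_entries(sample):
    print(entry)
```

The second sample row contains only a query, so it receives weight 1, the default language, and the query itself as its label.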

Language

The language is the two-letter language code, as seen in the following table:

...

Language

...

Syntax

...

Fingerprint Stemming

The query must be stemmed based on the rules of the Squirro index.

For example, the following manual Smart Filter does not return any results:

"annual returns",1,en,"annual returns"

Instead, this has to be converted to this entry:

"annual return",1,en,"annual returns"

Having to do this manually is quite cumbersome, and the next iteration of Smart Filters will handle this automatically. In the meantime, the stem_fingerprint.py script can be used to avoid the manual work. Download the script to the Squirro server and execute it on the command line as follows:

python stem_fingerprint.py fingerprint.csv

This assumes that the fingerprint file has been downloaded from the web interface and is stored in the "fingerprint.csv" file. The script outputs the stemmed queries, and the result can then be pasted into the taxonomy window.

Languages

A Smart Filter can be trained with documents of multiple languages. Results are returned for each language in which a Smart Filter has been trained.

If a Smart Filter has more than one language, the tag cloud only displays one language at any time. A small drop-down at the top of the tag cloud can be used to switch the language and see the terms in other languages.

Out of the box, Squirro Smart Filters support the following languages:

English
German
Italian
French
Spanish
Russian
Portuguese
Chinese

However, the basic concept of Smart Filters works for any language. To add a new language to the system, Squirro needs to be trained once to understand the word frequencies of that language. This is done by creating a GDFS file for that language.

Locking

Smart Filters can be locked in the advanced screen. Once locked, no changes can be made to the Smart Filter without unlocking it first. Additionally, only project administrators can change the lock status of a Smart Filter.

This is a good way of ensuring that Smart Filters that are important to a project can't be changed without thinking about it. Consider locking Smart Filters that are used for dashboards or search taggings.

Smart Filter Configuration

The behavior of Smart Filters can be changed in the configuration of the fingerprint service ("fingerprint" is the legacy name for Smart Filters). See the fingerprint.ini file for the available options.

GDFS Files

When training a Smart Filter, Squirro compares the terms in the training documents to the expected term frequency in the given language. For example if the English sentence "the annual results are here" is used, then the terms "the", "are" and "here" should probably not be considered to be interesting terms for the Smart Filter. The language model behind this is called Global Document Frequencies (GDFS). Squirro comes with pre-built GDFS files for the supported languages.
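
The idea of weighting terms against an expected background frequency can be sketched as follows. The exact formula Squirro uses is not given here, so this uses a generic inverse-frequency weighting as a stand-in; the frequency values, the floor for unseen terms, and the function term_weights are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical background model: fraction of documents containing each term.
GLOBAL_DOC_FREQ = {"the": 0.95, "are": 0.80, "here": 0.40, "results": 0.05, "annual": 0.02}
UNSEEN_FREQ = 0.001  # assumed floor for terms missing from the model


def term_weights(text):
    """Weight each term by its count times the log of its inverse background frequency."""
    counts = Counter(text.lower().split())
    return {
        term: n * math.log(1.0 / GLOBAL_DOC_FREQ.get(term, UNSEEN_FREQ))
        for term, n in counts.items()
    }


weights = term_weights("the annual results are here")
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {w:.2f}")
```

Under these assumed frequencies, "annual" and "results" rank well above "the", "are" and "here". A term that is missing from the model, such as a name or a typo, would receive the largest weight of all, which is the effect described under Improving Relevance.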

In some use cases it may make sense to build the GDFS files based on the data seen in a specific project. This allows Squirro to normalize for the usage of industry terms or company-internal jargon. To create and use such custom GDFS files, Squirro provides the Global Document Frequency Tool. See that page for instructions on how to use this tool.

Bulk Scoring

The scores of any given document in Squirro against any of the Smart Filters can be calculated and exported using the bulk scoring command line tool. The exported scores can then be used for further analysis in third-party tools, such as Business Intelligence solutions.

...

This page can now be found at Smart Filters on the Squirro Docs site.
