...

When you edit a Smart Filter in the user interface, your changes are not written to the system until you press the Save button. This makes it possible to experiment with adding training content, excluding tokens, etc. without affecting the Smart Filter that users work with.

Automated Training

Improving Relevance

When training Smart Filters from documents, terms that appear rarely (or not at all) in the language model gain a lot of importance. This often happens with names (people, companies, places) or even misspelled words. There are a few strategies to improve the situation when that happens:

  • Increase the min_feature_count configuration setting (see fingerprint.ini). A word then needs to appear more often in the training text to be considered.
    The default value of min_feature_count is 1, which means that a term can be included in the Smart Filter even if it appears only once in the training content. This is especially likely for terms that are not present in the language model, because they are calculated to have a high weight.
  • Exclude irrelevant terms. Note that this may simply promote the next worst term, so reducing the number of entities should also be considered (see Max. Number of Entities below).
  • Remove irrelevant content, for example headers or footers, from the training documents. This can often be achieved with enrichments, e.g. a pipelet.
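The effect of rare terms and of min_feature_count can be illustrated with a small sketch. It is only a toy model: the weighting formula, the function score_terms, and the sample language model are assumptions made for illustration and do not describe how the product actually calculates weights.

```python
import math
from collections import Counter


def score_terms(training_text, language_model, total_docs,
                min_feature_count=1, excluded=()):
    """Toy weighting (an assumption, not the product's formula):
    weight = count in training text * log-rarity in the language model."""
    counts = Counter(training_text.lower().split())
    scores = {}
    for term, count in counts.items():
        # Skip terms that appear too rarely in the training text
        # or that were excluded by hand.
        if count < min_feature_count or term in excluded:
            continue
        # Terms missing from the language model get a frequency of 0,
        # which yields the highest possible rarity.
        doc_freq = language_model.get(term, 0)
        rarity = math.log((total_docs + 1) / (doc_freq + 1))
        scores[term] = count * rarity
    return scores


# Hypothetical language model: term -> number of documents containing it.
language_model = {"the": 1000, "for": 900, "by": 900,
                  "signed": 60, "contract": 40, "renewal": 25}
text = "the contract renewal for the contract signed by Zyxcorp"

default_scores = score_terms(text, language_model, total_docs=1000)
strict_scores = score_terms(text, language_model, total_docs=1000,
                            min_feature_count=2)

print(max(default_scores, key=default_scores.get))
print(max(strict_scores, key=strict_scores.get))
```

In this sketch the unknown name outranks every other term although it appears only once. Raising min_feature_count to 2 removes it, because only terms that occur at least twice in the training text remain.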

Negative Training

When training a Smart Filter, items can be added as either positive or negative items. The negative content trains the Smart Filter to detect content that should be excluded. This functionality is currently experimental and may need fine-tuning to make sure negative documents are given the right weight. If you need this in your projects, contact support.

The considerations in Improving Relevance apply even more when using Negative Training; otherwise you can quickly end up with nonsensical concepts.
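Why the weight given to negative documents matters can be shown with a minimal sketch. The function combine_training, the negative_factor parameter, and the subtraction rule are hypothetical and chosen only for illustration; the actual combination used by the product is not documented here.

```python
def combine_training(positive, negative, negative_factor=1.0):
    """Toy combination (an assumption): subtract the scaled negative
    weight of each term from its positive weight."""
    terms = set(positive) | set(negative)
    return {
        term: positive.get(term, 0.0)
        - negative_factor * negative.get(term, 0.0)
        for term in terms
    }


# Hypothetical term weights from positive and negative training items.
positive = {"contract": 6.0, "renewal": 4.0, "invoice": 2.0}
negative = {"invoice": 5.0, "newsletter": 3.0}

combined = combine_training(positive, negative, negative_factor=0.5)
for term, weight in sorted(combined.items(), key=lambda item: -item[1]):
    print(term, weight)
```

With a small negative_factor the negative items barely change the result, while a large one lets noisy negative terms dominate, which is one way to picture why the weighting may need fine-tuning.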

Max. Number of Entities

By default, 30 terms are extracted from the training content. This number can be changed in the advanced screen.
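The interaction between the entity limit and excluded terms can be sketched as follows. The function select_entities and the sample weights are hypothetical; only the default of 30 comes from the text above.

```python
def select_entities(weights, max_entities=30, excluded=()):
    """Keep the highest-weighted terms, up to max_entities.
    Illustrative only; not the product's selection logic."""
    candidates = [(term, weight) for term, weight in weights.items()
                  if term not in excluded]
    # Sort by descending weight, then alphabetically for stable output.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in candidates[:max_entities]]


# Hypothetical term weights extracted from training content.
weights = {"zyxcorp": 6.9, "contract": 6.4, "renewal": 3.7,
           "signed": 2.8, "footer": 2.5}

top3 = select_entities(weights, max_entities=3)
top3_excluded = select_entities(weights, max_entities=3,
                                excluded={"zyxcorp"})
top2_excluded = select_entities(weights, max_entities=2,
                                excluded={"zyxcorp"})
print(top3, top3_excluded, top2_excluded)
```

Excluding a term while keeping the limit at 3 pulls the next lower-weighted term into the result. Lowering the limit at the same time keeps that weaker term out.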

...