An introduction to concept search and how it relates to Squirro Smart Filters.

Concept Search

Concept search is a technology where a search engine can return documents matching a defined concept (see Wikipedia's concept search article for further background).

...

Squirro's implementation of the concept search model is the Smart Filter technology.

Smart Filter Overview

Smart Filters are trained with text documents. The training input can be individual paragraphs or entire documents, in formats ranging from plain text to PDF and Microsoft Word.

...

When a Smart Filter is used, the index is searched for all the entities that the Smart Filter was trained with. Results rank higher the more entities they match and the higher the scores of the matching entities. The strictness of a Smart Filter is controlled by the Noise Level. At 1.0 (the highest Noise Level), all results that match at least one entity are returned. At a lower level (e.g. 0.2), the matching results are ordered by relevance and low-relevance matches are eliminated from the result set. Note that the result elimination is not linear: Noise Level 0.1 is much stricter than 0.2, for example. This is easy to inspect by adjusting the Noise Level in the Squirro UI.
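The ranking and Noise Level behaviour described above can be sketched roughly as follows. This is a minimal illustration and not Squirro's implementation: the `rank_documents` helper, the entity scores, the whitespace tokenization, and in particular the quadratic mapping from Noise Level to a score cutoff are all assumptions, chosen only to show a non-linear elimination curve.

```python
def rank_documents(documents, entity_scores, noise_level):
    """Rank documents by matched entities and drop low-relevance matches.

    documents: dict mapping document id -> text
    entity_scores: dict mapping entity (lowercase term) -> positive score
    noise_level: float in (0, 1]; 1.0 keeps every document matching >= 1 entity
    """
    if not 0.0 < noise_level <= 1.0:
        raise ValueError("noise_level must be in (0, 1]")

    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        matched = [entity for entity in entity_scores if entity in words]
        if matched:
            # More matching entities and higher entity scores both raise the rank.
            scored.append((doc_id, sum(entity_scores[e] for e in matched)))

    if not scored:
        return []

    scored.sort(key=lambda item: item[1], reverse=True)
    best_score = scored[0][1]
    # Hypothetical non-linear cutoff (an assumption, not the real formula):
    # zero at Noise Level 1.0, rising quadratically as the level is lowered.
    cutoff = best_score * (1.0 - noise_level) ** 2
    return [(doc_id, score) for doc_id, score in scored if score >= cutoff]


# Hypothetical example data.
documents = {
    "a": "solar panel efficiency and wind turbine output",
    "b": "wind forecast for the weekend",
    "c": "football results",
}
entity_scores = {"solar": 3.0, "wind": 2.0, "turbine": 2.5}

all_matches = rank_documents(documents, entity_scores, noise_level=1.0)
strict_matches = rank_documents(documents, entity_scores, noise_level=0.2)
```

With this toy data, Noise Level 1.0 keeps both documents that match an entity, while 0.2 keeps only the strongest match. The real cutoff curve is best explored in the Squirro UI.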

Languages

A Smart Filter can be trained with documents in multiple languages. Squirro detects the language of each document and creates a cluster of the top entities for each language. During a query, the entities of each language are matched only against documents in that language.
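As a rough sketch of the per-language clustering idea, the following assumes the language of each document has already been detected and uses raw term frequency as a stand-in for entity scoring. The function names and the `top_n` parameter are hypothetical and do not reflect Squirro's API.

```python
from collections import Counter, defaultdict


def build_language_clusters(training_docs, top_n=2):
    """Build one cluster of top terms per language.

    training_docs: iterable of (language_code, text) pairs; the language code
    is assumed to come from an earlier detection step.
    """
    counts = defaultdict(Counter)
    for language, text in training_docs:
        counts[language].update(text.lower().split())
    return {
        language: {term for term, _ in counter.most_common(top_n)}
        for language, counter in counts.items()
    }


def matches(document_language, document_text, clusters):
    """Match a document only against the cluster for its own language."""
    cluster = clusters.get(document_language, set())
    return bool(cluster & set(document_text.lower().split()))


# Hypothetical training data in two languages.
training_docs = [
    ("en", "wind turbine wind turbine energy"),
    ("de", "windkraft anlage windkraft anlage strom"),
]
clusters = build_language_clusters(training_docs, top_n=2)
```

Here an English document mentioning "wind" matches the English cluster, but the same text tagged as German does not, because only the German entities are applied to German documents.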

Out of the box, Squirro Smart Filters support the following languages:

  • Chinese

  • Dutch

  • English

  • French

  • German

  • Italian

  • Portuguese

  • Russian

  • Spanish

However, the Smart Filter concept works for almost any language. Squirro needs to be trained once to understand the word frequencies of a new language, which is done by creating a GDFS database. In addition to the languages listed above, the following languages are supported out of the box for this process:

  • Arabic

  • Armenian

  • Basque

  • Bengali

  • Bulgarian

  • Catalan

  • Czech

  • Finnish

  • Galician

  • Hindi

  • Hungarian

  • Indonesian

  • Irish

  • Latvian

  • Lithuanian

  • Norwegian

  • Romanian

  • Sorani

  • Swedish

  • Turkish
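
The idea of learning word frequencies for a new language can be illustrated with a generic document-frequency table. The format and contents of a GDFS database are not described here, so the sketch below is only an assumption about the general principle: words that appear in most documents of a language carry little weight, while rare words carry more. The helper names are hypothetical, and the whitespace tokenization is a simplification that would not suit every language.

```python
import math
from collections import Counter


def build_document_frequencies(corpus):
    """Count, for each term, how many documents of the corpus contain it."""
    frequencies = Counter()
    for text in corpus:
        # Use a set so each term is counted at most once per document.
        frequencies.update(set(text.lower().split()))
    return frequencies, len(corpus)


def term_weight(term, frequencies, total_documents):
    """Smoothed inverse document frequency: rarer terms get higher weights."""
    return math.log((1 + total_documents) / (1 + frequencies[term])) + 1.0


# Hypothetical background corpus for one language.
corpus = ["the cat sat", "the dog ran", "the bird flew"]
frequencies, total_documents = build_document_frequencies(corpus)

common_weight = term_weight("the", frequencies, total_documents)
rare_weight = term_weight("cat", frequencies, total_documents)
```

In this toy corpus "the" occurs in every document and receives the minimum weight, whereas "cat" occurs once and is weighted higher.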