
An introduction to concept search and how it relates to Squirro Smart Filters.


Concept Search

Concept search is a technology where a search engine can return documents matching a defined concept (see Wikipedia's concept search article for further background).

As an example, compare normal full-text search with concept search. Standard full-text search typically uses some form of Boolean search to return relevant documents. A user can enter a query such as "technology company" to find any document that matches both "technology" and "company". In concept search, "technology company" would instead be defined as a concept, and the system is taught what that concept means. It then takes additional related terms into account and searches for those as well; for example, the "technology company" concept might include terms such as "IT" or "Silicon Valley". As a result, the search engine can return more and better results for this concept.
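The difference can be illustrated with a minimal Python sketch. It is not Squirro code: the function names, the sample documents, and the hand-written concept term set are all assumptions made for illustration (a real concept search system learns its terms rather than having them listed by hand, and "IT" is left out here because naive lowercase matching would confuse it with the pronoun).

```python
def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def boolean_and_match(query_terms, document):
    """Full-text style: every query term must occur in the document."""
    tokens = tokenize(document)
    return all(term in tokens for term in query_terms)


def concept_match(concept_terms, document):
    """Concept style: any term that belongs to the concept may match."""
    tokens = tokenize(document)
    return any(term in tokens for term in concept_terms)


# Hypothetical, hand-written concept definition (illustration only).
TECH_COMPANY_CONCEPT = {"technology", "company", "silicon", "software"}

docs = [
    "A technology company opened a new office.",
    "Silicon Valley startups are hiring engineers.",
    "The bakery sells fresh bread.",
]

boolean_hits = [d for d in docs if boolean_and_match(["technology", "company"], d)]
concept_hits = [d for d in docs if concept_match(TECH_COMPANY_CONCEPT, d)]
```

Here the Boolean query finds only the first document, while the concept also picks up the second document through the related term "silicon".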

Squirro's implementation of the concept search model is the Smart Filter technology.

Smart Filter Overview

Smart Filters are trained with text documents. Training documents can be paragraphs of text or entire documents. Various formats from plain text to PDF and Microsoft Word are supported.

The Smart Filter algorithm looks at the frequency of terms in the training documents and compares it with the normal frequency of these terms (see Global Document Frequencies - GDFS). If a word occurs more often in the training text than the GDFS definition would suggest, it receives a higher score.
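The scoring idea can be sketched as follows. This is an assumption-laden illustration, not the actual Smart Filter formula, which this page does not specify: the ratio used as the score, the `score_terms` function, the `default_freq` fallback for unknown terms, and the sample background frequencies are all hypothetical.

```python
from collections import Counter


def score_terms(training_docs, global_doc_freq, default_freq=0.001):
    """Score each term by how much more often it occurs in the training
    documents than the background frequency would predict.

    The ratio below is an illustrative choice, not Squirro's formula.
    """
    n_docs = len(training_docs)
    # Count in how many training documents each term appears.
    doc_counts = Counter()
    for doc in training_docs:
        doc_counts.update(set(doc.lower().split()))

    scores = {}
    for term, count in doc_counts.items():
        training_freq = count / n_docs
        # Terms missing from the background table get a small default frequency.
        background_freq = global_doc_freq.get(term, default_freq)
        scores[term] = training_freq / background_freq
    return scores


# Hypothetical background table: fraction of documents containing each term.
gdfs = {"the": 0.9, "company": 0.1, "silicon": 0.01}
training = ["the silicon company", "the company grows", "silicon chips"]
scores = score_terms(training, gdfs)
```

In this toy data, "silicon" scores higher than "company", which in turn scores higher than "the", because "silicon" is rare in the background table but common in the training text.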

When a Smart Filter is used, the index is searched using all the entities that the Smart Filter was trained with. Results are ranked higher the more entities they match and the higher the scores of the matching entities. The strictness of a Smart Filter is controlled by the Noise Level. When set to 1.0 (the highest Noise Level), all results that match at least one entity are returned. At a lower level (e.g. 0.2), the matching results are ordered by relevance and low-relevance matches are eliminated from the result set. Note that result elimination is not linear: Noise Level 0.1 is much stricter than 0.2, for example. This is easy to inspect by adjusting the Noise Level in the Squirro UI.
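The ranking and filtering behaviour can be approximated with a short sketch. The relevance formula (sum of matching entity scores) and especially the cutoff rule are placeholders chosen for illustration; the real mapping from Noise Level to eliminated results is not documented here and should be inspected in the Squirro UI. The function name `rank_results` and the sample scores are hypothetical.

```python
def rank_results(documents, entity_scores, noise_level=1.0):
    """Rank documents by matching entities and drop low-relevance matches.

    Placeholder logic: relevance is the sum of the scores of matching
    entities, and results below (1 - noise_level) * best relevance are cut.
    This is NOT Squirro's actual Noise Level mapping.
    """
    ranked = []
    for doc in documents:
        tokens = set(doc.lower().split())
        matched = [entity for entity in entity_scores if entity in tokens]
        if not matched:
            continue  # a result must match at least one entity
        relevance = sum(entity_scores[entity] for entity in matched)
        ranked.append((relevance, doc))

    # Most relevant results first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    if not ranked:
        return []

    cutoff = (1.0 - noise_level) * ranked[0][0]
    return [doc for relevance, doc in ranked if relevance >= cutoff]


# Hypothetical entity scores and documents.
entity_scores = {"silicon": 3.0, "technology": 2.0, "company": 1.0}
documents = [
    "silicon technology company",
    "technology company",
    "company picnic",
    "fresh bread",
]

all_matches = rank_results(documents, entity_scores, noise_level=1.0)
strict_matches = rank_results(documents, entity_scores, noise_level=0.2)
```

With the Noise Level at 1.0, every document matching at least one entity is returned in relevance order; at 0.2, only the strongest match survives in this toy example.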

Languages

A Smart Filter can be trained with documents in multiple languages. Squirro detects the language of each document and creates a cluster of the top entities for each language. During a query, the entities for each language are used to match only documents in the corresponding language.
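A rough sketch of per-language clusters is shown below. It assumes each document already carries a detected language code (language detection itself is out of scope), and it uses raw term counts to pick the top entities; both the helper names and this selection rule are assumptions, not Squirro's implementation.

```python
from collections import Counter, defaultdict


def build_language_clusters(training_docs, top_n=3):
    """Group the most frequent terms by language.

    training_docs is a list of (language_code, text) pairs; the language
    code is assumed to come from an earlier detection step.
    """
    counts = defaultdict(Counter)
    for lang, text in training_docs:
        counts[lang].update(text.lower().split())
    return {
        lang: [term for term, _ in counter.most_common(top_n)]
        for lang, counter in counts.items()
    }


def matches(clusters, lang, text):
    """Match a document only against the cluster of its own language."""
    tokens = set(text.lower().split())
    return any(term in tokens for term in clusters.get(lang, []))


# Hypothetical training data in two languages.
training = [("en", "bank loan bank"), ("de", "kredit bank kredit")]
clusters = build_language_clusters(training)
```

An English document containing "loan" matches through the English cluster, while the same text tagged as German does not, because "loan" is not among the German entities.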

Out of the box Squirro Smart Filters support the following languages:

  • Chinese

  • Dutch

  • English

  • French

  • German

  • Italian

  • Portuguese

  • Russian

  • Spanish

However, the Smart Filter concept works for almost any language. Squirro needs to be trained once to understand the word frequencies for a new language. This is done by creating a GDFS database. In addition to the languages listed above, the following languages are supported out of the box for this process:

  • Arabic

  • Armenian

  • Basque

  • Bengali

  • Bulgarian

  • Catalan

  • Czech

  • Finnish

  • Galician

  • Hindi

  • Hungarian

  • Indonesian

  • Irish

  • Latvian

  • Lithuanian

  • Norwegian

  • Romanian

  • Sorani

  • Swedish

This page can now be found at Smart Filters on the Squirro Docs site.