Available from Squirro 3.1.0

This section describes in detail how the Significant Terms Extractor works and how it can be used inside a Squirro project.

Extracting Significant Terms

Given a set of documents, the goal of the Significant Terms Extractor is to identify, for each document, the set of words that best describe it. It is an unsupervised technique that updates its model on every call, so the model always stays up to date.
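
As an intuition for how this works (a toy sketch, not Squirro's actual implementation), a term is "significant" for a document when it occurs noticeably more often there than in a background model built from all documents seen so far:

from collections import Counter

def significant_terms(doc_tokens, corpus_counts, corpus_total, p=0.2):
    """Toy scorer: rank terms by how much more frequent they are in the
    document than in the background corpus, and keep the top p fraction."""
    doc_counts = Counter(doc_tokens)
    doc_total = sum(doc_counts.values())
    scores = {
        term: (count / doc_total) / ((corpus_counts[term] + 1) / corpus_total)
        for term, count in doc_counts.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, int(len(ranked) * p))]

# The background model is updated with every new document, which is
# what keeps the extractor up to date without supervision.
corpus_counts = Counter(["bank", "money", "rate", "rate", "loan", "the", "the", "the"])
corpus_total = sum(corpus_counts.values())
print(significant_terms(["rate", "hike", "rate", "the"], corpus_counts, corpus_total))  # ['hike']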

The extracted terms can then be used in various use cases:

  1. Fast Look: given a group of documents, you can generate a word cloud and immediately see the core topics.

  2. Clustering: you can group documents that share significant terms and rank them by distance (see the sketch after this list).

  3. Sentiment Analysis: you can show which negative and positive terms occur in the corpus.
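
For the clustering use case, one simple approach (a sketch, not a built-in Squirro feature) is to treat the distance between two documents as the Jaccard distance between their significant-term sets:

def jaccard_distance(terms_a, terms_b):
    """0.0 = identical term sets, 1.0 = no overlap."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance(["interest", "rate", "hike"], ["rate", "hike", "inflation"]))  # 0.5 -> same cluster
print(jaccard_distance(["interest", "rate", "hike"], ["football", "league"]))         # 1.0 -> unrelated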

How to use Significant Terms in Squirro

An example workflow:

{
    "dataset": {
        "infer": {
            "count": 1000,
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 2500,
            "query_string": "*"
        }
    },
    "pipeline": [
        {
            "fields": [
                "body",
                "keywords.ml_terms_version",
                "keywords.terms"
            ],
            "batch_size": 100,
            "step": "loader",
            "type": "squirro_query"
        },
        {
            "fields": [
                "body"
            ],
            "step": "filter",
            "type": "empty"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "filter",
            "type": "clear"
        },
        {
            "add_lemmization": false,
            "cache_lemmas": false,
            "filter_list": [],
            "input_field": [
                "body"
            ],
            "max_chunk_len": 2,
            "min_word_len": 4,
            "output_field": "keywords.terms",
            "p_significant_terms": 0.2,
            "save_model": true,
            "step": "embedder",
            "type": "terms_extraction"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "saver",
            "tracking_facet_name": "ml_terms_version",
            "tracking_facet_value": "1",
            "type": "squirro_item"
        }
    ]
}

The parameters of the step are:

{
    "add_lemmization": false, # apply lemmatization to the words
    "cache_lemmas": false, # cache the lemmas to improve performance
    "filter_list": [], # list of terms you do not want to see
    "input_field": [ "body"], # where to get the input text from
    "max_chunk_len": 2, # max number of words in a chunk
    "min_word_len": 4, # min number of characters in a word
    "output_field": "terms", # where to store the output (list)
    "p_significant_terms": 0.2, # fraction of significant terms to pick
    "save_model": true, # save the generated model
    "step": "embedder",
    "type": "terms_extraction"
},
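
To make the chunking parameters concrete, the following toy sketch (again, not the actual embedder code) shows how min_word_len, max_chunk_len and filter_list constrain the candidate terms:

import re

def candidate_chunks(text, min_word_len=4, max_chunk_len=2, filter_list=()):
    """Generate candidate terms: single words and chunks of up to
    max_chunk_len consecutive words, skipping short and filtered words."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)
             if len(w) >= min_word_len and w.lower() not in filter_list]
    chunks = []
    for n in range(1, max_chunk_len + 1):
        chunks += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return chunks

print(candidate_chunks("Central bank raises interest rates again", filter_list=["again"]))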

The input must be plain text. HTML cleaning and word splitting are performed by the embedder itself; this improves parallelization and makes the step scalable.

The output is a list of significant words per document.

The lemmatization process reduces words to a base form (e.g. "better" becomes "good"). It is very useful, but it also impacts overall performance, which is why lemmas can be cached via cache_lemmas.
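
For example, with NLTK's WordNet lemmatizer (a common implementation; whether Squirro uses it internally is not documented here):

from nltk.stem import WordNetLemmatizer  # requires the WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # -> "good"
print(lemmatizer.lemmatize("rates", pos="n"))   # -> "rate"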

For this step, it is important to define both a train and an inference job.

  • Train: generates the initial model. It should therefore be given a reasonably large number of documents so that the output is meaningful from the start. It does not save the term list back on the original items.

  • Inference: loads the model, updates it with the words of the current document, produces the term list, and saves the updated model.

Please note that inference must also be applied to the training data to generate its significant term lists: in the example above, the train job reads all items ("*"), and the infer job then processes every item not yet tagged with ml_terms_version:1.
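
Once both jobs have run, the extracted terms can be inspected through the Squirro Python SDK. The sketch below assumes the standard squirro_client package; the cluster URL, token and project ID are placeholders:

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster='https://your-squirro-cluster')
client.authenticate(refresh_token='YOUR_REFRESH_TOKEN')

# Fetch items already processed by the workflow (tagged by the saver step).
res = client.query('PROJECT_ID', query='ml_terms_version:1', count=10)
for item in res['items']:
    print(item['title'], item.get('keywords', {}).get('terms'))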

Scroll timeout

For slower servers, it may be necessary to extend the scroll timeout for the queries. This is needed if the workflow reports errors containing the string No search context found for id.

This can be done by changing the dataset parameters like this:

{
    "dataset": {
        "infer": {
            "count": 1000,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 2500,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "*"
        }
    },
    …
}
