...

  1. Fast look: given a group of documents, you can generate a word cloud and immediately see what the core topics are.

  2. Clustering: you can group documents that share significant terms and rank them by distance (use cases 1 and 2 are sketched in the example after this list).

  3. Sentiment analysis: you can surface the negative and positive terms in the corpus.
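
As a quick illustration of the first two use cases, here is a minimal sketch in plain Python. It assumes the per-item significant terms (for example the keywords.terms facet populated below) are already available, builds a word-cloud frequency map, and ranks document pairs by a simple Jaccard distance. The sample data and the distance measure are illustrative assumptions, not what Squirro uses internally.

Code Block
from collections import Counter

# Hypothetical per-item significant terms, e.g. the content of the keywords.terms facet.
items = {
    "doc-1": ["interest rate", "central bank", "inflation"],
    "doc-2": ["inflation", "bond yield", "central bank"],
    "doc-3": ["merger", "acquisition"],
}

# 1. Fast look: aggregate term frequencies across the corpus for a word cloud.
frequencies = Counter(term for terms in items.values() for term in terms)
print(frequencies.most_common(3))

# 2. Clustering: rank document pairs by Jaccard distance over their term sets
#    (documents sharing more significant terms end up closer together).
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

ids = sorted(items)
pairs = {
    (left, right): jaccard_distance(items[left], items[right])
    for i, left in enumerate(ids)
    for right in ids[i + 1:]
}
for pair, distance in sorted(pairs.items(), key=lambda kv: kv[1]):
    print(pair, round(distance, 2))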

How to use Significant Terms in Squirro

Example workflow:

Code Block
languagejson
{
    "dataset": {
        "infer": {
            "count": 501000,
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 12500,
            "query_string": "*"
        }
    },
    "pipeline": [
        {
            "fields": [
                "body",
                "keywords.ml_terms_version",
                "keywords.terms"
            ],
            "batch_size": 100,
            "step": "loader",
            "type": "squirro_query"
        },
        {
            "fields": [
                "body"
            ],
            "step": "filter",
            "type": "empty"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "filter",
            "type": "clear"
        },
        {
            "add_lemmization": false,
            "cache_lemmas": false,
            "filter_list": [],
            "input_field": [
                "body"
            ],
            "max_chunk_len": 2,
            "min_word_len": 4,
            "output_field": "keywords.terms",
            "p_significant_terms": 0.2,
            "save_model": true,
            "step": "embedder",
            "type": "terms_extraction"
        },
        {
            "fields": [
                "keywords.terms"
            ],
            "step": "saver",
            "tracking_facet_name": "ml_terms_version",
            "tracking_facet_value": "1",
            "type": "squirro_item"
        }
    ]
}

The parameters of the step are:

Code Block
languagejson
{
    "add_lemmization": false,          # apply lemmatization to the words
    "cache_lemmas": false,             # cache the lemmas to improve performance
    "filter_list": [],                 # list of terms you do not want to see
    "input_field": ["body"],           # field(s) to read the input text from
    "max_chunk_len": 2,                # max number of words in a chunk
    "min_word_len": 4,                 # min number of characters in a word
    "output_field": "keywords.terms",  # field to store the output list in
    "p_significant_terms": 0.2,        # fraction of terms to keep as significant (0.2 = top 20%)
    "save_model": true,                # save the generated model
    "step": "embedder",
    "type": "terms_extraction"
},
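
To make p_significant_terms more concrete, the sketch below keeps the top 20% of a document's candidate terms, ranked by a simple frequency contrast against the rest of the corpus. The scoring function is an assumption for illustration only; the actual scoring inside the terms_extraction embedder may differ.

Code Block
from collections import Counter

def significant_terms(doc_tokens, corpus_tokens, p_significant_terms=0.2):
    """Keep the top fraction of terms whose in-document frequency
    stands out against the background corpus (illustrative scoring only)."""
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())

    def score(term):
        # Ratio of in-document rate to corpus rate: > 1 means over-represented.
        doc_rate = doc_counts[term] / doc_total
        corpus_rate = corpus_counts[term] / corpus_total
        return doc_rate / corpus_rate

    ranked = sorted(doc_counts, key=score, reverse=True)
    keep = max(1, int(len(ranked) * p_significant_terms))
    return ranked[:keep]

corpus = ["bank", "rate", "rate", "loan", "football", "goal", "bank", "loan"]
doc = ["rate", "rate", "bank", "loan"]
print(significant_terms(doc, corpus))  # ['rate']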

The input must be text. HTML cleaning and word splitting are performed by the embedder itself; this is done to improve parallelization and keep the step scalable.
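
As a rough, hypothetical approximation of what the embedder does with each item, the sketch below strips HTML, splits the text into words, drops words shorter than min_word_len, and builds chunks of up to max_chunk_len words. It is a simplified illustration, not Squirro's actual implementation; because every item is processed independently, this kind of per-item work parallelizes well.

Code Block
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def clean_and_split(body, min_word_len=4, max_chunk_len=2):
    # 1. Strip HTML tags, keep the visible text.
    extractor = _TextExtractor()
    extractor.feed(body)
    text = " ".join(extractor.parts)

    # 2. Split on non-letter characters and drop short words.
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", text) if len(w) >= min_word_len]

    # 3. Build candidate chunks of 1..max_chunk_len consecutive words.
    chunks = []
    for size in range(1, max_chunk_len + 1):
        for i in range(len(words) - size + 1):
            chunks.append(" ".join(words[i:i + size]))
    return chunks

print(clean_and_split("<p>The central bank raised interest rates.</p>"))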

...

Please note that inference must also be applied to the training data in order to generate the significant terms list.

Scroll timeout

On slower servers it may be necessary to extend the scroll timeout for the queries. This is the case if the workflow reports errors containing the string No search context found for id.

This can be done by changing the dataset parameters like this:

Code Block
languagejson
{
    "dataset": {
        "infer": {
            "count": 1000,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "NOT ml_terms_version:1"
        },
        "train": {
            "count": 2500,
            "kwargs": {
                "scroll": "30m"
            },
            "query_string": "*"
        }
    },
    …
}