Query processing improves a user’s search experience by providing more relevant search results. Squirro achieves this improvement by running the user’s query through a customizable query processing workflow that parses, filters, enriches, and expands queries before performing the actual search and presenting the search results to the user. For example, part-of-speech (POS) boosting and filtering removes irrelevant terms like conjunctions from the query and gives more weight to relevant parts of the query like nouns. Items that match boosted query terms are ranked higher in the returned search results.
Overview
The figure below illustrates the architecture of the query processing.
In the example shown in the figure, the user enters the query `country:us 2020-10 covid-19 cases in new york` in a Global Search Bar on the Squirro dashboard. The query is then sent through the Query Understanding Plugin (1) to the ML-Service, where the query processing workflow, a Squirro ML-Workflow, is executed to apply the following steps to the incoming query:
1. Language detection
2. Language-specific spaCy analysis, applied using the pre-trained spaCy language model (see example) for the detected language. The analysis includes:
   - Tokenization and lemmatization
   - Part of Speech (POS) tagging
   - Named Entity Recognition (NER)
3. Part of Speech Booster / Filter:
   - Assigns weights to tokens based on their POS tags
   - Removes conjunctions and determiners
4. Query Modifier
The final query modifier step applies all modifications to the initial query to produce the Enriched Query (2) which is then used to retrieve the candidate documents that best match the query from the Elasticsearch index (3).
Query processing improves the search experience by ranking items that match boosted terms higher and by reducing the number of irrelevant search results for the query. The latter is achieved by combining terms that belong together. Entities like “New York” are treated as a unit in the query, which prevents multipage items (e.g., PDFs) that have “new” on one page and “york” on a different page from being matched and appearing in the search results.
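The effect of keeping “New York” together can be illustrated with a small sketch. It is a simplified, in-order proximity check written for illustration only, not the matching logic Elasticsearch actually uses; the function names and the sample texts are hypothetical, and the distance of 3 mirrors the `~3` used in the example mutations of the default workflow.

```python
def matches_all_terms(tokens, terms):
    """True if every term occurs somewhere in the item, regardless of position."""
    return all(term in tokens for term in terms)


def matches_loose_phrase(tokens, first, second, slop):
    """True if `second` follows `first` with at most `slop` tokens in between.

    Simplified illustration; phrase matching in a real search engine is more involved.
    """
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(0 < j - i <= slop + 1 for i in first_positions for j in second_positions)


# Hypothetical two-page item: "new" is on page 1, "york" on page 2.
page_1 = "a new vaccine study was published today"
page_2 = "the duke of york visited the clinic"
multipage_item = (page_1 + " " + page_2).split()

# Hypothetical item that really is about New York.
relevant_item = "covid-19 cases in new york rose".split()

# Matching the two words independently would return the multipage item ...
term_match = matches_all_terms(multipage_item, ["new", "york"])
# ... while the loose phrase only matches when the words are close together.
phrase_match = matches_loose_phrase(multipage_item, "new", "york", slop=3)
relevant_match = matches_loose_phrase(relevant_item, "new", "york", slop=3)
```

Under these assumptions, the multipage item matches the individual terms but not the loose phrase, whereas the item that mentions the two words next to each other matches the phrase.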
Configuration
Starting with Squirro 3.4.5, each project will be pre-configured with a default query processing workflow. The workflow is installed on the server as a global asset and cannot be deleted via the user interface. It is enabled by default.
The behaviour of the workflow is managed in the project configuration under the SETTINGS tab where you can configure the following settings:
Name | Value | Description |
---|---|---|
| ${ | Set the value to the Remove if you want to disable query processing. |
|
| Modes for workflow execution:
|
Workflow Management
You can configure the available workflows under AI STUDIO > ML Workflows. The default query processing workflow is marked as the ACTIVE QUERY PROCESSOR and is listed along with any other custom workflows.
Hover over a workflow and click SET ACTIVE to make it the ACTIVE QUERY PROCESSOR.
Query Processing Workflow Steps
The query processing workflow consists of pre-configured libNLP pipeline steps.
This workflow is set up to boost important terms based on their POS tags. Nouns (tags `NOUN` and `PROPN`) are boosted by assigning higher weights in the `pos_weight_map`, while the impact of verbs (`VERB`), for example, is reduced by assigning a lower weight. Terms like determiners or conjunctions are removed from the query.
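The weighting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the Squirro step: the function `boost_terms` is hypothetical, the POS tags are assigned by hand instead of by spaCy, and phrase handling for merged entities is left out. The weights are the ones shown in the default configuration.

```python
# Weights as in the default configuration: nouns boosted, verbs down-weighted.
POS_WEIGHT_MAP = {"PROPN": 10, "NOUN": 10, "ADJ": 5, "VERB": 2}


def boost_terms(tagged_terms, pos_weight_map, strict_filter=True):
    """Return (term, replacement) pairs for a list of (term, POS tag) pairs.

    Hypothetical helper: terms whose tag is in the map get a `^weight` boost;
    other terms are removed when `strict_filter` is set, else kept unchanged.
    """
    mutations = []
    for term, pos in tagged_terms:
        weight = pos_weight_map.get(pos)
        if weight is None:
            mutations.append((term, "" if strict_filter else term))
        else:
            mutations.append((term, f"{term}^{weight}"))
    return mutations


# Hand-tagged example query (the tags are an assumption, not spaCy output).
tagged = [
    ("show", "VERB"),
    ("the", "DET"),
    ("cases", "NOUN"),
    ("and", "CCONJ"),
    ("deaths", "NOUN"),
]
mutations = boost_terms(tagged, POS_WEIGHT_MAP)
boosted_query = " ".join(new for _, new in mutations if new)
print(boosted_query)  # show^2 cases^10 deaths^10
```

The determiner and the conjunction disappear because their tags are not in the map, the nouns receive the highest boost, and the verb stays in the query with a lower weight.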
You can configure the steps of the query processing workflow in the UI in the ML Workflows plugin under the AI STUDIO tab.
Step | Description | Examples |
---|---|---|
syntax_parser | Custom step. Parses the raw query string into terms and filters. Terms are modified in the query processing workflow; filters (like facet filters) are not. Parameters: the input query; the output fields (output_fields) that receive the raw query string parsed into terms and filters. | Configuration: { "step": "custom", "type": "parse", "name": "syntax_parser", "output_fields": [ "user_terms", "facet_filters" ] } |
lang_detection | Custom step. Runs language detection on the query and annotates the query item with the detected language (facet). Parameters: the input query (input_field); the detected language as an ISO code; the default language to use. | Configuration: { "step": "custom", "type": "analysis", "name": "lang_detection", "input_field": "user_terms_str" } |
spaCy normalizer | Normalizer step of type spacy. Loads the corresponding language model and runs the configured spaCy pipeline components on the text. The output of the step is an analyzed spaCy document stored under the specified output_fields. Parameters: input_fields: input fields on which the normalizer is applied. output_fields: output fields to save the analyzed spaCy document. spacy_model_mapping: model to use per language code. infix_split_hyphen: set to false to avoid splitting tokens on intra-word hyphens, for example, "covid-19". merge_entities: recognize and merge named entities into one spaCy token, for example, "new york". merge_noun_chunks: merge relevant chunks into one spaCy token. cacheable: cache the selected models. | Configuration: { "step": "normalizer", "type": "spacy", "cacheable": true, "infix_split_hyphen": false, "merge_entities": true, "merge_noun_chunks": false, "input_fields": [ "user_terms_str" ], "output_fields": [ "nlp" ], "exclude_spacy_pipes": [], "spacy_model_mapping": { "en": "en_core_web_sm", "de": "de_core_news_sm" } } Annotation/Output: the analyzed spaCy document is stored under the specified output field. |
pos_booster | Custom step. Annotates the document with term mutations (pos_mutations). Parameters: pos_weight_map: dictionary mapping spaCy POS tags to the weights used for term boosting. strict_filter: remove terms with a POS tag not included in pos_weight_map. phrase_proximity_distance: merged spaCy tokens are converted into loose phrases. Output: map of term => replacement. The default language to use can also be set. | Configuration: { "step": "custom", "type": "enrich", "name": "pos_booster", "strict_filter": true, "phrase_proximity_distance": 3, "analyzed_input_field": "nlp", "pos_weight_map": { "PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5, "X": -, "NUM": -, "SYM": - } } Annotation/Output: "pos_mutations": [ {"covid-19": "\"covid-19\"~3"}, # `covid-19` must be matched as a phrase because it contains a hyphen {"cases": "cases^10"}, # `cases` is boosted by a factor of 10 {"in": ""}, # `in` is removed because ADP is not defined in the `pos_weight_map` {"new york": "\"new york\"~3"} # `new york` must be matched as a phrase because it is a merged entity ], |
query_modifier | Custom step. The query modifier applies all mutations collected from prior steps to the initial query. Parameters: raw_input_field: raw user query to modify. term_mutations_metadata: mutations applied; order matters. output_field: the modified query string. | Configuration: { "step": "custom", "type": "enrich", "name": "query_modifier", "raw_input_field": "query", "term_mutations_metadata": ["pos_mutations"], "output_field": "enriched_query" } |
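How the final step could combine the collected mutations with the raw query is sketched below. This is an illustration under stated assumptions rather than the actual query modifier: `apply_mutations` is a hypothetical helper, mutations are applied in list order by replacing whole whitespace-delimited terms, and parts of the query without a mutation (here the `country:us` filter and `2020-10`) are left untouched. The mutation list is the `pos_mutations` example from the table.

```python
import re


def apply_mutations(raw_query, mutations):
    """Apply term => replacement mutations to a query string, in order.

    Hypothetical helper: a term is only replaced where it appears as a whole
    whitespace-delimited unit, so "in" does not match inside other words.
    """
    query = raw_query
    for mutation in mutations:
        for term, replacement in mutation.items():
            pattern = r"(?<!\S)" + re.escape(term) + r"(?!\S)"
            # A function replacement avoids backslash handling in the new text.
            query = re.sub(pattern, lambda _m, r=replacement: r, query)
    # Collapse the gaps left behind by removed terms.
    return " ".join(query.split())


raw_query = "country:us 2020-10 covid-19 cases in new york"
pos_mutations = [
    {"covid-19": '"covid-19"~3'},
    {"cases": "cases^10"},
    {"in": ""},
    {"new york": '"new york"~3'},
]

enriched_query = apply_mutations(raw_query, pos_mutations)
print(enriched_query)
# country:us 2020-10 "covid-19"~3 cases^10 "new york"~3
```

The printed string is the output of this sketch only; the enriched query produced by the Squirro workflow may differ, for example in how filters and dates are handled.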
How-to Guides
How-to Customize and Upload libNLP Workflow for Query Processing