...
The figure below illustrates how query processing fits into the overall architecture.
...
In the example shown in the figure, the user enters the query `country:us 2020-10 covid-19 cases in new york` in a Global Search Bar on the Squirro dashboard. The query is then sent through the Query Understanding Plugin (1) to the ML-Service, where the query processing workflow, a Squirro ML-Workflow, is executed to apply the following steps to the incoming query:
...
The final query modifier step applies all modifications to the initial query to produce the Enriched Query (2) which is then used to retrieve the candidate documents that best match the query from the Elasticsearch index (3).
Query processing and rewriting improve the search experience by ranking items that match boosted terms higher and by reducing the number of irrelevant search results for the query. The latter is achieved by combining terms that belong together. Entities like “New York” are treated as a unit in the query, which prevents multipage items (e.g., PDFs) that have “new” on one page and “york” on a different page from being matched and appearing in the search results.
...
Name | Value | Description |
---|---|---|
| ${ | Set the value to the Remove if you want to disable query processing. |
|
| Modes for workflow execution:
|
Workflow Management
...
the global search bar is not used but query processing is still needed
...
This mode should not be used for Squirro dashboards with many widgets, because each widget would trigger the same workflow in parallel.
Workflow Management
You can configure the available workflows under AI STUDIO > ML Workflows.
Every project is equipped with a default query-processing workflow. This default workflow is read-only and cannot be deleted or modified. It is managed by the Machine-Learning (ML) Service and is automatically updated to the latest version.
...
Query Processing Workflow Steps
The default query processing workflow uses the following built-in, pre-configured libNLP steps (since 3.6.1, native nlp-app steps → `app.query_processing`).
Expand | |||||
---|---|---|---|---|---|
| |||||
|
This workflow is set up to boost important terms based on their POS tags. Nouns (tags `NOUN` and `PROPN`) are boosted by assigning higher weights in the `pos_weight_map`, while the impact of verbs (`VERB`), for example, is reduced by assigning a lower weight. Terms like determiners or conjunctions are removed from the query.
You can configure the steps of the query processing workflow in the UI in the ML Workflows plugin under the AI STUDIO tab.
...
Step
...
Description
...
Examples
...
custom.syntax_parser
Custom `parse` step named `syntax_parser`.
Parses the raw query string into terms and filters. Terms are modified in the query processing workflow; filters (such as facet filters) are not.
Parameters
type (str): `parse`
input_field (str,"query")
Input query.
...
|
...
|
...
|
...
Raw query string parsed into terms and filters.
Configuration
...
language | json |
---|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
...
|
...
Annotation/Output
Example query: country:us 2020-10 covid-19 cases in new york
filters:
"facet_filters": ["country:us"]
user query terms:
"user_terms": ["2020-10", "covid-19", "cases", "in", "new", "york"]
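As a rough illustration of this split, the following Python sketch separates facet filters from free-text terms. It is not the actual `syntax_parser` implementation: the function name `split_query` and the rule that any whitespace-separated token of the form `name:value` counts as a facet filter are assumptions made for this example, and the real Squirro query syntax covers more cases.

```python
import re

# Assumption: a facet filter is a single token shaped like `name:value`.
FACET_FILTER = re.compile(r"^\w+:\S+$")


def split_query(query):
    """Split a raw query string into facet filters and user terms."""
    facet_filters, user_terms = [], []
    for token in query.split():
        if FACET_FILTER.match(token):
            facet_filters.append(token)
        else:
            user_terms.append(token)
    return {
        "facet_filters": facet_filters,
        "user_terms": user_terms,
        # Joined terms, in the shape later steps read as `user_terms_str`.
        "user_terms_str": " ".join(user_terms),
    }


parsed = split_query("country:us 2020-10 covid-19 cases in new york")
print(parsed["facet_filters"])
print(parsed["user_terms"])
```

For the example query, the sketch yields the same `facet_filters` and `user_terms` values as listed above.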
...
custom.lang_detection
...
Custom `analysis` step named `lang_detection`.
Annotates the query item with the detected language (facet).
Parameters
type (str): `analysis`
input_field (str,"query")
Input query.
output_field (str,"language")
Detected language as ISO code
fallback_language (str, "en")
Default language to use.
...
Configuration
```json
{
  "step": "custom",
  "type": "analysis",
  "name": "lang_detection",
  "input_field": "user_terms_str"
}
```
Annotation/Output
Example query: country:us 2020-10 covid-19 cases in new york
Input: "user_terms_str": "2020-10 covid-19 cases in new york"
Annotation: "language": "en"
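The role of `fallback_language` can be illustrated with a toy detector. The sketch below uses a simple stopword-overlap heuristic, which is an assumption chosen for brevity and not the detection method used by the actual `lang_detection` step. The names `STOPWORDS`, `detect_language`, and `annotate_language` are hypothetical.

```python
# Toy stopword lists; a real detector would use a trained model.
STOPWORDS = {
    "en": {"the", "of", "and", "in", "for", "is"},
    "de": {"der", "die", "das", "und", "ist", "für"},
}


def detect_language(text, fallback_language="en"):
    """Return the language whose stopwords occur most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # No evidence at all, or a tie: use the configured fallback.
    if best == 0 or len(winners) != 1:
        return fallback_language
    return winners[0]


def annotate_language(item, input_field="user_terms_str",
                      output_field="language", fallback_language="en"):
    """Write the detected language into the item under `output_field`."""
    text = item.get(input_field, "")
    item[output_field] = detect_language(text, fallback_language)
    return item


item = annotate_language({"user_terms_str": "2020-10 covid-19 cases in new york"})
print(item["language"])
```

Short keyword queries often contain too little text for reliable detection, which is where a fallback value becomes relevant.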
...
normalizer.spacy
...
Normalizer step of type `spacy`.
Loads the corresponding language model and runs the configured spaCy pipeline components on the text.
The output of the step is an analyzed spaCy document that is stored under the specified `output_fields` and contains the tokens, POS tags, and NER tags.
POS tags:
ADJ: adjectives
ADP: adpositions (prepositions and postpositions)
ADV: adverbs
CONJ: conjunctions
DET: determiners
INTJ: interjections
NOUN: nouns
NUM: numerals
PART: particles
PRON: pronouns
PROPN: proper nouns
PUNCT: punctuation
SPACE: spaces
SYM: symbols
VERB: verbs (all tenses and modes)
X: other (foreign words, typos, abbreviations)
See https://github.com/explosion/spaCy/blob/master/spacy/glossary.py#L319-L352
Parameters
type (str): `spacy`
input_fields (list)
Input fields on which the normalizer is applied.
output_fields (list)
Output fields to save the analyzed spaCy document.
spacy_model_mapping (dict)
Model to use per language-code.
infix_split_hyphen (bool)
Don't split tokens on intra-word hyphens, for example "covid-19".
merge_entities (bool)
Recognize and merge named entities into one spaCy token, for example "new york".
merge_noun_chunks (bool)
Merge relevant noun chunks into one spaCy token, for example:
extend brexit deadline → extend "brexit deadline"
cacheable (bool)
Cache the selected models.
...
Configuration
```json
{
  "step": "normalizer",
  "type": "spacy",
  "cacheable": true,
  "infix_split_hyphen": false,
  "merge_entities": true,
  "merge_noun_chunks": false,
  "input_fields": [
    "user_terms_str"
  ],
  "output_fields": [
    "nlp"
  ],
  "exclude_spacy_pipes": [],
  "spacy_model_mapping": {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm"
  }
}
```
Annotation/Output
Example query: country:us 2020-10 covid-19 cases in new york
Input: "user_terms_str": "2020-10 covid-19 cases in new york"
Annotations:
Tokenisation:
["2020-10", "covid-19", "cases", "in", "new york"]
POS tagging:
[["2020-10","NUM"],["covid-19","NOUN"],["cases","NOUN"],["in","ADP"],["new york","PROPN"]]
Named Entity recognition :
[('2020-10', 'DATE'), ('covid-19', 'ORDINAL'), ('new york', 'GPE')]
The analyzed spaCy document is stored in the `nlp` field, as configured in `output_fields`, for use in subsequent steps.
...
custom.pos_booster
...
Custom `enrich` step named `pos_booster`.
Annotates the document with a `pos_mutations` dictionary that contains the boosting weights for each token.
The weight is chosen based on the term's POS tag (as defined in `pos_weight_map`).
Parameters
type (str, "enrich"): `enrich`
pos_weight_map (dict, {"PROPN":10,"NOUN":10,"VERB":2,"ADJ":5,"X":"-","NUM":"-","SYM":"-"})
Dictionary mapping spaCy POS tags to the weights used for term boosting.
Boost term relevancy: A higher number boosts the relevancy of matched terms.
Skip term boosting: Leave tokens of a given POS type unchanged by setting the corresponding weight to `-`.
strict_filter (bool, False)
Remove terms with a POS tag not included in pos_weight_map
phrase_proximity_distance (int, 15)
Merged spaCy tokens are converted into loose phrases:
All terms need to be matched within close proximity to each other, for example "brexit deadline"~15.
output_field (str, "pos_mutations")
Map of term => replacement
fallback_language (str, "en")
Default language to use
...
Configuration
```json
{
  "step": "custom",
  "type": "enrich",
  "name": "pos_booster",
  "strict_filter": true,
  "phrase_proximity_distance": 3,
  "analyzed_input_field": "nlp",
  "pos_weight_map": {
    "PROPN": 10,
    "NOUN": 10,
    "VERB": 2,
    "ADJ": 5,
    "X": "-",
    "NUM": "-",
    "SYM": "-"
  }
}
```
Annotation/Output
Example query: country:us 2020-10 covid-19 cases in new york
Annotations:
```
"pos_mutations": [
  {"covid-19": "\"covid-19\"~3"},  # `covid-19` must be matched as a phrase because it contains a hyphen
  {"cases": "cases^10"},  # `cases` is boosted by a factor of 10
  {"in": ""},  # `in` is removed because ADP is not defined in the `pos_weight_map`
  {"new york": "\"new york\"~3"}  # `new york` must be matched as a phrase because it is a merged entity
],
```
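The mapping from POS tags to mutations can be sketched in a few lines of Python. The rules below are inferred from the parameter descriptions and the example annotations for this step; they are an assumption about the behavior, not the actual `pos_booster` code, and `build_pos_mutations` is a hypothetical name. The input is the list of (token, POS tag) pairs from the `normalizer.spacy` example.

```python
def build_pos_mutations(tagged_terms, pos_weight_map,
                        strict_filter=True, phrase_proximity_distance=3):
    """Derive term => replacement mutations from (term, POS tag) pairs."""
    mutations = []
    for term, pos in tagged_terms:
        if pos not in pos_weight_map:
            # With strict_filter, terms with unlisted POS tags are removed.
            if strict_filter:
                mutations.append({term: ""})
            continue
        weight = pos_weight_map[pos]
        if weight == "-":
            # "-" means: leave terms with this POS tag untouched.
            continue
        if " " in term or "-" in term:
            # Merged or hyphenated tokens become loose phrase queries.
            mutations.append({term: f'"{term}"~{phrase_proximity_distance}'})
        else:
            mutations.append({term: f"{term}^{weight}"})
    return mutations


tagged = [["2020-10", "NUM"], ["covid-19", "NOUN"], ["cases", "NOUN"],
          ["in", "ADP"], ["new york", "PROPN"]]
weights = {"PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5,
           "X": "-", "NUM": "-", "SYM": "-"}
for mutation in build_pos_mutations(tagged, weights):
    print(mutation)
```

Under these assumed rules, `2020-10` produces no mutation because `NUM` is mapped to `-`, which matches its absence from the example annotations.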
...
custom.query_modifier
...
Custom `enrich` step named `query_modifier`.
The query modifier applies all mutations collected from prior steps to the initial query and outputs the `enriched_query`.
Parameters
type (str, "enrich"):
Mutates
raw_input_field (str, "query")
Raw user query to modify.
term_mutations_metadata (list,["pos_mutations"])
Mutations applied, order matters.
output_field (str, "enriched_query")
The modified query string.
Configuration
```json
{
  "step": "custom",
  "type": "enrich",
  "name": "query_modifier",
  "raw_input_field": "query",
  "term_mutations_metadata": ["pos_mutations"],
  "output_field": "enriched_query"
}
```
Annotation/Output
Example query: country:us 2020-10 covid-19 cases in new york
Output:
...
|
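To make the query modifier step more concrete, the following Python sketch applies a list of term mutations to a raw query string. It is a simplified stand-in under stated assumptions, not the actual `query_modifier` implementation: the function name `apply_mutations` is hypothetical, terms are matched as whole whitespace-delimited units, and the string it produces is the sketch's own result, which may differ from what the real step emits.

```python
import re


def apply_mutations(query, term_mutations):
    """Apply term => replacement mutations to a query string, in order."""
    for mutation in term_mutations:
        for term, replacement in mutation.items():
            # Match the term only as a whole whitespace-delimited unit,
            # so facet filters such as `country:us` stay untouched.
            pattern = r"(?<!\S)" + re.escape(term) + r"(?!\S)"
            query = re.sub(pattern, lambda _m, r=replacement: r, query)
    # Removed terms leave double spaces behind; collapse them.
    return " ".join(query.split())


pos_mutations = [
    {"covid-19": '"covid-19"~3'},
    {"cases": "cases^10"},
    {"in": ""},
    {"new york": '"new york"~3'},
]
enriched = apply_mutations(
    "country:us 2020-10 covid-19 cases in new york", pos_mutations
)
print(enriched)
```

Because the mutations are applied sequentially, their order matters, as noted for `term_mutations_metadata`.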
The workflow is set up to:
Parse Squirro Query Syntax and detect the query language based on the available natural language terms.
Perform Named Entity Recognition. Each entity compound is then rewritten into an additional phrase query ('cases in new york' is rewritten as 'cases in (new york OR "new york"~10)').
Boost important terms based on their POS tags. Nouns (tags `NOUN` and `PROPN`) are boosted by assigning higher weights in the `pos_weight_map`, while the impact of verbs (`VERB`), for example, is reduced by assigning a lower weight. Terms like determiners or conjunctions are removed from the query.
Perform query classification: `question_or_statement` vs `keyword`.
You can configure the steps of the query processing workflow in the UI in the ML Workflows plugin under the AI STUDIO tab.
How-to Guides
How-to Customize and Upload libNLP Workflow for Query Processing
How-to Install a SpaCy Language Model
...