**custom.syntax_parser**

Custom parse step named `syntax_parser`. Parses the raw query string into terms and filters. Terms are modified in the query processing workflow; filters (such as facet filters) are not.

**Parameters**

- `type` (str): `parse`
- `input_field` (str, `"query"`): Input query.
- `output_fields` (list, `["user_terms", "user_terms_str", "facet_filters", "query_length"]`): Raw query string parsed into terms and filters.

**Configuration**

```json
{
    "step": "custom",
    "type": "parse",
    "name": "syntax_parser",
    "output_fields": [
        "user_terms",
        "facet_filters"
    ]
}
```

**Annotation/Output**

Example query: `country:us 2020-10 covid-19 cases in new york`

- Filters: `"facet_filters": ["country:us"]`
- User query terms: `"user_terms": ["2020-10", "covid-19", "cases", "in", "new", "york"]`
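To illustrate the parsing behavior, here is a minimal Python sketch. It is not the actual implementation, only an assumption of how such a parser could work: tokens containing a colon (e.g. `country:us`) become facet filters, everything else becomes a user term.

```python
def parse_query(query: str) -> dict:
    """Illustrative sketch: split a raw query string into facet filters
    (key:value tokens) and plain search terms."""
    facet_filters = []
    user_terms = []
    for token in query.split():
        # A token with a colon, such as "country:us", is treated as a facet filter.
        if ":" in token:
            facet_filters.append(token)
        else:
            user_terms.append(token)
    return {
        "user_terms": user_terms,
        "user_terms_str": " ".join(user_terms),
        "facet_filters": facet_filters,
        "query_length": len(user_terms),
    }

result = parse_query("country:us 2020-10 covid-19 cases in new york")
```

Running this on the example query reproduces the annotations shown above: one facet filter and six user terms.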
**custom.lang_detection**

Custom analysis step named `lang_detection`. Runs language detection on the query and annotates the query item with the detected language (facet).

**Parameters**

- `type` (str): `analysis`
- `input_field` (str, `"query"`): Input query.
- `output_field` (str, `"language"`): Detected language as ISO code.
- `fallback_language` (str, `"en"`): Default language to use.

**Configuration**

```json
{
    "step": "custom",
    "type": "analysis",
    "name": "lang_detection",
    "input_field": "user_terms_str"
}
```

**Annotation/Output**

Example query: `country:us 2020-10 covid-19 cases in new york`

- Input: `"user_terms_str": "2020-10 covid-19 cases in new york"`
- Annotation: `"language": "en"`
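Language detection itself is typically delegated to a dedicated library. The toy sketch below only illustrates the detect-with-fallback contract of this step; the stopword heuristic and word lists are purely hypothetical, not part of the product.

```python
# Hypothetical stopword lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "in", "of", "and"},
    "de": {"der", "die", "und", "das"},
}

def detect_language(text: str, fallback_language: str = "en") -> str:
    """Guess the language by counting stopword overlaps; fall back to the
    configured default when no language matches. Illustrative sketch."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No overlap at all: use the fallback_language parameter.
    return best if scores[best] > 0 else fallback_language

lang = detect_language("2020-10 covid-19 cases in new york")
```

The example query matches the English stopword "in", so the annotation becomes `"language": "en"`; a query with no recognizable words falls back to the default.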
**normalizer.spacy**

Normalizer step of type `spacy`. Loads the corresponding language model and runs the configured spaCy pipeline components on the text. The output of the step is an analyzed spaCy document, stored under the configured `output_fields`, which contains the tokens, POS tags and NER tags.

Supported POS tags:

- `ADJ`: adjectives
- `ADP`: adpositions (prepositions and postpositions)
- `ADV`: adverbs
- `CONJ`: conjunctions
- `DET`: determiners
- `INTJ`: interjections
- `NOUN`: nouns
- `NUM`: numerals
- `PART`: particles
- `PRON`: pronouns
- `PROPN`: proper nouns
- `PUNCT`: punctuation
- `SPACE`: spaces
- `SYM`: symbols
- `VERB`: verbs (all tenses and modes)
- `X`: other (foreign words, typos, abbreviations)

**Parameters**

- `type` (str): `spacy`
- `input_fields` (list): Input fields on which the normalizer is applied.
- `output_fields` (list): Output fields under which the analyzed spaCy document is saved.
- `spacy_model_mapping` (dict): Model to use per language code.
- `infix_split_hyphen` (bool): Whether to split tokens on intra-word hyphens; set to `false` to keep tokens such as "covid-19" whole.
- `merge_entities` (bool): Recognize and merge named entities into one spaCy token, for example "new york".
- `merge_noun_chunks` (bool): Merge relevant noun chunks into one spaCy token, for example: extend brexit deadline → extend "brexit deadline".
- `cacheable` (bool): Cache the loaded models.

**Configuration**

```json
{
    "step": "normalizer",
    "type": "spacy",
    "cacheable": true,
    "infix_split_hyphen": false,
    "merge_entities": true,
    "merge_noun_chunks": false,
    "input_fields": [
        "user_terms_str"
    ],
    "output_fields": [
        "nlp"
    ],
    "exclude_spacy_pipes": [],
    "spacy_model_mapping": {
        "en": "en_core_web_sm",
        "de": "de_core_news_sm"
    }
}
```

**Annotation/Output**

Example query: `country:us 2020-10 covid-19 cases in new york`

- Input: `"user_terms_str": "2020-10 covid-19 cases in new york"`
- Tokenization: `["2020-10", "covid-19", "cases", "in", "new york"]`
- POS tagging: `[["2020-10", "NUM"], ["covid-19", "NOUN"], ["cases", "NOUN"], ["in", "ADP"], ["new york", "PROPN"]]`
- Named entity recognition: `[("2020-10", "DATE"), ("covid-19", "ORDINAL"), ("new york", "GPE")]`

The analyzed spaCy document is stored under `output_fields` as the `nlp` field for further use in succeeding steps.
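The effect of `merge_entities` can be illustrated without loading a spaCy model. The sketch below is not spaCy's actual API; it only shows the idea of collapsing tokens covered by a recognized entity span into a single token.

```python
def merge_entity_tokens(tokens: list, entity_spans: list) -> list:
    """Collapse tokens covered by an entity span (start, end, end-exclusive)
    into one token, mimicking what merge_entities does inside spaCy."""
    spans = {start: end for start, end in entity_spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in spans:
            # Join all tokens of the entity into a single token.
            end = spans[i]
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["2020-10", "covid-19", "cases", "in", "new", "york"]
# Assume "new york" (token indices 4-6) was recognized as a GPE entity.
merged = merge_entity_tokens(tokens, [(4, 6)])
```

With the entity span applied, "new" and "york" become the single token "new york", matching the tokenization shown in the example above.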
**custom.pos_booster**

Custom enrich step named `pos_booster`. Annotates the document with a `pos_mutations` dictionary that contains the boosting weight for each token. The weight is chosen based on the term's POS tag (as defined in `pos_weight_map`).

**Parameters**

- `type` (str): `enrich`
- `pos_weight_map` (dict, `{"PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5, "X": "-", "NUM": "-", "SYM": "-"}`): Dictionary mapping between spaCy POS tag and the weight used for term boosting.
- `strict_filter` (bool, `False`): Remove terms with a POS tag not included in `pos_weight_map`.
- `phrase_proximity_distance` (int, `15`): Proximity distance used when merged spaCy tokens are converted into loose phrases.
- `output_field` (str, `"pos_mutations"`): Map of term → replacement.
- `fallback_language` (str, `"en"`): Default language to use.

**Configuration**

```json
{
    "step": "custom",
    "type": "enrich",
    "name": "pos_booster",
    "strict_filter": true,
    "phrase_proximity_distance": 3,
    "analyzed_input_field": "nlp",
    "pos_weight_map": {
        "PROPN": 10,
        "NOUN": 10,
        "VERB": 2,
        "ADJ": 5,
        "X": "-",
        "NUM": "-",
        "SYM": "-"
    }
}
```

**Annotation/Output**

Example query: `country:us 2020-10 covid-19 cases in new york`

Annotations:

```
"pos_mutations": [
    {"covid-19": "\"covid-19\"~3"},  # matched as a phrase; contains a hyphen
    {"cases": "cases^10"},           # boosted by a factor of 10
    {"in": ""},                      # removed; ADP is not defined in pos_weight_map
    {"new york": "\"new york\"~3"}   # matched as a phrase; merged entity
]
```
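The mutation logic can be sketched as follows. The helper below and its argument names are illustrative assumptions, not the step's actual implementation; it reproduces the annotations of the example above.

```python
def build_pos_mutations(tagged_terms, pos_weight_map, strict_filter=True,
                        phrase_proximity_distance=3, merged_terms=()):
    """Derive a query mutation for each (term, pos_tag) pair.
    Hyphenated or merged multi-word terms become loose phrases;
    boostable POS tags get a ^weight suffix; unmapped tags are
    dropped when strict_filter is set. Illustrative sketch only."""
    mutations = []
    for term, tag in tagged_terms:
        if "-" in term or term in merged_terms:
            # Match as a loose phrase, e.g. "covid-19"~3
            mutations.append({term: f'"{term}"~{phrase_proximity_distance}'})
        elif tag not in pos_weight_map:
            if strict_filter:
                mutations.append({term: ""})  # remove the term entirely
        else:
            weight = pos_weight_map[tag]
            if weight == "-":
                mutations.append({term: term})  # keep the term, no boost
            else:
                mutations.append({term: f"{term}^{weight}"})
    return mutations

weights = {"PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5,
           "X": "-", "NUM": "-", "SYM": "-"}
tagged = [("2020-10", "NUM"), ("covid-19", "NOUN"), ("cases", "NOUN"),
          ("in", "ADP"), ("new york", "PROPN")]
mutations = build_pos_mutations(tagged, weights, merged_terms={"new york"})
```

Here "cases" (NOUN) receives a tenfold boost, "in" (ADP, not in the map) is removed by the strict filter, and the hyphenated and merged terms become loose phrases with proximity 3.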
**custom.query_modifier**

Custom enrich step named `query_modifier`. The query modifier applies all mutations collected in prior steps to the initial query and outputs the `enriched_query`.

**Parameters**

- `type` (str): `enrich`
- `raw_input_field` (str, `"query"`): Raw user query to modify.
- `term_mutations_metadata` (list, `["pos_mutations"]`): Mutations to apply; order matters.
- `output_field` (str, `"enriched_query"`): The modified query string.

**Configuration**

```json
{
    "step": "custom",
    "type": "enrich",
    "name": "query_modifier",
    "raw_input_field": "query",
    "term_mutations_metadata": ["pos_mutations"],
    "output_field": "enriched_query"
}
```

**Annotation/Output**

Example query: `country:us 2020-10 covid-19 cases in new york`

Output: `"enriched_query": "country:us \"2020-10\"~3 \"covid-19\"~3 cases^10 \"new york\"~3"`
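Applying the collected mutations to the raw query can be sketched with plain string replacement. This is an illustrative assumption; a real implementation would need to guard against terms that appear as substrings of other tokens.

```python
def apply_mutations(query: str, mutations: list) -> str:
    """Replace each term in the raw query with its mutation; an empty
    replacement drops the term. Illustrative sketch using naive
    string replacement."""
    enriched = query
    for mutation in mutations:
        for term, replacement in mutation.items():
            enriched = enriched.replace(term, replacement)
    # Collapse the double spaces left behind by removed terms.
    return " ".join(enriched.split())

mutations = [
    {"2020-10": '"2020-10"~3'},
    {"covid-19": '"covid-19"~3'},
    {"cases": "cases^10"},
    {"in": ""},
    {"new york": '"new york"~3'},
]
enriched = apply_mutations("country:us 2020-10 covid-19 cases in new york",
                           mutations)
```

The facet filter `country:us` passes through untouched, and the result matches the `enriched_query` shown in the example above.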