
Each step below is listed with a description, its parameters, an example configuration, and the resulting annotation/output for the example query.

custom.syntax_parser

Custom parse step named syntax_parser.

Parses the raw query string into terms and filters. Terms are modified in the query processing workflow; filters (such as facet filters) are not.

Parameters

type (str): `parse`

input_field (str, "query")

Input query.

output_fields (list, ["user_terms", "user_terms_str", "facet_filters", "query_length"])

Raw query string parsed into terms and filters.

Configuration

{
  "step": "custom",
  "type": "parse",
  "name": "syntax_parser",
  "output_fields": [
    "user_terms",
    "facet_filters"
  ]
}

Annotation/Output

Example query: country:us 2020-10 covid-19 cases in new york

  • filters: "facet_filters": ["country:us"]

  • user query terms: "user_terms": ["2020-10", "covid-19", "cases", "in", "new", "york"]
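
To make the parsing behaviour concrete, here is a minimal Python sketch of what such a parse step conceptually does. It is not the libNLP implementation; the facet-filter pattern and the assumption that query_length counts the user terms are illustrative only.

import re
from typing import Dict, List

# Assumption: facet filters look like "facet:value" (e.g. "country:us").
FACET_FILTER = re.compile(r"^\w+:\S+$")

def parse_query(query: str) -> Dict[str, object]:
    """Split a raw query string into facet filters and user terms."""
    tokens = query.split()
    facet_filters: List[str] = [t for t in tokens if FACET_FILTER.match(t)]
    user_terms: List[str] = [t for t in tokens if not FACET_FILTER.match(t)]
    return {
        "user_terms": user_terms,
        "user_terms_str": " ".join(user_terms),
        "facet_filters": facet_filters,
        "query_length": len(user_terms),   # assumption: number of user terms
    }

print(parse_query("country:us 2020-10 covid-19 cases in new york"))
# {'user_terms': ['2020-10', 'covid-19', 'cases', 'in', 'new', 'york'],
#  'user_terms_str': '2020-10 covid-19 cases in new york',
#  'facet_filters': ['country:us'], 'query_length': 6}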

custom.lang_detection

Custom analysis step named lang_detection.

Runs language detection on the query and annotates the query item with the detected language (facet).

Parameters

type (str): `analysis`

input_field (str, "query")

Input query.

output_field (str, "language")

Detected language as an ISO code.

fallback_language (str, "en")

Default language to use.

Configuration

{
  "step": "custom",
  "type": "analysis",
  "name": "lang_detection",
  "input_field": "user_terms_str"
}

Annotation/Output

Example query: country:us 2020-10 covid-19 cases in new york

Input: "user_terms_str": "2020-10 covid-19 cases in new york"

Annotation: "language": "en"

normalizer.spacy

Normalizer step of type spacy.

Loads the corresponding language model and runs the configured spaCy pipeline components on the text.

The output of the step is an analyzed spaCy document that is stored under the specified output_fields and contains the tokens, POS tags, and NER tags.

Known POS Tags

  • ADJ: adjectives
  • ADP: adpositions (prepositions and postpositions)
  • ADV: adverbs
  • CONJ: conjunctions
  • DET: determiners
  • INTJ: interjections
  • NOUN: nouns
  • NUM: numerals
  • PART: particles
  • PRON: pronouns
  • PROPN: proper nouns
  • PUNCT: punctuation
  • SPACE: spaces
  • SYM: symbols
  • VERB: verbs (all tenses and modes)
  • X: other (foreign words, typos, abbreviations)

Available NER Tags

See https://github.com/explosion/spaCy/blob/master/spacy/glossary.py#L319-L352

Parameters

type (str): `spacy`

input_fields (list)

Input fields on which the normalizer is applied.

output_fields (list)

Output fields to save the analyzed spaCy document.

spacy_model_mapping (dict)

Model to use per language code.

infix_split_hyphen (bool)

Whether to split tokens on intra-word hyphens, for example "covid-19". Set to false to keep such terms as a single token.

merge_entities (bool)

Recognize named entities and merge them into one spaCy token, for example "new york".

merge_noun_chunks (bool)

Merge relevant noun chunks into one spaCy token, for example:
extend brexit deadline → extend "brexit deadline"

cacheable (bool)

Cache the selected models.

Configuration

{
  "step": "normalizer",
  "type": "spacy",
  "cacheable": true,
  "infix_split_hyphen": false,
  "merge_entities": true,
  "merge_noun_chunks": false,
  "input_fields": [
    "user_terms_str"
  ],
  "output_fields": [
    "nlp"
  ],
  "exclude_spacy_pipes": [],
  "spacy_model_mapping": {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm"
  }
}

Annotation/Output

Example query: country:us 2020-10 covid-19 cases in new york

Input: "user_terms_str": "2020-10 covid-19 cases in new york"

Annotations:

  • Tokenisation: ["2020-10", "covid-19", "cases", "in", "new york"]

  • POS tagging: [["2020-10","NUM"],["covid-19","NOUN"],["cases","NOUN"],["in","ADP"],["new york","PROPN"]]

  • Named Entity Recognition: [('2020-10', 'DATE'), ('covid-19', 'ORDINAL'), ('new york', 'GPE')]

The analyzed spaCy document is stored under the configured output_fields (here as the nlp field) for further use in subsequent steps.
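
The following Python sketch shows, using plain spaCy, roughly what the configuration above produces. It loads the model mapped for English and enables the merge_entities pipe; the tokenizer customization behind infix_split_hyphen is not shown, so hyphen handling may differ slightly from the actual step.

import spacy

# Requires the model from spacy_model_mapping["en"]:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")   # merge_entities: true -> e.g. "new york" becomes one token

doc = nlp("2020-10 covid-19 cases in new york")

print([t.text for t in doc])                         # tokenisation
print([(t.text, t.pos_) for t in doc])               # POS tagging
print([(ent.text, ent.label_) for ent in doc.ents])  # named entity recognition

item = {"nlp": doc}   # analyzed document stored under the configured output field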

custom.pos_booster

Custom enrich step named pos_booster.

Annotates the query item with a pos_mutations dictionary that contains the boosting weight for each token.
The weight is chosen based on the term's POS tag (as defined in pos_weight_map).

Parameters

type (str, "enrich"): `enrich`

pos_weight_map (dict, {"PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5, "X": "-", "NUM": "-", "SYM": "-"})

Dictionary mapping between SpaCy POS tag to weight used for term boosting.

  • Boost term relevancy: A higher number boosts the relevancy of matched terms.

  • Skip term boosting: Don't change tokens of a POS type by setting the corresponding weight to "-".

strict_filter (bool, False)

Remove terms whose POS tag is not included in pos_weight_map.

phrase_proximity_distance (int, 15)

Merged spaCy tokens are converted into loose phrases:

  • All terms need to be matched in close proximity to each other, for example "brexit deadline"~15.

output_field (str, "pos_mutations")

Map of term => replacement

fallback_language (str, "en")

Default language to use

Configuration

{
  "step": "custom",
  "type": "enrich",
  "name": "pos_booster",
  "strict_filter": true,
  "phrase_proximity_distance": 3,
  "analyzed_input_field": "nlp",
  "pos_weight_map": {
    "PROPN": 10,
    "NOUN": 10,
    "VERB": 2,
    "ADJ": 5,
    "X": -,
    "NUM": -,
    "SYM": -
  }
}

Annotation/Output

Example query: country:us 2020-10 covid-19 cases in new york

Annotations:

"pos_mutations": [
{"covid-19": "\"covid-19\"~3"},     # `covid-19` needs to be matched as phrase; contains hyphen
{"cases": "cases^10"},          # `cases` gets 10 times boosted
{"in": ""},                  # `in` gets removed because ADP is not defined in the `pos_weight_map`
{"new york":  "\"new york\"~3"}    # `new york` needs to be matched as phrase; merged entity
],
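
The Python sketch below is a simplified re-implementation of this logic, not the libNLP code: it maps POS tags to boosts, skips tags whose weight is "-", drops unmapped tags when strict_filter is enabled, and turns merged or hyphenated terms into loose phrases. Detecting merged or hyphenated terms via their surface form is an assumption made for brevity.

from typing import Dict, List, Tuple

def build_pos_mutations(
    tagged_terms: List[Tuple[str, str]],
    pos_weight_map: Dict[str, object],
    strict_filter: bool = True,
    phrase_proximity_distance: int = 3,
) -> List[Dict[str, str]]:
    mutations: List[Dict[str, str]] = []
    for term, pos in tagged_terms:
        if " " in term or "-" in term:
            # Merged entities and hyphenated terms become loose phrases.
            mutations.append({term: f'"{term}"~{phrase_proximity_distance}'})
            continue
        weight = pos_weight_map.get(pos)
        if weight == "-":
            continue                          # skip boosting for this POS type
        if weight is None:
            if strict_filter:
                mutations.append({term: ""})  # strict_filter removes unmapped POS tags
            continue
        mutations.append({term: f"{term}^{weight}"})
    return mutations

tagged = [("2020-10", "NUM"), ("covid-19", "NOUN"), ("cases", "NOUN"),
          ("in", "ADP"), ("new york", "PROPN")]
weights = {"PROPN": 10, "NOUN": 10, "VERB": 2, "ADJ": 5, "X": "-", "NUM": "-", "SYM": "-"}
print(build_pos_mutations(tagged, weights))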

custom.query_modifier

Custom enrich step named query_modifier.

The query-modifier applies all mutations collected from prior steps to the initial query and outputs the enriched_query.

type (str, "enrich"):

Mutates

raw_input_field (str, "query")

Raw user query to modify.

term_mutations_metadata (list, ["pos_mutations"])

Mutations to apply; order matters.

output_field (str, "enriched_query")

The modified query string.

Configuration

{
  "step": "custom",
  "type": "enrich",
  "name": "query_modifier",
  "raw_input_field": "query",
  "term_mutations_metadata": ["pos_mutations"],
  "output_field": "enriched_query"
}

Annotation/Output

Example query: country:us 2020-10 covid-19 cases in new york

Output:

"enriched_query": "country:us \"2020-10\"~3 \"covid-19\"~3 cases^10 \"new york\"~3"

How-to Guides

For details on customizing the query processing, see How-to Customize and Upload libNLP Workflow for Query Processing.