This guide showcases how to:

  • write a custom libNLP step to extend the capabilities of the default query processing workflow

  • upload and set a new query processing workflow on your project

The example in this guide adds a custom Query Classifier to perform the search operation only within a smaller, filtered subset (inferred faceted search). This allows application developers to improve the overall search experience, in this case by returning documents that share the same topic as the user query.

Example:

  • Query processing input:
    main symptoms of flu vs covid
    Classified label: topic:"health care"

  • Query processing output:
    (main symptoms of flu vs covid) AND (topic:"health care")

You can also implement any other kind of step, following the steps outlined in this guide.

Here is some inspiration:

  • Expansion of Abbreviations (see the sketch below)
    Expand known domain-specific abbreviations (fetching information from a 3rd party system):
    SQuAD → (SQuAD OR "Stanford Question Answering Dataset")
    IT → (IT OR "Information Technology")
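
A minimal sketch of such an abbreviation-expansion step, modeled on the BatchedStep interface used by the classifier later in this guide. The class name, abbreviation map, and field names are illustrative assumptions, not a shipped step:

Code Block
languagepy
from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document


class MyAbbreviationExpander(BatchedStep):
    """Expand known domain-specific abbreviations within the user query.

    Parameters:
        input_field (str, "user_terms_str"): raw user query string
        output_field (str, "expanded_query"): query with abbreviations expanded
    """

    # Hard-coded for illustration; a real step might fetch this map
    # from a 3rd party system instead.
    ABBREVIATIONS = {
        "SQuAD": '(SQuAD OR "Stanford Question Answering Dataset")',
        "IT": '(IT OR "Information Technology")',
    }

    def process_doc(self, doc: Document):
        query = doc.fields.get(self.input_field, "")
        # Replace every known abbreviation with its OR-expansion.
        expanded = [self.ABBREVIATIONS.get(term, term) for term in query.split()]
        doc.fields[self.output_field] = " ".join(expanded)
        return doc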

Info

Requirements to follow all steps in this guide:

  • Local installation of Squirro Toolbox

  • Local installation of libNLP

  • Locally installed en_core_web_sm spaCy model

    Code Block
    python -m spacy download en_core_web_sm

Workflow Structure and Steps

A workflow is configured in JSON format and can be composed of any combination of built-in and custom steps. Custom steps must be placed in the same folder as the configuration.

Code Block
/custom_query_processing
  / config.json
  / my_query_classifier.py

The provided config.json is an extension of the default query processing workflow.

  • Built-in query processing steps use "step": "app" and "type": "query_processing"

config.json (overall)
Code Block
languagejson
{
    "cacheable": true,
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "custom_spacy_normalizer",
            "cache_document": true,
            "infix_split_hyphen": false,
            "infix_split_chars": ":<>=",
            "merge_noun_chunks": false,
            "merge_phrases": true,
            "merge_entities": true,
            "fallback_language": "en",
            "exclude_spacy_pipes": [],
            "spacy_model_mapping": {
                "en": "en_core_web_sm",
                "de": "de_core_news_sm"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 15,
            "pos_weight_map": {
                "PROPN": 10,
                "NOUN": 10,
                "VERB": 2,
                "ADJ": 5,
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "custom",
            "type": "classifier",
            "name": "my_query_classifier",
            "model": "valhalla/distilbart-mnli-12-1",
            "target_facet":"topic",
            "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
            "output_field": "my_classified_topic"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier",
            "term_mutations_metadata": [
              "pos_mutations",
              "my_classified_topic"
            ]
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": ["user_terms", "facet_filters", "pos_mutations", "type", "enriched_query","my_classified_topic"],
            "log_level": "info"
        }
    ]
}

Reference: custom step config
Code Block
languagejson
# 1) Custom step that writes metadata: `my_classified_topic`
{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
    "output_field": "my_classified_topic"
},

# 2) The built-in `query_modifier` step rewrites the original query based on metadata added by prior steps in the pipeline
#    -> like: `query = f"{original_query} AND {my_classified_topic}"`
{
    "step": "app",
    "type": "query_processing",
    "name": "query_modifier",
    "term_mutations_metadata": [
      "pos_mutations",
      "my_classified_topic"
    ]
}
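
The following minimal sketch illustrates the effect of that rewrite. It is illustrative only; modify_query is a hypothetical helper, not the actual query_modifier implementation:

Code Block
languagepy
def modify_query(original_query: str, fields: dict, term_mutations_metadata: list) -> str:
    """Append every non-empty metadata field to the original query."""
    mutations = [fields[key] for key in term_mutations_metadata if fields.get(key)]
    if not mutations:
        return original_query
    return " AND ".join(f"({part})" for part in [original_query, *mutations])


print(modify_query(
    "main symptoms of flu vs covid",
    {"my_classified_topic": 'topic:"health care"'},
    ["pos_mutations", "my_classified_topic"],
))
# -> (main symptoms of flu vs covid) AND (topic:"health care")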

Custom Classifier Step

Add the file my_query_classifier.py as:

Implementation of a custom QueryClassifier step
Code Block
languagepy
import functools
import logging

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document
from squirro.lib.nlp.utils.cache import CacheDocument

from squirro.common.profiler import SlowLog

from transformers import Pipeline as ZeroShotClassifier
from transformers import pipeline


class MyClassifier(BatchedStep):
    """
    Classify query into predefined classes using zero-shot-classification.

    Parameters:
        input_field (str, "user_terms_str"): raw user query string
        model (str, "valhalla/distilbart-mnli-12-1"): zero-shot classification model to use
        target_facet (str): target Squirro label used for faceted search
        target_classes (list, ["stocks", "sport", "music"]): possible classes
        output_field (str, "my_classified_topic"): new facet filter to append to the query
        confidence_threshold (float, 0.3): use a classified label only if the model predicted it with high enough confidence
        step (str, "custom"): step type identifier used in the workflow config
        type (str, "classifier"): step category used in the workflow config
        name (str, "my_classifier"): module name of the custom step
        path (str, "."): path to the custom step implementation
    """

    def quote_facet_name(self, label):
        if len(label.split()) > 1:
            label = f'"{label}"'
        return label

    @CacheDocument
    @SlowLog(logger=logging.info, suffix="0-shot-classifier", threshold=100)
    def process_doc(self, doc: Document):
        try:
            classifier: ZeroShotClassifier = self.model_cache.get_and_save_model(
                self.model,
                functools.partial(
                    pipeline, task="zero-shot-classification", model=self.model
                ),
            )
        except Exception:
            logging.exception("Huggingface pipeline crashed")
            # make sure that aborted tasks are not used for caching
            return doc.abort_processing()

        query = doc.fields.get(self.input_field)
        predictions = classifier(query, self.target_classes)
        value = predictions["labels"][0]
        score = predictions["scores"][0]

        if score > self.confidence_threshold:
            doc.fields[
                self.output_field
            ] = f"{self.target_facet}:{self.quote_facet_name(value)}"
        return doc


Configuration

You can configure the step in config.json according to the Parameters documented in the step's docstring.
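
Every name listed under Parameters can be overridden as a key in the step's config.json entry, or in the config dict when testing locally. A minimal sketch, assuming the defaults shown above (the 0.7 threshold is just an illustrative override):

Code Block
languagepy
from my_query_classifier import MyClassifier

# Minimal sketch: override docstring parameters via the config dict,
# exactly as you would in the step's entry in config.json.
step = MyClassifier(config={
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care"],
    "confidence_threshold": 0.7,  # default 0.3; illustrative override
})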

Local Testing

Test your libNLP step locally during development. Instantiate your step and provide a squirro.lib.nlp.document.Document together with the configuration for the steps you want to test.

Example content for a simple baseline test test_my_classifier.py:

Code Block
languagepy
from my_query_classifier import MyClassifier

from squirro.lib.nlp.document import Document

if __name__ == "__main__":
    # Documents are tagged with a facet called `topic`
    target_facet = "topic"
    # The facet `topic` can take one of the following values from `target_classes`
    target_classes = ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"]

    # Instantiate the custom classifier step
    step = MyClassifier(config={
        "target_facet": target_facet,
        "target_classes": target_classes,
    })

    # Set up simple test cases
    queries = [
        "how to connect to wlan",
        "elon musk buys shares at twitter",
        "main symptoms of flu vs covid"
    ]

    for query in queries:
        doc = Document(doc_id="", fields={"user_terms_str": query})
        step.process_doc(doc)
        print("=================")
        print("Query Classified")
        print(f"\tQuery:\t'{query}'")
        # The step writes its result to `output_field` (default: my_classified_topic)
        print(f"\tLabel:\t'{doc.fields.get('my_classified_topic')}'")

Demo Output

Code Block
languagebash
$ python test_my_classifier.py

 =================
 Query Classified
        Query:  'how to connect to wlan'
        Label:  'topic:"login tutorial"'
 =================
 Query Classified
        Query:  'elon musk buys shares at twitter'
        Label:  'topic:"stock market"'
 =================
 Query Classified
        Query:  'main symptoms of flu vs covid'
        Label:  'topic:"health care"'
 =================

Upload

You need the token, cluster, and project_id to upload the workflow to your Squirro project.

Upload the workflow using the upload_workflow.py script below. Execute it from the location of your workflow (or provide the correct path to your steps):

Code Block
languagebash
python upload_workflow.py --cluster=$cluster \
                          --project-id=$project_id \
                          --token=$token \
                          --config=config.json \
                          --custom-steps="."
upload_workflow.py
Code Block
languagepy
import argparse
import json
from pathlib import Path

from squirro_client import SquirroClient

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--cluster", required=False, help="Squirro API", default="http://localhost:80"
    )
    parser.add_argument("--project-id", required=True, help="Squirro project ID")
    parser.add_argument("--token", required=True, help="Api Token")
    parser.add_argument(
        "--config", default="config.json", help="Path to workflow configuration"
    )
    parser.add_argument(
        "--custom-steps", default=".", help="Path to custom step implementation"
    )

    args = parser.parse_args()

    client = SquirroClient(None, None, cluster=args.cluster)
    client.authenticate(refresh_token=args.token)
    with open(args.config) as f:
        config = json.load(f)
    config["dataset"] = {"items": []}
    client.new_machinelearning_workflow(
        project_id=args.project_id,
        name=config.get("name", "Uploaded Ml-Workflow"),
        config=config,
        ml_models=str(Path(args.custom_steps).absolute()) + "/",
        type="query"
    )

Enable Your Custom Workflow for Query Processing

Now switch to the Squirro project in your browser and navigate to ML Workflows under the AI STUDIO tab.

Click SET ACTIVE to use your custom workflow for query processing. If you want to change any of the configurations of the uploaded steps, click EDIT.

...

Troubleshooting

How is the workflow executed?

Currently the workflow is integrated into a Squirro application via the natural language understanding plugin, as depicted in the overview. The search bar first reaches out to the /parse endpoint of the natural language query plugin, which triggers the configured query processing workflow.
→ Check the network tab in your browser and inspect the API response.

Query Processing API Response
Code Block
languagejson
{
   "original_query":"how to connect to wlan",
   "language":[
      "en"
   ],
   "type":[
      "question_or_statement"
   ],
   
   "query":"connect^5 wlan^10",
   "user_terms":[
      "how",
      "to",
      "connect",
      "to",
      "wlan"
   ],
   "facet_filters":[
   ],
   "my_classified_topic":['topic:"login tutorial"']
}

Where to find Query Processing Logs?

The machinelearning service is responsible for running the configured query processing workflow end-to-end and logs debugging information and detailed error data.

...

Pipeline Logs

Query Processing Logs: The pipeline itself logs enriched metadata as configured via the LogFieldDebugger step.

Code Block
languagejson
{
    "step": "debugger",
    "type": "log_fields",
    "fields": [
        "user_terms",
        "type",
        "enriched_query",
        "my_classified_topic"  # appended by our classifier
    ],
    "log_level": "info"
}

Relevant log file: /var/log/squirro/machinelearning/machinelearning.log

...

This page can now be found at How to Create a Custom Query Processing Step on the Squirro Docs site.