
This guide showcases how to:

  • write a custom libNLP step to extend the capabilities of the default query processing workflow

  • upload and set a new query processing workflow on your project

The example in this guide adds a custom Query Classifier that restricts the search operation to a smaller, filtered subset of documents (inferred faceted search). This allows application developers to improve the overall search experience, in this case by returning documents that share the same topic as the user query.

Example:

  • Query processing input:
    main symptoms of flu vs covid
    Classified label: topic:"health care"

  • Query processing output:
    (main symptoms of flu vs covid) AND (topic:"health care")

You can also implement any other kind of step by following the same approach outlined in this guide.

Here is some inspiration:

  • Abbreviation Expansion
    Expand known domain-specific abbreviations (for example, by fetching information from a third-party system), as sketched below:
    SQuAD → (SQuAD OR "Stanford Question Answering Dataset")
    IT → (IT OR "Information Technology")
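
As a rough illustration, such a step could look like the following minimal sketch. It is modeled on the BatchedStep pattern used by the custom classifier later in this guide; the ABBREVIATIONS map, the whole-query match, and the my_expanded_query field are illustrative assumptions, not part of any built-in step.

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document

# Illustrative, hard-coded abbreviation map; in practice this could be
# fetched from a third-party system.
ABBREVIATIONS = {
    "SQuAD": "Stanford Question Answering Dataset",
    "IT": "Information Technology",
}


class MyAbbreviationExpander(BatchedStep):
    """Expand a known abbreviation into an OR-query."""

    def process_doc(self, doc: Document):
        query = doc.fields.get("user_terms_str", "")
        expansion = ABBREVIATIONS.get(query.strip())
        if expansion:
            # e.g. SQuAD -> (SQuAD OR "Stanford Question Answering Dataset")
            doc.fields["my_expanded_query"] = f'({query} OR "{expansion}")'
        return doc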

Requirements to follow all steps in this guide:

  • Local installation of Squirro Toolbox

  • Local installation of libNLP

  • Locally installed en_core_web_sm spaCy model

    python -m spacy download en_core_web_sm
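
    The sample configuration below also maps German queries to de_core_news_sm; if you keep that mapping, download this model as well:

    python -m spacy download de_core_news_sm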

Workflow Structure and Steps

A workflow is configured in JSON format and can be composed of any combination of built-in and custom steps. Custom steps must be placed in the same folder as the configuration:

/custom_query_processing
  / config.json
  / my_query_classifier.py

The provided config.json is an extension of the default query processing workflow:

  • Built-in query-processing steps use "step": "app" and "type": "query_processing"

 Overall config.json
{
    "cacheable": true,
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "custom_spacy_normalizer",
            "cache_document": true,
            "infix_split_hyphen": false,
            "infix_split_chars": ":<>=",
            "merge_noun_chunks": false,
            "merge_phrases": true,
            "merge_entities": true,
            "fallback_language": "en",
            "exclude_spacy_pipes": [],
            "spacy_model_mapping": {
                "en": "en_core_web_sm",
                "de": "de_core_news_sm"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 15,
            "pos_weight_map": {
                "PROPN": 10,
                "NOUN": 10,
                "VERB": 2,
                "ADJ": 5,
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "custom",
            "type": "classifier",
            "name": "my_query_classifier",
            "model": "valhalla/distilbart-mnli-12-1",
            "target_facet":"topic",
            "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
            "output_field": "my_classified_topic"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier",
            "term_mutations_metadata": [
              "pos_mutations",
              "my_classified_topic"
            ]
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": ["user_terms", "facet_filters", "pos_mutations", "type", "enriched_query","my_classified_topic"],
            "log_level": "info"
        }
    ]
}

 Reference: custom step config
# 1) Custom step that writes the metadata field: `my_classified_topic`
{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet":"topic",
    "target_classes": ['login tutorial', 'sports', 'health care', 'merge and acquisition', 'stock market'],
    "output_field": "my_classified_topic"
},

# 2) The built-in `query_modifier` step rewrites the original query based on metadata added in prior steps in the pipeline 
#    -> like: `query = f"{original_query} AND {my_classified_topic}"`
{
    "step": "app",
    "type": "query_processing",
    "name": "query_modifier",
    "term_mutations_metadata": [
        "pos_mutations",
        "my_classified_topic"
    ]
}

Custom Classifier Step

Add the file my_query_classifier.py with the following content:

 Implementation of a custom QueryClassifier step
import functools
import logging

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document
from squirro.lib.nlp.utils.cache import CacheDocument

from squirro.common.profiler import SlowLog

from transformers import Pipeline as ZeroShotClassifier
from transformers import pipeline


class MyClassifier(BatchedStep):
    """
    Classify query into predefined classes using zero-shot-classification.


    Parameters:
        input_field (str, "user_terms_str"): raw user query strings
        model (str, "valhalla/distilbart-mnli-12-1"): zero shot classification to use
        target_facet (str): Target squirro-label used for faceted search
        target_classes (list, ["stocks", "sport", "music"]): Possible classes
        output_field (str, "my_classified_topic"): new facet filters to append to the query
        confidence_threshold (float, 0.3): Use classified labels only if model predicted it with high enough confidence
        step (str, "custom"): my classifier
        type (str, "classifier"): my classifier
        name (str, "my_classifier"): my classifier
        path (str, "."): my classifier
    """

    def quote_facet_name(self, label):
        # Quote multi-word labels so they are parsed as a phrase,
        # e.g. health care -> "health care"
        if len(label.split()) > 1:
            label = f'"{label}"'
        return label

    @CacheDocument
    @SlowLog(logger=logging.info, suffix="0-shot-classifier", threshold=100)
    def process_doc(self, doc: Document):
        try:
            classifier: ZeroShotClassifier = self.model_cache.get_and_save_model(
                self.model,
                functools.partial(
                    pipeline, task="zero-shot-classification", model=self.model
                ),
            )
        except Exception:
            logging.exception("Huggingface pipeline crashed")
            # make sure that aborted tasks are not used for caching
            return doc.abort_processing()

        query = doc.fields.get(self.input_field)
        # Predictions are sorted by confidence; use the top label and its score.
        predictions = classifier(query, self.target_classes)
        value = predictions["labels"][0]
        score = predictions["scores"][0]

        if score > self.confidence_threshold:
            # Append a facet filter like `topic:"health care"` to the document.
            doc.fields[
                self.output_field
            ] = f"{self.target_facet}:{self.quote_facet_name(value)}"
        return doc


Configuration

You can configure the step in config.json according to the parameters documented in the step's docstring.
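
For example, to apply the classified label only for more confident predictions, you could raise confidence_threshold in the custom step's entry (the value 0.6 is purely illustrative):

{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
    "output_field": "my_classified_topic",
    "confidence_threshold": 0.6
}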

Local Testing

Test your libNLP step locally during development: instantiate your step and provide a squirro.lib.nlp.document.Document together with the configuration for the steps you want to test.

Example content for a simple baseline test test_my_classifier.py:

from my_query_classifier import MyClassifier
from squirro.lib.nlp.document import Document

if __name__ == "__main__":
    # Documents are tagged with a facet called `topic`
    target_facet = "topic"
    # The facet `topic` can be one of the following values from `target_classes`
    target_classes = ['login tutorial', 'sports', 'health care', 'merge and acquisition', 'stock market']

    # Instantiate custom classifier step
    step = MyClassifier(config={
        "target_facet": "topic",
        "target_classes": target_classes,
    })

    # Setup simple test cases
    queries = [
        "how to connect to wlan",
        "elon musk buys shares at twitter",
        "main symptoms of flu vs covid"
    ]

    for query in queries:
        doc = Document(doc_id="", fields={"user_terms_str": query})
        step.process_doc(doc)
        print("=================")
        print(f"Classified Query")
        print(f"\tQuery:\t{query}")
        print(f"\tLabel:\t{doc.fields.get('facet_filters')}")

Demo Output

$ python test_my_classifier.py

 =================
 Classified Query
        Query:  'how to connect to wlan'
        Label:  'topic:"login tutorial"'
 =================
 Classified Query
        Query:  'elon musk buys shares at twitter'
        Label:  'topic:"stock market"'
 =================
 Classified Query
        Query:  'main symptoms of flu vs covid'
        Label:  'topic:"health care"'
 =================

Upload

You need the token, cluster, and project_id to upload the workflow to your Squirro project.

Upload the workflow using the upload_workflow.py script below. Execute it from the location of your workflow (or provide the correct path to your steps):

python upload_workflow.py --cluster=$cluster \
                          --project-id=$project_id \
                          --token=$token \
                          --config=config.json \
                          --custom-steps="."
 upload_workflow.py
import argparse
import json
from pathlib import Path

from squirro_client import SquirroClient

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--cluster", required=False, help="Squirro API", default="http://localhost:80"
    )
    parser.add_argument("--project-id", required=True, help="Squirro project ID")
    parser.add_argument("--token", required=True, help="Api Token")
    parser.add_argument(
        "--config", default="config.json", help="Path to workflow configuration"
    )
    parser.add_argument(
        "--custom-steps", default=".", help="Path to custom step implementation"
    )

    args = parser.parse_args()

    client = SquirroClient(None, None, cluster=args.cluster)
    client.authenticate(refresh_token=args.token)
    with open(args.config) as f:
        config = json.load(f)
    config["dataset"] = {"items": []}
    client.new_machinelearning_workflow(
        project_id=args.project_id,
        name=config.get("name", "Uploaded Ml-Workflow"),
        config=config,
        ml_models=str(Path(args.custom_steps).absolute()) + "/",
        type="query"
    )

Enable Your Custom Workflow for Query Processing

Now switch to the Squirro project in your browser and navigate to ML Workflows under the AI STUDIO tab.

Click SET ACTIVE to use your custom workflow for query processing. If you want to change any of the configurations of the uploaded steps, click EDIT.

Troubleshooting

How is the workflow executed?

Currently, the workflow is integrated into a Squirro application via the natural language understanding plugin, as depicted in the overview here. The search bar first reaches out to the natural language query plugin's /parse endpoint, which triggers the configured query processing workflow.
→ Check the network tab in your browser and inspect the API response.

 Query Processing API Response
{
   "original_query":"how to connect to wlan",
   "language":[
      "en"
   ],
   "type":[
      "question_or_statement"
   ],
   "query":"connect^5 wlan^10",
   "user_terms":[
      "how",
      "to",
      "connect",
      "to",
      "wlan"
   ],
   "facet_filters":[
   ],
   "my_classified_topic":['topic:"login tutorial"']
}

Where to find Query Processing Logs?

The machinelearning service is responsible for running the configured query-processing workflow end-to-end and logs debugging information and detailed error data.

 Pipeline Logs

Query Processing Logs: The pipeline itself logs enriched metadata as configured via the LogFieldDebugger step.

{
    "step": "debugger",
    "type": "log_fields",
    "fields": [
        "user_terms",
        "type",
        "enriched_query",
        "my_classified_topic"  # appended by our classifier
    ],
    "log_level": "info"
}

Relevant log file: /var/log/squirro/machinelearning/machinelearning.log
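
You can follow the log live while issuing queries from the search bar:

tail -f /var/log/squirro/machinelearning/machinelearning.log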

2021-12-08 22:42:47,506 | worker 7 [fields_debugger::_log_fields()::25] INFO     ++++++++++++++++++++++++++++++++++++++++++++++++++++
2021-12-08 22:42:47,507 | worker 7 [fields_debugger::_log_fields()::26] INFO     Logging fields for Document 'query' (skipped=False)
2021-12-08 22:42:47,513 | worker 7 [fields_debugger::_log_fields()::38] INFO     'type' ----> 'question_or_statement'
2021-12-08 22:42:47,514 | worker 7 [fields_debugger::_log_fields()::38] INFO     'my_classified_topic' ----> 'topic:"login tutorial"'
2021-12-08 22:42:47,514 | worker 7 [fields_debugger::_log_fields()::35] INFO     'enriched_query' (truncated) ----> 'how to connect to wlan topic:"login tutorial"'
