This guide shows how to extend the query processing workflow with a custom step.
The example in this guide adds a custom Query Classifier to perform the search operation only within a smaller, filtered subset (inferred faceted search). This lets application developers improve the overall search experience, in this case by returning only documents that share the same topic as the user query.
You can also implement any other kind of step by following the approach outlined in this guide.
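To illustrate the effect of the classifier step built in this guide: a query about connecting to the WLAN is classified under the "login tutorial" topic and enriched with a corresponding facet filter, so only documents tagged with that topic are searched (the enriched query logged by the workflow is shown in the Troubleshooting section below):
User query:      how to connect to wlan
Enriched query:  how to connect to wlan topic:"login tutorial"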
Workflow Structure and Steps
A workflow is configured in JSON format and can be composed of any combination of built-in and custom steps. Custom steps must be placed in the same folder as the configuration:
/custom_query_processing
    config.json
    my_query_classifier.py
The provided config.json is an extension of the default query processing workflow.
Overall config.json
{
    "cacheable": true,
    "pipeline": [
        {
            "fields": ["query"],
            "step": "loader",
            "type": "squirro_item"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "syntax_parser"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "lang_detection",
            "fallback_language": "en"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "custom_spacy_normalizer",
            "cache_document": true,
            "infix_split_hyphen": false,
            "infix_split_chars": ":<>=",
            "merge_noun_chunks": false,
            "merge_phrases": true,
            "merge_entities": true,
            "fallback_language": "en",
            "exclude_spacy_pipes": [],
            "spacy_model_mapping": {
                "en": "en_core_web_sm",
                "de": "de_core_news_sm"
            }
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "pos_booster",
            "phrase_proximity_distance": 15,
            "pos_weight_map": {
                "PROPN": 10,
                "NOUN": 10,
                "VERB": 2,
                "ADJ": 5,
                "X": "-",
                "NUM": "-",
                "SYM": "-"
            }
        },
        {
            "step": "custom",
            "type": "classifier",
            "name": "my_query_classifier",
            "model": "valhalla/distilbart-mnli-12-1",
            "target_facet": "topic",
            "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
            "output_field": "my_classified_topic"
        },
        {
            "step": "app",
            "type": "query_processing",
            "name": "query_modifier",
            "term_mutations_metadata": [
                "pos_mutations",
                "my_classified_topic"
            ]
        },
        {
            "step": "debugger",
            "type": "log_fields",
            "fields": ["user_terms", "facet_filters", "pos_mutations", "type", "enriched_query", "my_classified_topic"],
            "log_level": "info"
        }
    ]
}
Reference: custom step config
# 1) Custom step that appends the metadata field `my_classified_topic`
{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
    "output_field": "my_classified_topic"
},

# 2) The built-in `query_modifier` step rewrites the original query based on metadata added by prior steps in the pipeline
# -> like: `query = f"{original_query} AND {my_classified_topic}"`
{
    "step": "app",
    "type": "query_processing",
    "name": "query_modifier",
    "term_mutations_metadata": [
        "pos_mutations",
        "my_classified_topic"
    ]
}
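How the two entries fit together: the custom step stores its result in the metadata field named by output_field, and query_modifier picks up every field listed in term_mutations_metadata when it rebuilds the query. The following Python sketch is an illustration only, not the built-in implementation; the field names are taken from this guide:

def modify_query(original_query, fields, term_mutations_metadata):
    # Collect metadata values produced by earlier steps, e.g.
    # fields["my_classified_topic"] == 'topic:"login tutorial"'
    mutations = [fields[name] for name in term_mutations_metadata if fields.get(name)]
    # Append them to the original query, e.g.
    # 'how to connect to wlan AND topic:"login tutorial"'
    return " AND ".join([original_query] + mutations)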
Custom Classifier Step
Add the file my_query_classifier.py with the following content:
Implementation of a custom QueryClassifier step
import functools
import logging

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document
from squirro.lib.nlp.utils.cache import CacheDocument
from squirro.common.profiler import SlowLog

from transformers import Pipeline as ZeroShotClassifier
from transformers import pipeline


class MyClassifier(BatchedStep):
    """
    Classify query into predefined classes using zero-shot-classification.

    Parameters:
        input_field (str, "user_terms_str"): raw user query string
        model (str, "valhalla/distilbart-mnli-12-1"): zero-shot classification model to use
        target_facet (str): target Squirro label used for faceted search
        target_classes (list, ["stocks", "sport", "music"]): possible classes
        output_field (str, "my_classified_topic"): new facet filters to append to the query
        confidence_threshold (float, 0.3): use the classified label only if the model predicted it with high enough confidence
        step (str, "custom"): my classifier
        type (str, "classifier"): my classifier
        name (str, "my_classifier"): my classifier
        path (str, "."): my classifier
    """

    def quote_facet_name(self, label):
        if len(label.split()) > 1:
            label = f'"{label}"'
        return label

    @CacheDocument
    @SlowLog(logger=logging.info, suffix="0-shot-classifier", threshold=100)
    def process_doc(self, doc: Document):
        try:
            classifier: ZeroShotClassifier = self.model_cache.get_and_save_model(
                self.model,
                functools.partial(
                    pipeline, task="zero-shot-classification", model=self.model
                ),
            )
        except Exception:
            logging.exception("Huggingface pipeline crashed")
            # make sure that aborted tasks are not used for caching
            return doc.abort_processing()

        query = doc.fields.get(self.input_field)
        predictions = classifier(query, self.target_classes)
        value = predictions["labels"][0]
        score = predictions["scores"][0]
        if score > self.confidence_threshold:
            doc.fields[
                self.output_field
            ] = f"{self.target_facet}:{self.quote_facet_name(value)}"
        return doc
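The step reads only the first entry of predictions["labels"] and predictions["scores"] because the Hugging Face zero-shot-classification pipeline returns the candidate labels sorted by descending confidence. A minimal standalone sketch of that output shape, assuming the transformers package and the model above are available locally:

from transformers import pipeline

classifier = pipeline(task="zero-shot-classification", model="valhalla/distilbart-mnli-12-1")
predictions = classifier(
    "elon musk buys shares at twitter",
    ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
)
# predictions is a dict whose labels are ordered by descending score, e.g.:
# {
#     "sequence": "elon musk buys shares at twitter",
#     "labels": ["stock market", ...remaining labels...],
#     "scores": [...matching confidence values, highest first...],
# }
print(predictions["labels"][0], predictions["scores"][0])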
Configuration
You can configure the step in config.json according to the step parameters documented in the docstring above.
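For example, any of the documented parameters can be overridden in the custom step entry of config.json. The following variant only applies the inferred facet filter for very confident predictions; raising the default confidence_threshold of 0.3 to 0.7 is an illustrative choice, not a recommendation from the original guide:

{
    "step": "custom",
    "type": "classifier",
    "name": "my_query_classifier",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
    "output_field": "my_classified_topic",
    "confidence_threshold": 0.7
}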
Local Testing
Test your libNLP step locally during development. Instantiate your code and provide a squirro.lib.nlp.document.Document together with the configuration for the steps you want to test.
Example content for a simple baseline test, test_my_classifier.py:
from my_query_classifier import MyClassifier
from squirro.lib.nlp.document import Document

if __name__ == "__main__":
    # Documents are tagged with a facet called `topic`
    target_facet = "topic"
    # The facet `topic` can be one of the following values from `target_classes`
    target_classes = ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"]

    # Instantiate custom classifier step
    step = MyClassifier(config={
        "target_facet": target_facet,
        "target_classes": target_classes,
    })

    # Set up simple test cases
    queries = [
        "how to connect to wlan",
        "elon musk buys shares at twitter",
        "main symptoms of flu vs covid",
    ]

    for query in queries:
        doc = Document(doc_id="", fields={"user_terms_str": query})
        step.process_doc(doc)
        print("=================")
        print("Query Classified")
        print(f"\tQuery:\t'{query}'")
        print(f"\tLabel:\t'{doc.fields.get('my_classified_topic')}'")
Demo Output
$ python test_my_classifier.py
=================
Query Classified
Query: 'how to connect to wlan'
Label: 'topic:"login tutorial"'
=================
Query Classified
Query: 'elon musk buys shares at twitter'
Label: 'topic:"stock market"'
=================
Query Classified
Query: 'main symptoms of flu vs covid'
Label: 'topic:"health care"'
=================
Upload
You require the token, cluster, and project_id to upload the workflow to your Squirro project.
Upload the workflow using the upload_workflow.py script below. Execute it at the location of your workflow (or provide the correct path to your steps):
python upload_workflow.py --cluster=$cluster \
    --project-id=$project_id \
    --token=$token \
    --config=config.json \
    --custom-steps="."
upload_workflow.py
import argparse
import json
from pathlib import Path

from squirro_client import SquirroClient

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--cluster", required=False, help="Squirro API", default="http://localhost:80"
    )
    parser.add_argument("--project-id", required=True, help="Squirro project ID")
    parser.add_argument("--token", required=True, help="Api Token")
    parser.add_argument(
        "--config", default="config.json", help="Path to workflow configuration"
    )
    parser.add_argument(
        "--custom-steps", default=".", help="Path to custom step implementation"
    )
    args = parser.parse_args()

    client = SquirroClient(None, None, cluster=args.cluster)
    client.authenticate(refresh_token=args.token)

    config = json.load(open(args.config))
    config["dataset"] = {"items": []}

    client.new_machinelearning_workflow(
        project_id=args.project_id,
        name=config.get("name", "Uploaded Ml-Workflow"),
        config=config,
        ml_models=str(Path(args.custom_steps).absolute()) + "/",
        type="query",
    )
Enable Your Custom Workflow for Query Processing
Now switch to the Squirro project in your browser and navigate to ML Workflows under the AI STUDIO tab.
Click SET ACTIVE to use your custom workflow for query processing. If you want to change any of the configuration of the uploaded steps, click EDIT.
Troubleshooting
How is the workflow executed?
Currently, the workflow is integrated into a Squirro application via the natural language understanding plugin, as depicted in the overview here. The search bar first reaches out to the natural language query plugin's /parse endpoint, which triggers the configured query processing workflow.
→ Open the network tab in your browser and inspect the API response.
Query Processing API Response
{
    "original_query": "how to connect to wlan",
    "language": [
        "en"
    ],
    "type": [
        "question_or_statement"
    ],
    "query": "connect^5 wlan^10",
    "user_terms": [
        "how",
        "to",
        "connect",
        "to",
        "wlan"
    ],
    "facet_filters": [],
    "my_classified_topic": ["topic:\"login tutorial\""]
}
Where to find Query Processing Logs?
The machinelearning service is responsible for running the configured query-processing workflow end-to-end and logs debugging and detailed error data.
Pipeline Logs
Query Processing Logs: The pipeline itself logs enriched metadata as configured via the LogFieldDebugger step.
{
    "step": "debugger",
    "type": "log_fields",
    "fields": [
        "user_terms",
        "type",
        "enriched_query",
        "my_classified_topic"],  # appended by our classifier
    "log_level": "info"
}
Relevant log file: /var/log/squirro/machinelearning/machinelearning.log
2021-12-08 22:42:47,506 | worker 7 [fields_debugger::_log_fields()::25] INFO ++++++++++++++++++++++++++++++++++++++++++++++++++++
2021-12-08 22:42:47,507 | worker 7 [fields_debugger::_log_fields()::26] INFO Logging fields for Document 'query' (skipped=False)
2021-12-08 22:42:47,513 | worker 7 [fields_debugger::_log_fields()::38] INFO 'type' ----> 'question_or_statement'
2021-12-08 22:42:47,514 | worker 7 [fields_debugger::_log_fields()::38] INFO 'my_classified_topic' ----> 'topic:"login tutorial"'
2021-12-08 22:42:47,514 | worker 7 [fields_debugger::_log_fields()::35] INFO 'enriched_query' (truncated) ----> 'how to connect to wlan topic:"login tutorial"'