This guide showcases how to:

- Write a custom libNLP step to extend the capabilities of the default query processing workflow.
- Upload and set a new query processing workflow on your project.
The example in this guide adds a custom query classifier that restricts the search operation to a smaller, filtered subset (inferred faceted search). This allows application developers to improve the overall search experience, in this case by returning only documents that share the same topic as the user query.
Example:

Query processing input:  main symptoms of flu vs covid
Classified label:        topic:"health care"
Query processing output: (main symptoms of flu vs covid) AND (topic:"health care")
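The rewriting shown above can be sketched in plain Python. The function name is illustrative only and not part of the Squirro API:

```python
def apply_facet_filter(query: str, facet_filter: str) -> str:
    """Combine the raw user query with a classified facet filter."""
    return f"({query}) AND ({facet_filter})"

print(apply_facet_filter("main symptoms of flu vs covid", 'topic:"health care"'))
# (main symptoms of flu vs covid) AND (topic:"health care")
```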
You can also implement any other kind of step by following the approach outlined in this guide. Here is some inspiration:

Abbreviation expansion: expand known domain-specific abbreviations (fetching information from a third-party system), for example:

    SQuAD → (SQuAD OR "Stanford Question Answering Dataset")
    IT → (IT OR "Information Technology")
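The core logic of such an expansion step can be sketched independently of the libNLP step interface. The dictionary and function name below are illustrative assumptions; in practice the abbreviations could come from a third-party system as mentioned above:

```python
# Illustrative abbreviation dictionary (an assumption for this sketch).
ABBREVIATIONS = {
    "SQuAD": "Stanford Question Answering Dataset",
    "IT": "Information Technology",
}

def expand_abbreviations(query: str) -> str:
    """Expand known abbreviations into an OR-group, term by term."""
    expanded = []
    for term in query.split():
        if term in ABBREVIATIONS:
            expanded.append(f'({term} OR "{ABBREVIATIONS[term]}")')
        else:
            expanded.append(term)
    return " ".join(expanded)

print(expand_abbreviations("SQuAD leaderboard"))
# (SQuAD OR "Stanford Question Answering Dataset") leaderboard
```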
Info

Requirements to follow all steps in this guide:
Workflow Structure and Steps

A workflow is configured in JSON format and can be composed of any combination of built-in and custom steps. Custom steps have to be placed in the same folder as the configuration:

```
/custom_query_processing
    config.json
    my_query_classifier.py
```
The provided config.json is an extension of the default query processing workflow. Built-in query-processing steps are of type step: app and type: query_processing.
Custom Classifier Step
Add the file my_query_classifier.py with the following implementation of a custom QueryClassifier step:

```python
import functools
import logging

from squirro.lib.nlp.steps.batched_step import BatchedStep
from squirro.lib.nlp.document import Document
from squirro.lib.nlp.utils.cache import CacheDocument
from squirro.common.profiler import SlowLog
from transformers import Pipeline as ZeroShotClassifier
from transformers import pipeline


class MyClassifier(BatchedStep):
    """Classify a query into predefined classes using zero-shot classification.

    Parameters:
        input_field (str, "user_terms_str"): raw user query string
        model (str, "valhalla/distilbart-mnli-12-1"): zero-shot classification model to use
        target_facet (str): target Squirro label used for faceted search
        target_classes (list, ["stocks", "sport", "music"]): possible classes
        output_field (str, "my_classified_topic"): new facet filter to append to the query
        confidence_threshold (float, 0.3): use a classified label only if the model
            predicted it with high enough confidence
        step (str, "custom"): step category
        type (str, "classifier"): step type
        name (str, "my_classifier"): step name (module name of the custom step)
        path (str, "."): path to the custom step
    """

    def quote_facet_name(self, label):
        # Multi-word labels must be quoted to remain a single facet value
        if len(label.split()) > 1:
            label = f'"{label}"'
        return label

    @CacheDocument
    @SlowLog(logger=logging.info, suffix="0-shot-classifier", threshold=100)
    def process_doc(self, doc: Document):
        try:
            classifier: ZeroShotClassifier = self.model_cache.get_and_save_model(
                self.model,
                functools.partial(
                    pipeline, task="zero-shot-classification", model=self.model
                ),
            )
        except Exception:
            logging.exception("Huggingface pipeline crashed")
            # make sure that aborted tasks are not used for caching
            return doc.abort_processing()

        query = doc.fields.get(self.input_field)
        predictions = classifier(query, self.target_classes)
        value = predictions["labels"][0]
        score = predictions["scores"][0]
        if score > self.confidence_threshold:
            doc.fields[
                self.output_field
            ] = f"{self.target_facet}:{self.quote_facet_name(value)}"
        return doc
```
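The label-selection logic of process_doc can be illustrated without loading a real model: a zero-shot classification pipeline returns candidate labels sorted by descending score, and only a sufficiently confident top label is turned into a facet filter. The scores below are invented for illustration:

```python
# Fake prediction in the shape returned by a zero-shot-classification
# pipeline (labels sorted by descending score); scores are made up.
predictions = {
    "labels": ["health care", "sports", "stock market"],
    "scores": [0.71, 0.18, 0.11],
}
confidence_threshold = 0.3
target_facet = "topic"

def quote_facet_name(label: str) -> str:
    # Multi-word labels must be quoted to remain a single facet value
    return f'"{label}"' if len(label.split()) > 1 else label

value, score = predictions["labels"][0], predictions["scores"][0]
facet_filter = None
if score > confidence_threshold:
    facet_filter = f"{target_facet}:{quote_facet_name(value)}"

print(facet_filter)
# topic:"health care"
```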
Configuration

You can configure the step in config.json according to the parameters documented in the step's docstring.
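As a sketch, a step entry in config.json could look as follows. The surrounding workflow structure is omitted; the defaults come from the step's docstring, and the facet and class values are illustrative (they mirror the local test further down):

```json
{
    "step": "custom",
    "type": "classifier",
    "name": "my_classifier",
    "path": ".",
    "input_field": "user_terms_str",
    "model": "valhalla/distilbart-mnli-12-1",
    "target_facet": "topic",
    "target_classes": ["login tutorial", "sports", "health care", "merge and acquisition", "stock market"],
    "output_field": "my_classified_topic",
    "confidence_threshold": 0.3
}
```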
Local Testing

Test your libNLP step locally during development. Instantiate your code and provide a squirro.lib.nlp.document.Document together with the configuration for the steps you want to test.

Example content for a simple baseline test test_my_classifier.py:
```python
from my_query_classifier import MyClassifier
from squirro.lib.nlp.document import Document

if __name__ == "__main__":
    # Documents are tagged with a facet called `topic`
    target_facet = "topic"
    # The facet `topic` can be one of the following values from `target_classes`
    target_classes = [
        "login tutorial",
        "sports",
        "health care",
        "merge and acquisition",
        "stock market",
    ]

    # Instantiate the custom classifier step
    step = MyClassifier(
        config={
            "target_facet": target_facet,
            "target_classes": target_classes,
        }
    )

    # Set up simple test cases
    queries = [
        "how to connect to wlan",
        "elon musk buys shares at twitter",
        "main symptoms of flu vs covid",
    ]
    for query in queries:
        doc = Document(doc_id="", fields={"user_terms_str": query})
        step.process_doc(doc)
        print("=================")
        print("Classified Query")
        print(f"\tQuery:\t{query}")
        print(f"\tLabel:\t{doc.fields.get('my_classified_topic')}")
```
Demo Output

```
$ python test_my_classifier.py
=================
Classified Query
	Query:	how to connect to wlan
	Label:	topic:"login tutorial"
=================
Classified Query
	Query:	elon musk buys shares at twitter
	Label:	topic:"stock market"
=================
Classified Query
	Query:	main symptoms of flu vs covid
	Label:	topic:"health care"
=================
```
Upload

You require the token, cluster, and project_id to upload the workflow to your Squirro project.

Upload the workflow using the upload_workflow.py script. Execute it at the location of your workflow (or provide the correct path to your steps):
```bash
python upload_workflow.py --cluster=$cluster \
    --project-id=$project_id \
    --token=$token \
    --config=config.json \
    --custom-steps="."
```
Enable Your Custom Workflow for Query Processing
Now switch to the Squirro project in your browser and navigate to ML Workflows under the AI STUDIO tab.
Click SET ACTIVE to use your custom workflow for query processing. If you want to change any of the configurations of the uploaded steps, click EDIT.
Troubleshooting
How is the workflow executed?
Currently the workflow is integrated into a Squirro application via the natural language understanding plugin, as depicted in the overview here. The search bar first reaches out to the natural language query plugin's /parse endpoint, which triggers the configured query processing workflow.

→ Check the network tab in your browser and inspect the API response.
Where to Find Query Processing Logs?

The machinelearning service is responsible for running the configured query-processing workflow end-to-end and logs debugging and detailed error data.

Pipeline Logs
Query Processing Logs: The pipeline itself logs enriched metadata as configured via the LogFieldDebugger step.
The my_classified_topic field is the one appended by our classifier:

```json
{
    "step": "debugger",
    "type": "log_fields",
    "fields": [
        "user_terms",
        "type",
        "enriched_query",
        "my_classified_topic"
    ],
    "log_level": "info"
}
```
Relevant log file: /var/log/squirro/machinelearning/machinelearning.log
This page can now be found at How to Create a Custom Query Processing Step on the Squirro Docs site.