works with Squirro 3.5.2 and newer

Model-as-a-Service (MaaS) is an initiative to open up Squirro for custom ML models and speed up the prototyping phase for ML projects in Squirro.

Preparation

Before you can use MaaS, you need to install the required packages from the Squirro mirror on your target Squirro server:

yum install squirro-miniforge
yum install squirro-python38-mlflow

Creation of an MLFlow Model

Before you can upload a model, you need to create an MLFlow Model (an example can be found here). This can be done in two ways:

  • train an MLFlow model on your local machine or on your exploration server

  • wrap an existing (pre-trained) model into the structure of a MLFlow Model and run it locally

Either way, MLFlow stores the (trained) model in the MLFlow base folder (mlruns/0/) under a unique hash (<HASH>) after the run command is executed. The minimal structure of an MLFlow Model looks as follows:

├── artifacts
│   └── model
│       ├── conda.yaml
│       ├── MLmodel
│       ├── python_model.pkl
│       └── requirements.txt
└── meta.yaml

MLFlow documentation

Data Structure

To use the MLFlow Model later in the context of a Squirro ML Workflow, you need to adhere to a specific data structure:

  • the input is a pandas DataFrame with an id column and named feature columns

  • the output is again a pandas DataFrame with an id column and result columns

Example:

  • input DataFrame

        id                         text
    0  id0  this is a example sentence.
    1  id1                 hello world.
    2  id2             random sentence.
    3  id3               test sentence.

  • output DataFrame

        id   class
    0  id0  class1
    1  id1  class0
    2  id2  class0
    3  id3  class1
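
The input DataFrame above can be built and serialized with pandas. Older MLFlow scoring servers accept the pandas "split" JSON orientation shown here; the exact accepted format depends on your MLFlow version, so treat this as a sketch:

```python
import json
import pandas as pd

# The example input frame from above.
df = pd.DataFrame({
    "id": ["id0", "id1", "id2", "id3"],
    "text": ["this is a example sentence.", "hello world.",
             "random sentence.", "test sentence."],
})

# Serialize in "split" orientation for the MLFlow scoring endpoint.
payload = df.to_json(orient="split")
print(json.loads(payload)["columns"])  # → ['id', 'text']
```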

Upload of a Model

To upload the MLFlow Model, we provide two options:

  • via squirro_asset (large models >500MB (exact number is under revision) can cause nginx issues; use scp instead):

    • go into the MLFlow base folder

    • send the (trained) model via squirro_asset

      squirro_asset -vvv mlflow_models upload -t $TOKEN -c $CLUSTER -f mlruns/0/<HASH>/
  • via scp:

    • go into the MLFlow base folder

    • ensure that the destination directory exists (on the Squirro server)

      <BASE_DIR>=/var/lib/squirro/topic/assets/mlflow_models # default path
      mkdir -p <BASE_DIR>/mlruns/0
    • compress the directory with the (trained) model (wherever you have trained your model)

      cd mlruns/0/ && tar -czvf trained_model.tar.gz <HASH>/
    • send it to the MLFlow base folder on the Squirro server

      scp trained_model.tar.gz <SQUIRRO_SERVER_URL>:/tmp/
    • ssh into the Squirro server and extract the transferred file

      mv /tmp/trained_model.tar.gz <BASE_DIR>/mlruns/0/  # create the dirs if they do not exist yet
      cd <BASE_DIR>/mlruns/0 && tar -xzvf trained_model.tar.gz
    • adjust the artifact_uri in meta.yaml to point to the new path of the MLFlow Model (file:///<BASE_DIR>/mlruns/0/<HASH>/artifacts)

      sed -i '/artifact_uri/c\artifact_uri: file:///<BASE_DIR>/mlruns/0/<HASH>/artifacts' <HASH>/meta.yaml

Starting of Service

To start a Model-as-a-Service, you need to execute the following steps:

  • make sure you are in the MLFlow base folder on the Squirro server

  • activate the squirro environment

    squirro_activate3
  • serve the model identified by <HASH> as a service listening on the chosen port <PORT>

    mlflow models serve -m runs:/<HASH>/model -p <PORT>
    • use nohup or screen when starting the service so that the MaaS does not stop when you terminate your SSH session
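
Once the service is running, the model can be queried over HTTP. A minimal sketch, assuming the MaaS listens on <PORT> and accepts pandas "split" JSON; the content type varies between MLFlow versions, and score is an illustrative helper name, not a Squirro API:

```python
import json
import urllib.request


def score(df, port):
    """POST a DataFrame (pandas "split" orientation) to the MaaS endpoint."""
    req = urllib.request.Request(
        f"http://localhost:{port}/invocations",
        data=df.to_json(orient="split").encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    import pandas as pd

    frame = pd.DataFrame({"id": ["id0"], "text": ["hello world."]})
    print(score(frame, 5000))  # requires a MaaS running on port 5000
```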

Note

  • there is no service orchestration provided at this stage

  • keep an eye on memory and storage consumption, because, among other things:

    • a started model service loads the model into memory and keeps it there

    • a new conda environment is created for every model that has a different conda.yaml file

  • on-premise customers need to manually package their conda environment. This can be done as explained here.

Usage of MaaS

To use the model, you need to create an ML Workflow:

  • example 1: document level

    {
        "dataset": {
            "infer": {
                "count": 10,
                "query_string": "language:en"
            }
        },
        "pipeline": [
            {
                "fields": [
                    "body"
                ],
                "step": "loader",
                "type": "squirro_query"
            },
            {
                "fields": [
                    "body"
                ],
                "step": "filter",
                "type": "empty"
            },
            {
                "input_mapping": {
                    "body":"text"
                },
                "output_mapping": {
                    "class":"keywords.prediction"
                },
                "process_endpoint": "http://localhost:<PORT>/invocations",
                "name": "mlflow_maas",
                "step": "mlflow_maas",
                "type": "mlflow_maas"
            },
            {
                "fields": [
                    "keywords.prediction"
                ],
                "step": "saver",
                "type": "squirro_item"
            }
        ]
    }
  • example 2: sentence level with entity generation

    {
        "dataset": {
            "infer": {
                "count": 10,
                "query_string": "language:en"
            }
        },
        "pipeline": [
            {
                "fields": [
                    "body"
                ],
                "step": "loader",
                "type": "squirro_query"
            },
            {
                "fields": [
                    "body"
                ],
                "step": "filter",
                "type": "empty"
            },
            {
                "input_fields": [
                    "body"
                ],
                "output_fields": [
                    "extract_sentences"
                ],
                "step": "tokenizer",
                "type": "sentences_nltk"
            },
            {
                "fields": [
                    "extract_sentences"
                ],
                "step": "filter",
                "type": "doc_split"
            },
            {
                "input_mapping": {
                    "extract_sentences":"text"
                },
                "output_mapping": {
                    "class":"prediction"
                },
                "process_endpoint": "http://localhost:<PORT>/invocations",
                "name": "mlflow_maas",
                "step": "mlflow_maas",
                "type": "mlflow_maas"
            },
            {
                "fields": [
                    "extract_sentences",
                    "prediction"
                ],
                "step": "filter",
                "type": "doc_join"
            },
            {
                "entity_name_field": "Catalyst",
                "entity_type": "Catalyst",
                "excluded_values": [],
                "extract_field": "extract_sentences",
                "format_values": false,
                "global_property_field_map": {},
                "modes": [
                    "process"
                ],
                "property_field_map": {
                    "Catalyst": [
                        "prediction"
                    ]
                },
                "required_properties": [
                    "Catalyst"
                ],
                "source_field": "body",
                "step": "filter",
                "type": "squirro_entity"
            },
            {
                "fields": [
                    "entities"
                ],
                "step": "saver",
                "type": "squirro_item"
            }
        ]
    } 

These ML Workflows can then be used as inference ML Jobs scheduled at an interval, or as a published model in the enrich pipeline (How-to Publish ML Models Using the Squirro Client).
