Tagging a dataset creates a set of training examples for text classification problems. The goal of the tagging process is to attach a label to as many training examples as possible.
- Label - The correct classification associated with an example. When humans manually add classes to examples, those classes are labels.
- Tag - A classification for an example that is predicted by a model.
The first step in the process is to load the sample data that you want to tag into a Squirro project.
Only one field matters at data-loading time: the abstract.
The widget uses the abstract field as the input text to which the label is attached.
For best results, the length of the examples to be labeled should be kept as short as possible, with the ideal length being 1-2 sentences. In most cases, this set of examples is created by splitting up larger documents into multiple examples, such as by using a pipelet to split on sentences or paragraphs.
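The splitting step can be sketched in plain Python. This is only an illustration of the logic, not a real pipelet: an actual Squirro pipelet would wrap something like this in its consume method, and a production version might use a proper sentence tokenizer rather than a regex.

```python
import re

def split_item(item):
    """Split one large document item into one item per sentence.

    Naive regex split on sentence-ending punctuation; item fields
    follow the structure described above (illustrative only).
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", item["body"])
        if s.strip()
    ]
    examples = []
    for i, sentence in enumerate(sentences):
        examples.append({
            "id": f"{item['id']}-{i}",
            "title": item.get("title", ""),
            # abstract and body are kept identical, as the widget expects
            "abstract": sentence,
            "body": sentence,
        })
    return examples

doc = {
    "id": "doc-1",
    "title": "Example document",
    "body": "First sentence. Second sentence! Third?",
}
examples = split_item(doc)
```

Each resulting example is short enough to label at a glance, which is the point of splitting in the first place.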
Once added, the data set should look roughly like this:
Note: In order to work as expected, the contents of the abstract and body of each item should be the same.
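As a concrete illustration of that note, a single loaded item might look like the following (the field names other than abstract and body are assumptions about your loader mapping):

```python
item = {
    "title": "Quarterly report, page 3",
    # the widget reads the abstract as the text to label,
    # so abstract and body carry the same content
    "abstract": "Revenue grew 4% quarter over quarter.",
    "body": "Revenue grew 4% quarter over quarter.",
}
```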
Setting up a training and inference job
The widget for tagging datasets can be found at https://github.com/squirro/delivery/tree/master/dashboard/widgets/dataset-tagging and is added to a project as a custom widget. Details on how to upload a custom widget can be found here: squirro_asset Command Line Reference#Dashboardwidgets
With data loaded into the project, we can move to setting up the widget and starting to tag examples.
The data labeling widget takes a few config options:
- Facet Name - The facet that stores the labels added by humans.
- Tag Facet Name - The facet that stores the tags predicted by the model.
- Labels to use - The options for the different classes, if they have not already been predicted by a model.
  - For example, if you have the classes "pos" and "neg", fill in this config option with the value "pos,neg".
- Show bulk tagging controls - If selected, a black bar appears at the top of the widget with the option to label the top 10 examples shown with a single click.
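Taken together, the options map onto a configuration like the sketch below. The key names here are illustrative paraphrases of the option labels above, not the widget's actual internal keys; the facet name "label_pred" matches the query examples later in this page.

```python
# Illustrative widget configuration (keys are paraphrased, not literal)
widget_config = {
    "facet_name": "label",                # facet storing human-added labels
    "tag_facet_name": "label_pred",       # facet storing model-predicted tags
    "labels_to_use": "pos,neg",           # comma-separated class options
    "show_bulk_tagging_controls": True,   # show the bulk-labeling bar
}

# The widget splits the comma-separated value into individual classes
classes = widget_config["labels_to_use"].split(",")
```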
If training and inference jobs are already set up, you will see the prediction strength for each class for each example in the project (the darker a class shows up, the more strongly the model predicts that label). At this point, you are ready to start labeling examples and training the model.
It is typically easiest to use the search bar to find good examples to tag first. Any type of search can be used along with the data labeling widget, so there are many clever ways of finding good starting examples for each class, such as:
- A simple keyword search
- Making a smartfilter
- Looking for examples where the current model is already very confident, for example:
  label_pred:CLASS>0.9
- Looking for examples where the current model is totally unsure:
  label_pred:CLASS>0.45 AND label_pred:CLASS<0.55
In general, if you want to tag the examples that will most improve the overall quality of the model, you should look for examples where either:
- The model is very confident, but is incorrect in its prediction
- The model is very unsure about which class the example fits into
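These two selection criteria can be sketched as a simple filter over predicted probabilities. This is a plain-Python illustration, not part of the widget: examples is a hypothetical list of items carrying a predicted probability for one class and, where a human has already labeled them, a label field.

```python
def pick_examples_to_tag(examples, low=0.45, high=0.55, confident=0.9):
    """Return examples worth labeling next.

    Selects items where the model is either near-maximally unsure
    (probability close to 0.5) or confident but contradicted by an
    existing human label.
    """
    to_tag = []
    for ex in examples:
        p = ex["pred"]  # predicted probability for the "pos" class
        unsure = low < p < high
        predicted = "pos" if p >= 0.5 else "neg"
        confidently_wrong = (
            ex.get("label") is not None
            and (p >= confident or p <= 1 - confident)
            and predicted != ex["label"]
        )
        if unsure or confidently_wrong:
            to_tag.append(ex)
    return to_tag

candidates = [
    {"id": 1, "pred": 0.50},                  # very unsure -> selected
    {"id": 2, "pred": 0.95, "label": "neg"},  # confident but wrong -> selected
    {"id": 3, "pred": 0.98, "label": "pos"},  # confident and right -> skipped
]
picked = pick_examples_to_tag(candidates)
```

Labeling these two kinds of examples gives the model the most new information per click, which is the core idea behind active learning.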