KEE Tutorial

This tutorial provides an example of the process for setting up, configuring, and deploying a simple Known Entity Extraction (KEE) project to a Squirro server.

Each KEE project is a tool which analyzes all incoming documents to a Squirro project, and identifies each instance of a known entity (such as a company, product, person, etc.) by tagging the document that contains the known entity with a specific metadata tag. Once known entities are identified, documents can easily be filtered and grouped within a Squirro project based on the known entities that they contain.

Webinar

The example from this tutorial was also shown in the technical partner webinar in February 2016. Please see the KEE Webinar page for a recording.

A Simple Example

As an easy example, let's take a CSV file that includes a list of salespeople employed by a company.

For each salesperson, their name, a unique ID, email address, position, and their manager are provided. The basic layout of the CSV file is shown below:

salespeople.csv

id,name,email,position,manager
1,John Smith,jsmith@company.com,District Manager,David Cole
2,Jane Doe,jdoe@company.com,District Manager,David Cole
3,Adrian Fox,afox@company.com,District Representative,Jane Doe
4,David Cole,dcole@company.com,CEO,

Our goal in this KEE project is to identify a salesperson any time that they are referenced in a document, and tag the document with the name of the salesperson, their position, and the name and position of their manager.
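To make the goal concrete, the following is a minimal sketch in plain Python (it does not use KEE itself) that derives the tags we would expect for a document mentioning a given salesperson. The helper names load_salespeople and expected_tags are hypothetical, and David Cole's row is written here with an empty manager field.

```python
import csv
import io

# Inline copy of salespeople.csv (David Cole has no manager).
SALESPEOPLE_CSV = """\
id,name,email,position,manager
1,John Smith,jsmith@company.com,District Manager,David Cole
2,Jane Doe,jdoe@company.com,District Manager,David Cole
3,Adrian Fox,afox@company.com,District Representative,Jane Doe
4,David Cole,dcole@company.com,CEO,
"""


def load_salespeople(text):
    """Return a dict mapping each salesperson's name to their CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row for row in reader}


def expected_tags(people, name):
    """Tags we want on a document that mentions the salesperson `name`."""
    person = people[name]
    tags = {"name": [person["name"]], "position": [person["position"]]}
    # Look up the manager's own row; people without a manager get no manager tags.
    manager = people.get(person["manager"])
    if manager is not None:
        tags["manager name"] = [manager["name"]]
        tags["manager position"] = [manager["position"]]
    return tags


people = load_salespeople(SALESPEOPLE_CSV)
print(expected_tags(people, "Adrian Fox"))
```

This is only an illustration of the desired outcome; the KEE project configured in the rest of this tutorial is what actually produces these tags on a Squirro server.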

Setting up A KEE Project

To keep things organized, each KEE project is stored in its own folder.

For example, if we also wanted to identify specific products sold by the same company, that second KEE project would be stored in a separate folder of its own.

Before starting, it is advisable to get familiar with the configuration file options so that you can set up KEE in the way that best matches your project.

Setting up the KEE Project folder

For this project, we will do all of our work in a new folder called kee_salespeople. Within this folder we want to create the following content:

salespeople.csv file - This is the file that contains the list of all the salespeople we want to identify.
config.json file - This is the configuration file for the KEE project that describes how we want the KEE project to operate. You can customize the rules for each KEE project and make tweaks to how entities are identified by changing this file.
fixtures/ folder - This contains the test items that we will use to test and tune the KEE project.
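If you prefer to script this step, the layout can be created with a few lines of Python. This is only a convenience sketch: the function name create_kee_project is hypothetical, and the placeholder file contents (an empty JSON object and a CSV header line) are assumptions to be replaced with the real content described in this tutorial.

```python
from pathlib import Path


def create_kee_project(root):
    """Create the KEE project folder layout and return the names it contains."""
    root = Path(root)
    # fixtures/ holds the test items; parents=True also creates the root folder.
    (root / "fixtures").mkdir(parents=True, exist_ok=True)

    # Placeholder files, written only if they do not exist yet.
    config = root / "config.json"
    if not config.exists():
        config.write_text("{}\n", encoding="utf-8")
    entities = root / "salespeople.csv"
    if not entities.exists():
        entities.write_text("id,name,email,position,manager\n", encoding="utf-8")

    return sorted(path.name for path in root.iterdir())
```

Calling create_kee_project("kee_salespeople") from the directory where the project should live creates the folder used in the rest of this tutorial.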

Setting up the initial Configuration

Every KEE project requires a configuration file which describes the specific rules for how the KEE project identifies known entities and interacts with a Squirro server.

To start, we point the KEE configuration to our list of entities by adding a source to the config.json file as shown below:

config.json

{
    "sources": {
        "salespeople": {
            "source_type": "csv",
            "source_file": "salespeople.csv",

            "field_id": "name",
            "field_matching": ["name", "email"],
            "hierarchy": "manager -> name"
        }
    }
}

The configuration above creates a new source of known entities called "salespeople" and loads the CSV file salespeople.csv, which is located in the same folder as the config.json file.

The field_id field identifies the field "name" as being the unique identifier for each entity in the CSV file.

The field_matching field provides a list of all the fields that we want to look for to identify a known entity within a document. In this case, we want to look for references to either the salesperson's name or their email address in the documents in our Squirro project, so we include both of those fields in the list.

The hierarchy field indicates that there is a hierarchy within the entities in the CSV file, where the value in the 'manager' field of one entity points to the name of that entity's parent entity (the person's manager).
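One way to picture how these three settings work together is the sketch below. It is not the KEE implementation, only an illustration under the assumption that each row becomes an entry keyed by field_id, with its matchable names taken from field_matching and its parent resolved through the hierarchy expression. The function name build_entries is hypothetical.

```python
import csv
import io

SOURCE = {
    "field_id": "name",
    "field_matching": ["name", "email"],
    "hierarchy": "manager -> name",
}

# Two rows are enough to show a child entity and its parent.
ROWS = """\
id,name,email,position,manager
2,Jane Doe,jdoe@company.com,District Manager,David Cole
4,David Cole,dcole@company.com,CEO,
"""


def build_entries(source, csv_text):
    """Illustrative reading of field_id, field_matching and hierarchy."""
    # "manager -> name": the child's 'manager' value points at a parent's 'name'.
    child_field, parent_field = [p.strip() for p in source["hierarchy"].split("->")]
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Map each value of the parent-side field to that entity's unique ID.
    parent_ids = {row[parent_field]: row[source["field_id"]] for row in rows}

    entries = {}
    for row in rows:
        entries[row[source["field_id"]]] = {
            "names": [row[field] for field in source["field_matching"]],
            "parent_id": parent_ids.get(row[child_field]),  # None if no parent
        }
    return entries


entries = build_entries(SOURCE, ROWS)
print(entries["Jane Doe"])
```

Here Jane Doe can be matched by either her name or her email address, and her parent entity is David Cole, who has no parent of his own.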

Creating a strategy

Once we have the KEE project pointed to the list of known entities, we want to create our first strategy for recognizing known entities within each document.

We do this by adding an entry to the "strategies" section of the config.json file, as shown below:

config.json

{
    "sources": {
        "salespeople": {
            "source_type": "csv",
            "source_file": "salespeople.csv",

            "field_id": "name",
            "field_matching": ["name", "email"],
            "hierarchy": "manager -> name",
            "strategy": "salesperson_strategy"
        }
    },

    "strategies": {
        "salesperson_strategy": {
            "min_score": 0.9,
            "keywords": [
                "name",
                "position"
            ],
            "parent_keywords": [
                "name -> manager name",
                "position -> manager position"
            ]
        }
    }
}

The code added above creates a strategy called 'salesperson_strategy' for identifying entities and applies it to the source 'salespeople'.

We also set a few basic parameters for this new strategy, such as the minimum score required to produce a match, which we set to 0.9 (make sure that it's a number and not in quotes).

Finally, we set the keyword items to tag each match with the name and position of both the matching entity (the salesperson) and the matching parent entity (the salesperson's manager). In this case, we tag each document with the name and position of the matching entity in facets called 'name' and 'position' respectively, while we tag each document with the name and position of the matching parent entity in facets called 'manager name' and 'manager position' respectively. By default, each keyword item must exactly match a column name from the header of the data source. To name the keyword items differently, please see the section "Keywords" in the configuration file documentation.
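The mapping from keyword items to facets can be sketched as follows. This is an illustration rather than the KEE code: parse_keyword_spec and build_tags are hypothetical helpers, and treating a bare field name as "field -> field" is an assumption based on the parent_keywords syntax shown above.

```python
STRATEGY = {
    "keywords": ["name", "position"],
    "parent_keywords": ["name -> manager name", "position -> manager position"],
}


def parse_keyword_spec(spec):
    """Split 'field -> facet' into (field, facet); a bare field maps to itself."""
    if "->" in spec:
        field, facet = (part.strip() for part in spec.split("->", 1))
        return field, facet
    return spec.strip(), spec.strip()


def build_tags(strategy, entity, parent=None):
    """Collect facet values from the matched entity and, if present, its parent."""
    tags = {}
    for spec in strategy.get("keywords", []):
        field, facet = parse_keyword_spec(spec)
        tags.setdefault(facet, []).append(entity[field])
    if parent is not None:
        for spec in strategy.get("parent_keywords", []):
            field, facet = parse_keyword_spec(spec)
            tags.setdefault(facet, []).append(parent[field])
    return tags


adrian = {"name": "Adrian Fox", "position": "District Representative"}
jane = {"name": "Jane Doe", "position": "District Manager"}
print(build_tags(STRATEGY, adrian, parent=jane))
```

For a document that matches Adrian Fox, this yields the four facets described above: 'name', 'position', 'manager name', and 'manager position'.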

Testing a KEE Project

Once we have our list of entities to extract and our initial configuration file, we can begin testing the KEE project and adjusting the settings for our specific use case.

Compiling the lookup database

The first step in testing our KEE project is to compile our lookup database for the entities using the KEE command line tool. This is accomplished by running the following command (with the present working directory set to the folder for our KEE project):

>> squirro_kee compile

After running this command, a db/ folder is created within our KEE project. This folder contains the lookup database (lookup.json) used to identify known entities. If we take a look at this file, we will see a lookup entry for each of the entities in the CSV file, as well as the details for each entity. For example:

Lookup Database

{
  "lookup": {
    "[\"default\", [\"default\"], false]": {
      "jdoe": [
        "Jane Doe"
        ...

  "entries": {
    "Jane Doe": {
      "parent_id": "David Cole",
      "data": {
        "email": "jdoe@company.com",
        "manager": "David Cole",
        "position": "District Manager",
        "id": "2",
        "name": "Jane Doe"
      },
      "names": [
        "Jane Doe",
        "jdoe@company.com"
      ],
      "strategy": "salesperson_strategy"
    ...

It is important to remember that any time we make changes to the config.json file, we have to rerun squirro_kee compile for those changes to take effect.
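After compiling, a quick way to check that the hierarchy was picked up is to list each entry together with its parent. The sketch below assumes only the keys visible in the excerpt above ("entries", "parent_id", "names"); the real file may contain more, and summarize_entries is a hypothetical helper. The sample data stands in for the parsed contents of db/lookup.json.

```python
import json

# Trimmed stand-in for the parsed contents of db/lookup.json.
SAMPLE_LOOKUP_DB = json.loads("""
{
  "entries": {
    "Jane Doe": {
      "parent_id": "David Cole",
      "names": ["Jane Doe", "jdoe@company.com"],
      "strategy": "salesperson_strategy"
    },
    "David Cole": {
      "names": ["David Cole", "dcole@company.com"],
      "strategy": "salesperson_strategy"
    }
  }
}
""")


def summarize_entries(lookup_db):
    """Return (entity_id, parent_id, names) tuples, sorted by entity ID."""
    summary = []
    for entity_id, entry in sorted(lookup_db.get("entries", {}).items()):
        summary.append((entity_id, entry.get("parent_id"), entry.get("names", [])))
    return summary


for entity_id, parent_id, names in summarize_entries(SAMPLE_LOOKUP_DB):
    print(entity_id, "| parent:", parent_id, "| matches on:", names)
```

To run this against a real project, replace the sample with the result of json.load on the db/lookup.json file.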

Creating Fixtures

Once we have our lookup database compiled, we can begin testing the KEE project by running example Squirro items, or "fixtures", through it. Generally, we store the fixtures that we use for testing in a folder within the KEE project called fixtures/. Each file within the fixtures folder is a JSON document which includes an example Squirro item that we want to use to test the KEE, and a list of all the tags that we expect to be added to that item as a result of the KEE process.

An example of a fixture is shown below:

Example Fixture

{
    "item": {
        "body": "I spoke with Adrian Fox on the phone earlier this morning..."
    },
    "keywords": {
        "name": ["Adrian Fox"],
        "position": ["District Representative"],
        "manager name": ["Jane Doe"],
        "manager position": ["District Manager"]
    }
}

In this fixture, the item field includes the data for the Squirro item that we want to use to test the KEE strategy that we have created.

The keywords field includes all of the tags that we expect to be added to the document as a result of the KEE. In this case, we expect the KEE to correctly identify the salesperson "Adrian Fox" within the document, and tag it with the salesperson's name and position, as well as the name and position of their manager.

To make a fixture, you can either create it yourself using your favorite text editor, or you can use the kee tool to create one from an existing Squirro item.
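For hand-written fixtures, generating the JSON from Python avoids quoting mistakes. The sketch below builds the example fixture shown above; make_fixture is a hypothetical helper, and the file name fixtures/demo.json mentioned afterwards is an assumption.

```python
import json


def make_fixture(body, keywords):
    """Build a fixture: the item to test plus the tags we expect KEE to add."""
    return {"item": {"body": body}, "keywords": keywords}


fixture = make_fixture(
    "I spoke with Adrian Fox on the phone earlier this morning...",
    {
        "name": ["Adrian Fox"],
        "position": ["District Representative"],
        "manager name": ["Jane Doe"],
        "manager position": ["District Manager"],
    },
)

# Serialize with indentation so the file stays easy to read and edit by hand.
fixture_json = json.dumps(fixture, indent=4)
print(fixture_json)
```

Saving fixture_json to a file inside the fixtures/ folder (for example fixtures/demo.json) makes it available for testing.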

Adding Squirro Server Configuration

In order for the kee tool to be able to create fixtures from a Squirro project, the squirro section must be added to the config.json file, as shown below:

config.json

{
    "squirro": {
        "cluster": "http://www.example.com/squirro/",
        "token": "abcabcexampletoken123123123",
        "project_id": "example_project"
    },
        
    "sources": {
        "salespeople": {
...

This will provide the kee tool with everything that it needs to authenticate with the Squirro server and download individual items to create fixtures. Please see Connecting to Squirro for more information on how to get this information.
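Before running commands that talk to the server, it can help to check that none of these settings was left out. The following sketch assumes that the three keys shown in the example (cluster, token, project_id) are all required; missing_squirro_settings is a hypothetical helper and not part of the kee tool.

```python
# Keys taken from the example squirro section; assumed to be required.
REQUIRED_SQUIRRO_KEYS = ("cluster", "token", "project_id")


def missing_squirro_settings(config):
    """Return the squirro settings that are absent or empty in a parsed config."""
    section = config.get("squirro", {})
    return [key for key in REQUIRED_SQUIRRO_KEYS if not section.get(key)]


example_config = {
    "squirro": {
        "cluster": "http://www.example.com/squirro/",
        "token": "abcabcexampletoken123123123",
        "project_id": "example_project",
    },
    "sources": {},
}

print(missing_squirro_settings(example_config))
```

An empty list means every expected setting is present. In practice the config dictionary would come from parsing config.json with json.load.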

Fixtures are created this way by running the squirro_kee get_fixture command:

>> squirro_kee get_fixture 'pRVNr9H7QJG_UXhwHjQH3A' '2aEdt4H0R7uwVfbScYadpA'

The above command creates a fixture for each of the two Squirro items identified by the unique IDs passed as arguments. The unique ID of a Squirro item can be found at the end of the URL shown in the browser when the item is selected on the search page.

Testing with Fixtures

Once we have a fixture created, we can use the kee command line tool to test the KEE extraction on the fixture. To test a KEE project using the set of fixtures within that KEE project folder, we run the command:

>> squirro_kee -v test

This will produce a basic summary output that shows which tags (keywords) were added to each fixture by the KEE. For our example KEE project, running this command produces:

>> squirro_kee -v test
 
- Running fixture demo
  -    4 (100%) correct results:
        [u'Adrian Fox', u'District Manager', u'District Representative', u'Jane Doe']
  -    0 (  0%) missed results: []
  -    0 (  0%) extra results: []
- Processed 1 fixtures
  -    4 (100%) correct results
  -    0 (  0%) missed results
  -    0 (  0%) extra results

This result shows us that the KEE worked as we intended it to, and that all of the tags that we expected to find were correctly added by the KEE.
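The three categories in this summary can be understood as simple set comparisons between the tag values a fixture expects and the values the extraction actually produced. The sketch below illustrates that idea only; it is an assumption about how the numbers relate and not the kee tool's implementation, and flatten and compare are hypothetical helpers.

```python
def flatten(keywords):
    """Collect all tag values from a {facet: [values]} mapping into one set."""
    return {value for values in keywords.values() for value in values}


def compare(expected, actual):
    """Classify tag values as correct, missed, or extra."""
    exp, act = flatten(expected), flatten(actual)
    return {
        "correct": sorted(exp & act),  # expected and found
        "missed": sorted(exp - act),   # expected but not found
        "extra": sorted(act - exp),    # found but not expected
    }


expected = {
    "name": ["Adrian Fox"],
    "position": ["District Representative"],
    "manager name": ["Jane Doe"],
    "manager position": ["District Manager"],
}

# A hypothetical run in which the manager's position was not tagged.
actual = {
    "name": ["Adrian Fox"],
    "position": ["District Representative"],
    "manager name": ["Jane Doe"],
}

print(compare(expected, expected))
print(compare(expected, actual))
```

In the second comparison, 'District Manager' shows up as a missed result, which is the kind of output that signals the configuration needs adjusting.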

If testing the KEE produces results which are different from what is expected, adjustments can be made to the config.json file to improve the results for each specific use case by modifying the way that KEE works. More information on this process can be found in the KEE Testing Documentation.

Deploying a KEE Project

Once a KEE project has been tested and produces the desired results, the KEE project can be uploaded to a remote Squirro server to be used for enriching all incoming data.

Similar to creating fixtures from remote Squirro items, deploying a KEE project to a Squirro server requires that the squirro section is present within the config.json file. This section includes the information necessary to successfully authenticate with the Squirro server.

To upload a KEE project to a Squirro server, use the squirro_kee upload command:

>> squirro_kee upload

After the KEE project is uploaded to a Squirro server, the KEE will be available as an enrichment under the Enrich tab of the Squirro frontend. Each uploaded KEE project requires a unique name which can be customized within the kee section of the config.json file, as shown below.

config.json

{
    "kee": {
        "pipelet": "Salesperson Extraction"
    },

    "squirro": {
...

In this example, the KEE project will be available as a pipelet with the title "Salesperson Extraction" once it is uploaded.

Use other Data Sources

This tutorial uses a CSV data source. You can use any other data source which the Data Loader can connect to. For example, to connect directly to Salesforce.com, use the /wiki/spaces/DOWN/pages/54624264 (license required) and then change the source section to something like this:

config.json

{
    "sources": {
        "salespeople": {
            "source_script": "salesforce/salesforce.py",
            "salesforce_user": "…",
            "salesforce_password": "…",
            "salesforce_token": "…",
            "salesforce_query": "SELECT Id,Name,Email,ManagerId FROM User WHERE IsActive=true",

            "field_id": "Id",
            "field_matching": ["Name", "Email"],
            "hierarchy": "ManagerId -> Id"
        }
    }
}

Download this Example KEE Project

The full source for this example is available for download: tutorial.zip.
