...

A KEE project analyzes all documents coming into a Squirro project and identifies each occurrence of a known entity (such as a company, product, or person), tagging the document that contains it with a specific metadata tag. Once known entities are identified, documents in the Squirro project can easily be filtered and grouped by the entities they contain.


Webinar

The example from this tutorial was also shown in the technical partner webinar in February 2016. Please see the KEE Webinar page for a recording.

A Simple Example

As an easy example, let's take a CSV file that includes a list of salespeople employed by a company.

For each salesperson, their name, a unique ID, email address, position, and their manager are provided. The basic layout of the CSV file is shown below:

salespeople.csv
id,name,email,position,manager
1,John Smith,jsmith@company.com,District Manager,David Cole
2,Jane Doe,jdoe@company.com,District Manager,David Cole
3,Adrian Fox,afox@company.com,District Representative,Jane Doe
4,David Cole,dcole@company.com,CEO,
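
To make the manager column concrete, the following minimal Python sketch parses the sample rows with the standard csv module and walks up the reporting chain for one salesperson. The helpers load_rows and manager_chain are hypothetical and not part of the kee tool. The David Cole row is written here with an empty manager field, which is an assumption about how a top-level entry would be represented.

```python
import csv
import io

# Sample data from this tutorial; the empty last field on row 4 is an assumption.
SAMPLE = """\
id,name,email,position,manager
1,John Smith,jsmith@company.com,District Manager,David Cole
2,Jane Doe,jdoe@company.com,District Manager,David Cole
3,Adrian Fox,afox@company.com,District Representative,Jane Doe
4,David Cole,dcole@company.com,CEO,
"""


def load_rows(text):
    """Parse CSV text into a dict of rows keyed by the name column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row for row in reader}


def manager_chain(rows, name):
    """Return the managers above `name`, nearest first."""
    chain = []
    seen = {name}  # guards against accidental cycles in the data
    current = rows[name]["manager"]
    while current and current in rows and current not in seen:
        chain.append(current)
        seen.add(current)
        current = rows[current]["manager"]
    return chain


rows = load_rows(SAMPLE)
print(manager_chain(rows, "Adrian Fox"))  # ['Jane Doe', 'David Cole']
```

Running a check like this before building the KEE project is a cheap way to spot rows whose manager does not appear in the name column.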

...

For this project, we will do all of our work in a new folder called kee_salespeople. Within this folder we want to create the following content:

  • salespeople.csv file - This is the file that contains the list of all the salespeople we want to identify. 

  • config.json file - This is the configuration file for the KEE project that describes how we want the KEE project to operate. You can customize the rules for each KEE project and make tweaks to how entities are identified by changing this file.

  • fixtures/ folder - This contains the test items that we will use to configure the KEE project.

Setting Up the Initial Configuration

...

To start, we point the KEE configuration to our list of entities by adding a source to the config.json file as shown below:

config.json
{
    "sources": {
        "salespeople": {
            "source_type": "csv",
            "source_file": "salespeople.csv",

            "field_id": "name",
            "field_matching": ["name", "email"],
            "hierarchy": "manager -> name"
        }
    }
}
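
As a rough illustration of how these three fields could be read, the sketch below turns CSV rows into entries keyed by field_id, uses the field_matching values as the names to look for, and resolves the hierarchy expression into a parent ID. This is only an interpretation of the configuration, not the kee tool's actual implementation. The function build_entries and the output keys parent_id and names are chosen for illustration.

```python
import csv
import io

SOURCE = {
    "field_id": "name",
    "field_matching": ["name", "email"],
    "hierarchy": "manager -> name",
}

# Two rows are enough to show a child and its parent.
CSV_TEXT = """\
id,name,email,position,manager
2,Jane Doe,jdoe@company.com,District Manager,David Cole
4,David Cole,dcole@company.com,CEO,
"""


def build_entries(source, rows):
    """Build one entry per row, keyed by the configured ID field."""
    # "manager -> name": the child's `manager` value refers to the parent's `name`.
    child_field, parent_field = [p.strip() for p in source["hierarchy"].split("->")]
    rows = list(rows)
    # Index rows by the field the hierarchy points at, so parents can be resolved.
    by_parent_field = {row[parent_field]: row for row in rows}
    entries = {}
    for row in rows:
        parent = by_parent_field.get(row[child_field])
        entries[row[source["field_id"]]] = {
            "parent_id": parent[source["field_id"]] if parent else None,
            "names": [row[field] for field in source["field_matching"]],
        }
    return entries


entries = build_entries(SOURCE, csv.DictReader(io.StringIO(CSV_TEXT)))
print(entries["Jane Doe"])
```

With field_id set to name, the ID of an entity and the value its reports refer to are the same column; a source whose hierarchy points at a different column would resolve parents through that column instead.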

...

We do this by adding an entry to the "strategies" section of the config.json file, as shown below:

config.json
{
    "sources": {
        "salespeople": {
            "source_type": "csv",
            "source_file": "salespeople.csv",

            "field_id": "name",
            "field_matching": ["name", "email"],
            "hierarchy": "manager -> name",
            "strategy": "salesperson_strategy"
        }
    },

    "strategies": {
        "salesperson_strategy": {
            "min_score": 0.9,
            "keywords": [
                "name",
                "position"
            ],
            "parent_keywords": [
                "name -> manager name",
                "position -> manager position"
            ]
        }
    }
}
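
Because a source refers to its strategy by name, a typo in either place is easy to miss. The following sketch uses only Python's json module to check that every strategy named under sources is defined under strategies. The helper check_strategies is hypothetical and not a kee command. The JSON is parsed strictly here, so trailing commas are rejected; whether the kee tool itself tolerates them is not confirmed.

```python
import json

CONFIG_TEXT = """
{
    "sources": {
        "salespeople": {
            "source_type": "csv",
            "source_file": "salespeople.csv",
            "field_id": "name",
            "field_matching": ["name", "email"],
            "hierarchy": "manager -> name",
            "strategy": "salesperson_strategy"
        }
    },
    "strategies": {
        "salesperson_strategy": {
            "min_score": 0.9,
            "keywords": ["name", "position"],
            "parent_keywords": [
                "name -> manager name",
                "position -> manager position"
            ]
        }
    }
}
"""


def check_strategies(config):
    """Return (source, strategy) pairs whose strategy is not defined."""
    defined = set(config.get("strategies", {}))
    problems = []
    for source_name, source in config.get("sources", {}).items():
        strategy = source.get("strategy")
        if strategy is not None and strategy not in defined:
            problems.append((source_name, strategy))
    return problems


config = json.loads(CONFIG_TEXT)
print(check_strategies(config))  # an empty list means every reference resolves
```

To check a real project, the same function can be applied to the result of json.load on the config.json file.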

...

After running this command, a db/ folder will be created within our KEE project. This folder contains the lookup database (lookup.json) used to identify known entities. If we take a look at this file, we will see a lookup entry for each entity in the CSV file, as well as the details of each entity. For example:

Lookup Database
{
  "lookup": {
    "[\"default\", [\"default\"], false]": {
      "jdoe": [
        "Jane Doe"
        ...

  "entries": {
    "Jane Doe": {
      "parent_id": "David Cole",
      "data": {
        "email": "jdoe@company.com",
        "manager": "David Cole",
        "position": "District Manager",
        "id": "2",
        "name": "Jane Doe"
      },
      "names": [
        "Jane Doe",
        "jdoe@company.com"
      ],
      "strategy": "salesperson_strategy"
    ...
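
To inspect the generated database without reading the raw JSON, something like the following sketch can help. It groups entities by their parent_id and relies only on the entries structure shown in the excerpt above; nothing else about the file layout is assumed. The helper children_by_parent is hypothetical, and the inline sample stands in for the contents of db/lookup.json (the John Smith entry is inferred from the CSV file, not copied from real output).

```python
import json
from collections import defaultdict

# Trimmed stand-in for db/lookup.json, limited to fields shown in the excerpt.
SAMPLE = json.loads("""
{
  "entries": {
    "Jane Doe": {
      "parent_id": "David Cole",
      "names": ["Jane Doe", "jdoe@company.com"],
      "strategy": "salesperson_strategy"
    },
    "John Smith": {
      "parent_id": "David Cole",
      "names": ["John Smith", "jsmith@company.com"],
      "strategy": "salesperson_strategy"
    }
  }
}
""")


def children_by_parent(lookup_db):
    """Map each parent ID to the sorted list of entity IDs that report to it."""
    grouped = defaultdict(list)
    for entity_id, entry in lookup_db["entries"].items():
        parent = entry.get("parent_id")
        if parent:
            grouped[parent].append(entity_id)
    return {parent: sorted(children) for parent, children in grouped.items()}


print(children_by_parent(SAMPLE))  # {'David Cole': ['Jane Doe', 'John Smith']}
```

For a real project, the sample can be replaced with the result of json.load on db/lookup.json.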

...

An example of a fixture is shown below:

Example Fixture
{
    "item": {
        "body": "I spoke with Adrian Fox on the phone earlier this morning..."
    },
    "keywords": {
        "name": ["Adrian Fox"],
        "position": ["District Representative"],
        "manager name": ["Jane Doe", "David Cole"],
        "manager position": ["District Manager", "CEO"]
    }
}
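
The keywords in this fixture line up with the strategy configuration: name and position come from the matched entity, while the manager name and manager position values appear to be collected from every ancestor in the hierarchy (both Jane Doe and David Cole). The sketch below reproduces that mapping in plain Python as a way to reason about what a fixture should contain. It is an interpretation of the example, not the kee tool's extraction logic, and expected_keywords is a hypothetical helper.

```python
STRATEGY = {
    "keywords": ["name", "position"],
    "parent_keywords": ["name -> manager name", "position -> manager position"],
}

# Entity data taken from salespeople.csv, keyed by name.
PEOPLE = {
    "Adrian Fox": {"name": "Adrian Fox", "position": "District Representative", "manager": "Jane Doe"},
    "Jane Doe": {"name": "Jane Doe", "position": "District Manager", "manager": "David Cole"},
    "David Cole": {"name": "David Cole", "position": "CEO", "manager": ""},
}


def expected_keywords(strategy, people, entity_id):
    """Derive the keywords a fixture for `entity_id` would be expected to list."""
    entity = people[entity_id]
    keywords = {field: [entity[field]] for field in strategy["keywords"]}

    # Collect ancestors, nearest first, stopping at the top or on a cycle.
    ancestors = []
    seen = {entity_id}
    parent_id = entity["manager"]
    while parent_id in people and parent_id not in seen:
        ancestors.append(people[parent_id])
        seen.add(parent_id)
        parent_id = people[parent_id]["manager"]

    # "name -> manager name": copy each ancestor's `name` into `manager name`.
    for rule in strategy["parent_keywords"]:
        field, keyword = [part.strip() for part in rule.split("->")]
        keywords[keyword] = [ancestor[field] for ancestor in ancestors]
    return keywords


print(expected_keywords(STRATEGY, PEOPLE, "Adrian Fox"))
```

For Adrian Fox this yields the same four keyword lists as the fixture above, which makes it a convenient cross-check when writing fixtures by hand.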

...

For the kee tool to create fixtures from a Squirro project, a squirro section must be added to the config.json file, as shown below:

config.json
{
    "squirro": {
        "cluster": "http://www.example.com/squirro/",
        "token": "abcabcexampletoken123123123",
        "project_id": "example_project"
    },

    "sources": {
        "salespeople": {
...
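
Before running commands that talk to the server, it can be useful to confirm that the squirro section is complete. The sketch below checks for the three keys used in the example above and that cluster looks like an http(s) URL. Treating all three keys as required is an assumption based on this example, and squirro_section_problems is a hypothetical helper rather than part of the kee tool.

```python
from urllib.parse import urlparse

# Keys taken from the example above; assumed to be required.
REQUIRED_KEYS = ("cluster", "token", "project_id")


def squirro_section_problems(config):
    """Return a list of human-readable problems with the squirro section."""
    section = config.get("squirro")
    if section is None:
        return ["missing 'squirro' section"]
    problems = [
        f"missing or empty '{key}'" for key in REQUIRED_KEYS if not section.get(key)
    ]
    cluster = section.get("cluster", "")
    if cluster and urlparse(cluster).scheme not in ("http", "https"):
        problems.append("'cluster' is not an http(s) URL")
    return problems


config = {
    "squirro": {
        "cluster": "http://www.example.com/squirro/",
        "token": "abcabcexampletoken123123123",
        "project_id": "example_project",
    }
}
print(squirro_section_problems(config))  # []
```

The check is purely local and does not verify that the token is accepted by the server.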

...

The above command creates a fixture for each of the two Squirro items identified by the unique IDs in the list. The unique ID of a Squirro item can be found at the end of the URL that appears in the browser when the item is selected on the search page (see screenshot below).
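
When collecting many item IDs, copying them by hand is error-prone. The small sketch below takes whatever follows the last slash of a URL, which matches the description of the ID sitting at the end of the URL. The URL shown is made up for illustration, real Squirro URLs may be shaped differently, and item_id_from_url is a hypothetical helper.

```python
def item_id_from_url(url):
    """Return the last non-empty '/'-separated piece of the URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical URL; only the "ID at the end" property is taken from the tutorial.
url = "http://www.example.com/squirro/app/#search/item/AbC123exampleItemId"
print(item_id_from_url(url))  # AbC123exampleItemId
```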

...

Testing with Fixtures

Once a fixture has been created, we can use the kee command line tool to test the KEE extraction against it. To test a KEE project using the set of fixtures within that KEE project folder, we run the command:

...

Similar to creating fixtures from remote Squirro items, deploying a KEE project to a Squirro server requires that the squirro section is present within the config.json file. This section includes the information necessary to successfully authenticate with the Squirro server.

...

After the KEE project is uploaded to a Squirro server, the KEE will be available as an enrichment under the Enrich tab of the Squirro frontend. Each uploaded KEE project requires a unique name, which can be customized within the kee section of the config.json file, as shown below.

config.json
{
    "kee": {
        "pipelet": "Salesperson Extraction"
    },

    "squirro": {
...

...

This tutorial uses a CSV data source, but you can use any other data source that the Data Loader can connect to. For example, to connect directly to Salesforce.com, use the /wiki/spaces/DOWN/pages/54624264 (license required) and then change the sources section to something like this:

config.json
{
    "sources": {
        "salespeople": {
            "source_script": "salesforce/salesforce.py",
            "salesforce_user": "…",
            "salesforce_password": "…",
            "salesforce_token": "…",
            "salesforce_query": "SELECT Id,Name,Email,ManagerId FROM User WHERE IsActive=true",

            "field_id": "Id",
            "field_matching": ["Name", "Email"],
            "hierarchy": "ManagerId -> Id"
        }
    }
}

...