Introduction
In this brief tutorial, we will go through the process of setting up, configuring, and deploying a simple Known Entity Extraction (KEE) project into a Squirro project.
Each KEE project analyzes every document coming into a Squirro project and identifies each instance of a known entity (such as a company, product, or person) by tagging the document that contains it with a specific metadata tag. Once known entities are identified, documents can easily be filtered and grouped within a Squirro project based on the known entities they contain.
A more detailed overview of KEE as a whole can be found here: LINK
A simple example
As an easy example, let's take a CSV file that includes a list of salespeople employed by a company.
For each salesperson, their name, a unique ID, email address, position, and their manager are provided. The basic layout of the CSV file is shown below:
    id, name, email, position, manager
    1, John Smith, jsmith@company.com, District Manager, David Cole
    2, Jane Doe, jdoe@company.com, District Manager, David Cole
    3, Adrian Fox, afox@company.com, District Representative, Jane Doe
    ...
Our goal in this KEE project is to identify a salesperson any time that they are referenced in a document, and tag the document with the name of the salesperson, their position, and the name of their manager.
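To get a feel for the data, the sample rows can be loaded with Python's standard csv module. This is only a minimal sketch: the inline string stands in for salespeople.csv, and it assumes the file really has a space after each comma, as in the layout shown above (hence skipinitialspace=True).

```python
import csv
import io

# Inline stand-in for salespeople.csv, using the three sample rows shown above.
SAMPLE_CSV = """id, name, email, position, manager
1, John Smith, jsmith@company.com, District Manager, David Cole
2, Jane Doe, jdoe@company.com, District Manager, David Cole
3, Adrian Fox, afox@company.com, District Representative, Jane Doe
"""


def load_salespeople(handle):
    """Return the CSV rows as a list of dicts keyed by column name."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return list(reader)


rows = load_salespeople(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["name"], "reports to", row["manager"])
```

To read the real file instead, pass an open file handle (opened with newline="") in place of the io.StringIO object.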
Setting up a KEE project
To keep things organized, each KEE project is stored in its own folder.
For example, if we had a separate KEE project for identifying specific products sold by the same company, that project would live in a second folder of its own.
Setting up the KEE project folder
For this project, we will do all of our work in a new folder called kee_salespeople.
Within this folder we want to create the following content:
salespeople.csv file - This is the file that contains the list of all the salespeople we want to identify.
config.json file - This is the configuration file for the KEE project that describes how we want the KEE project to operate. You can customize the rules for each KEE project and make tweaks to how entities are identified by changing this file.
fixtures/ folder - This contains the test items that we will use to configure the KEE project.
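The layout can be created by hand, or with a few lines of Python. The sketch below is one possible way to do it; the helper name create_kee_project is hypothetical and not part of the KEE tooling. It only creates empty placeholders, which are filled in during the following steps.

```python
from pathlib import Path


def create_kee_project(root):
    """Create an empty KEE project skeleton under `root` and return its path."""
    project = Path(root)
    # fixtures/ holds the test items; parents=True also creates the project folder.
    (project / "fixtures").mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; their real content is added in later steps.
    for name in ("salespeople.csv", "config.json"):
        (project / name).touch(exist_ok=True)
    return project


project = create_kee_project("kee_salespeople")
print(sorted(p.name for p in project.iterdir()))
```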
Setting up the initial configuration
The full reference for the KEE project configuration file can be found here: LINK
To start, we point the KEE configuration to our list of entities by adding a source to the config.json file, as shown below:
    {
        "sources": {
            "salespeople": {
                "dsn": "csv:///salespeople.csv",
                "field_id": "name",
                "field_matching": ["name", "email"],
                "hierarchy": "manager -> name"
            }
        }
    }
The above configuration creates a new source of known entities called "salespeople". For this source, the data source name ("dsn") points to the CSV file salespeople.csv, which is located in the same folder as the config.json file.
The field_id field identifies the field "name" as the unique identifier for each entity in the CSV file.
The field_matching field lists all the fields that we want to look for to identify a known entity within a document. In this case, we want to look for references to either the salesperson's name or their email address in the documents in our Squirro project, so we include both of those fields in the list.
The hierarchy field indicates that there is a hierarchy among the entities in the CSV file, where the value in the "manager" field of one entity points to the name of that entity's parent entity (the person's manager).
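The effect of "manager -> name" can be pictured as a lookup from each entity's manager value to the entity whose name matches it. The following is a rough illustration in Python, not the KEE tool's actual implementation; the function name resolve_parents and its parameters are hypothetical.

```python
def resolve_parents(entities, child_field="manager", parent_field="name"):
    """Map each entity's name to its parent entity (or None if it is not listed)."""
    # Index the entities by the field that the hierarchy points at.
    by_parent_key = {entity[parent_field]: entity for entity in entities}
    return {
        entity["name"]: by_parent_key.get(entity[child_field])
        for entity in entities
    }


# The three sample rows from the CSV layout, reduced to the relevant fields.
entities = [
    {"name": "John Smith", "position": "District Manager", "manager": "David Cole"},
    {"name": "Jane Doe", "position": "District Manager", "manager": "David Cole"},
    {"name": "Adrian Fox", "position": "District Representative", "manager": "Jane Doe"},
]

parents = resolve_parents(entities)
print(parents["Adrian Fox"]["name"])  # Jane Doe
print(parents["Jane Doe"])            # None: David Cole is not among the three sample rows
```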
Creating a strategy
Once we have the KEE project pointed to the list of known entities, we want to create our first strategy for recognizing known entities within each document.
We do this by adding a "strategy" entry to our source and a new "strategies" section to the config.json file, as shown below:
    {
        "sources": {
            "salespeople": {
                "dsn": "csv:///salespeople.csv",
                "field_id": "name",
                "field_matching": ["name", "email"],
                "hierarchy": "manager -> name",
                "strategy": "salesperson_strategy"
            }
        },
        "strategies": {
            "salesperson_strategy": {
                "min_score": "0.8",
                "keywords": [
                    "name",
                    "position"
                ],
                "parent_keywords": [
                    "name -> manager name",
                    "position -> manager position"
                ]
            }
        }
    }
The code added above creates a strategy called 'salesperson_strategy' for identifying entities and applies it to the source 'salespeople'.
We also set a few basic parameters for this new strategy, such as the minimum score required to produce a match, which we set to 0.8.
Finally, we define the keywords used to tag each match, both for the matching entity (the salesperson) and for its parent entity (the salesperson's manager). The name and position of the matching entity go into facets called 'name' and 'position', while the name and position of the parent entity go into facets called 'manager name' and 'manager position'.
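To make the keyword mapping concrete, here is a small Python sketch of how the "keywords" and "parent_keywords" settings could translate into tags for one match. It reflects only the behavior described in this tutorial (a plain keyword reuses the field name as the facet, and "field -> facet" reads the field from the parent entity). The function build_tags is hypothetical and the real tool may work differently.

```python
def build_tags(entity, parent, keywords, parent_keywords):
    """Return the facet values that a match on `entity` would add to a document."""
    tags = {}
    # Plain keywords: the facet has the same name as the entity field.
    for field in keywords:
        tags[field] = [entity[field]]
    # Parent keywords use "field -> facet" and read the field from the parent entity.
    if parent is not None:
        for rule in parent_keywords:
            field, facet = (part.strip() for part in rule.split("->"))
            tags[facet] = [parent[field]]
    return tags


adrian = {"name": "Adrian Fox", "position": "District Representative"}
jane = {"name": "Jane Doe", "position": "District Manager"}

tags = build_tags(
    adrian,
    jane,
    keywords=["name", "position"],
    parent_keywords=["name -> manager name", "position -> manager position"],
)
print(tags)
```

For a document that mentions Adrian Fox, this produces the four facets 'name', 'position', 'manager name', and 'manager position'.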
Testing a KEE project
Once we have our list of entities to extract and our initial configuration file, we can begin testing the KEE project and adjusting the settings for our specific use case.
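Before testing, it can be worth confirming that config.json parses as JSON and that every source refers to a strategy that is actually defined, since a missing comma or a misspelled strategy name is easy to introduce when editing by hand. The check below is an optional sketch and not part of the KEE command line tool; the function name check_config is hypothetical, and it only looks at the "sources", "strategy", and "strategies" keys used in this tutorial.

```python
import json


def check_config(text):
    """Return a list of problems found in a config.json string (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno}, column {err.colno})"]
    problems = []
    strategies = config.get("strategies", {})
    for name, source in config.get("sources", {}).items():
        strategy = source.get("strategy")
        # A source that names a strategy should find it under "strategies".
        if strategy is not None and strategy not in strategies:
            problems.append(f"source {name!r} uses undefined strategy {strategy!r}")
    return problems


good = """
{
    "sources": {"salespeople": {"dsn": "csv:///salespeople.csv",
                                "strategy": "salesperson_strategy"}},
    "strategies": {"salesperson_strategy": {"min_score": "0.8"}}
}
"""
# Same source, but with the comma after the "dsn" entry left out.
bad = '{"sources": {"salespeople": {"dsn": "csv:///salespeople.csv" "field_id": "name"}}}'

print(check_config(good))
print(check_config(bad))
```

In practice the text would come from reading the project's config.json file rather than from an inline string.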
Compiling the lookup database
The first step in testing our KEE project is to compile our lookup database for the entities using the KEE command line tool. This is accomplished by running the following command (with the present working directory set to the folder for our KEE project):
kee compile
After running this command, a db/ folder will be created within our KEE project. This folder includes the lookup database (lookup.json) used to identify known entities. If we take a look at this file, we will see a lookup entry for each of the entities within the CSV file, as well as the details for each entity. For example:
    {
        "lookup": {
            "[\"default\", [\"default\"], false]": {
                "jdoe": [
                    "Jane Doe"
    ...
        "entries": {
            "Jane Doe": {
                "parent_id": "David Cole",
                "data": {
                    "email": "jdoe@company.com",
                    "manager": "David Cole",
                    "position": "District Manager",
                    "id": "2",
                    "name": "Jane Doe"
                },
                "names": [
                    "Jane Doe",
                    "jdoe@company.com"
                ],
                "strategy": "salesperson_strategy"
    ...
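The "names" list of each entry holds the strings that identify the entity in a document. As a deliberately naive illustration of that idea, the Python sketch below does a case-insensitive substring search over a hand-written subset of the entries. It is not how the KEE tool matches: it ignores the "lookup" index, scoring, and "min_score" entirely. The function find_entities is hypothetical, and the entry for Adrian Fox is assumed to follow the same pattern as the one for Jane Doe.

```python
def find_entities(text, entries):
    """Return the ids of entries whose listed names occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return [
        entity_id
        for entity_id, entry in entries.items()
        if any(name.lower() in lowered for name in entry["names"])
    ]


# Hand-written subset of the "entries" section; the second entry is assumed.
entries = {
    "Jane Doe": {"names": ["Jane Doe", "jdoe@company.com"]},
    "Adrian Fox": {"names": ["Adrian Fox", "afox@company.com"]},
}

print(find_entities("Please forward the contract to jdoe@company.com.", entries))
# ['Jane Doe']
```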
It is important to remember that any time we make changes to the config.json file, we have to rerun "kee compile" for those changes to take effect.
Creating and using fixtures
Once we have our lookup database compiled, we can begin testing the KEE project by running example Squirro items, or "fixtures", through it. Generally, we store the fixtures that we use for testing in a folder within the KEE project called fixtures/.
Each file within the fixtures folder is a JSON document which includes an example Squirro item that we want to use to test the KEE project, and a list of all the tags that we expect to be added to that item as a result of the KEE process.
An example of a fixture is shown below:
    {
        "item": {
            "body": "I spoke with Adrian Fox on the phone earlier this morning..."
        },
        "keywords": {
            "name": ["Adrian Fox"],
            "position": ["District Representative"],
            "manager name": ["Jane Doe"],
            "manager position": ["District Manager"]
        }
    }
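When several fixtures are needed, writing them from a short script avoids hand-editing JSON. The sketch below produces a file with the same "item" and "keywords" layout as the example above. The helper write_fixture and the file name adrian_fox.json are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def write_fixture(folder, filename, body, expected_keywords):
    """Write a fixture with the item/keywords layout and return its path."""
    fixture = {"item": {"body": body}, "keywords": expected_keywords}
    path = Path(folder) / filename
    # Create the fixtures folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(fixture, indent=4), encoding="utf-8")
    return path


fixture_path = write_fixture(
    "fixtures",
    "adrian_fox.json",
    "I spoke with Adrian Fox on the phone earlier this morning...",
    {
        "name": ["Adrian Fox"],
        "position": ["District Representative"],
        "manager name": ["Jane Doe"],
        "manager position": ["District Manager"],
    },
)
print(fixture_path)
```

The script is meant to be run from inside the KEE project folder, so that the file lands in the project's fixtures/ folder.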
In addition to creating fixtures manually using a tool such as a text editor, fixtures can be created automatically from existing Squirro items by running:
kee get_fixture