...
For each salesperson, the file provides a unique ID, their name, email address, position, and manager. The basic layout of the CSV file is shown below:
    id, name, email, position, manager
    1, John Smith, jsmith@company.com, District Manager, David Cole
    2, Jane Doe, jdoe@company.com, District Manager, David Cole
    3, Adrian Fox, afox@company.com, District Representative, Jane Doe
    ...
Our goal in this KEE project is to identify a salesperson any time that they are referenced in a document, and tag the document with the name of the salesperson, their position, and the name and position of their manager.
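To make the layout concrete, here is a minimal Python sketch that parses the sample rows with the standard `csv` module. It is only an illustration of the file format, not part of the KEE itself; the helper name `load_salespeople` and the in-memory `SAMPLE_CSV` string are assumptions made for this example.

```python
import csv
import io

# Sample rows copied from the layout shown above; a real project would
# open salespeople.csv instead of using an in-memory string.
SAMPLE_CSV = """\
id, name, email, position, manager
1, John Smith, jsmith@company.com, District Manager, David Cole
2, Jane Doe, jdoe@company.com, District Manager, David Cole
3, Adrian Fox, afox@company.com, District Representative, Jane Doe
"""


def load_salespeople(text):
    """Parse the CSV text into a dict keyed by salesperson name (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma,
    # both in the header row and in the data rows.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return {row["name"]: row for row in reader}


salespeople = load_salespeople(SAMPLE_CSV)
print(salespeople["Adrian Fox"]["manager"])  # prints "Jane Doe"
```

Keying the rows by name mirrors the fact that managers are referenced by name rather than by numeric ID in this file.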
...
To start, we point the KEE configuration to our list of entities by adding a source to the config.json file, as shown below:
    {
        "sources": {
            "salespeople": {
                "dsn": "csv:///salespeople.csv",
                "field_id": "name",
                "field_matching": ["name", "email"],
                "hierarchy": "manager -> name"
            }
        }
    }
The configuration above creates a new source of known entities called "salespeople" and sets its data source name ("dsn") to point to the CSV file salespeople.csv, which is located in the same folder as the config.json file.
The field_id field identifies the field "name" as the unique identifier for each entity in the CSV file.
The field_matching field lists all the fields that we want to search for in order to identify a known entity within a document. In this case, we want to find references to either the salesperson's name or their email address in the documents in our Squirro project, so we include both fields in the list.
The hierarchy field indicates that there is a hierarchy within the entities in the CSV file, where the value in the 'manager' field of one entity points to the name of that entity's parent entity (the person's manager).
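The following Python sketch illustrates the idea behind these three settings. It is not the KEE's actual matching logic: it uses a plain case-insensitive substring search, and the names `ENTITIES`, `find_entities`, and `find_parent` are hypothetical helpers introduced only for this example.

```python
# Hypothetical in-memory copy of the three sample rows from the CSV layout.
ENTITIES = [
    {"name": "John Smith", "email": "jsmith@company.com",
     "position": "District Manager", "manager": "David Cole"},
    {"name": "Jane Doe", "email": "jdoe@company.com",
     "position": "District Manager", "manager": "David Cole"},
    {"name": "Adrian Fox", "email": "afox@company.com",
     "position": "District Representative", "manager": "Jane Doe"},
]

FIELD_ID = "name"                    # mirrors "field_id"
FIELD_MATCHING = ["name", "email"]   # mirrors "field_matching"


def find_entities(text, entities):
    """Return the IDs of entities whose name or email occurs in the text."""
    lowered = text.lower()
    found = []
    for entity in entities:
        if any(entity[field].lower() in lowered for field in FIELD_MATCHING):
            found.append(entity[FIELD_ID])
    return found


def find_parent(entity, entities):
    """Follow "manager -> name": the manager value names the parent row."""
    for candidate in entities:
        if candidate["name"] == entity["manager"]:
            return candidate
    return None  # no row with that name in this three-row sample


text = "Please forward the contract to afox@company.com and copy Jane Doe."
print(find_entities(text, ENTITIES))             # ['Jane Doe', 'Adrian Fox']
print(find_parent(ENTITIES[2], ENTITIES)["name"])  # Jane Doe
```

Adrian Fox is found through his email address and Jane Doe through her name, which is why both fields are listed for matching.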
...
We do this by adding an entry to the "strategies" section of the config.json file, as shown below:
    {
        "sources": {
            "salespeople": {
                "dsn": "csv:///salespeople.csv",
                "hierarchy": "manager -> name",
                "strategy": "salesperson_strategy"
            }
        },
        "strategies": {
            "salesperson_strategy": {
                "min_score": 0.9,
                "keywords": [
                    "name",
                    "position"
                ],
                "parent_keywords": [
                    "name -> manager name",
                    "position -> manager position"
                ]
            }
        }
    }
...
The code added above creates a strategy called 'salesperson_strategy' for identifying entities and applies it to the source 'salespeople'.
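As a rough illustration of what the "keywords" and "parent_keywords" entries are meant to produce, the sketch below builds the tags for one matched salesperson. The reading of "field -> keyword" as "copy this field into a keyword with this name" is an assumption inferred from the stated goal of tagging documents with the manager's name and position; `parse_mapping` and `build_keywords` are hypothetical helpers, and "min_score" is not modelled here.

```python
# Strategy values copied from the configuration; min_score is not used below.
STRATEGY = {
    "min_score": 0.9,
    "keywords": ["name", "position"],
    "parent_keywords": ["name -> manager name", "position -> manager position"],
}


def parse_mapping(spec):
    """Split "field -> keyword" into (field, keyword).

    A bare "field" is assumed to map to a keyword of the same name.
    """
    field, _, keyword = (part.strip() for part in spec.partition("->"))
    return field, keyword or field


def build_keywords(entity, parent, strategy):
    """Collect the keywords for one matched entity and its parent, if any."""
    tags = {}
    for spec in strategy["keywords"]:
        field, keyword = parse_mapping(spec)
        tags[keyword] = entity[field]
    if parent is not None:
        for spec in strategy["parent_keywords"]:
            field, keyword = parse_mapping(spec)
            tags[keyword] = parent[field]
    return tags


adrian = {"name": "Adrian Fox", "position": "District Representative"}
jane = {"name": "Jane Doe", "position": "District Manager"}
print(build_keywords(adrian, jane, STRATEGY))
```

For Adrian Fox this yields the keywords name, position, manager name, and manager position, matching the four pieces of information the project sets out to attach to each document.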
...
In order for the kee tool to be able to create fixtures from a Squirro project, the squirro section must be added to the config.json file, as shown below:
    {
        "squirro": {
            "cluster": "http://www.example.com/squirro/",
            "token": "abcabcexampletoken123123123",
            "project_id": "example_project"
        },
        "sources": {
            "salespeople": {
                ...
This provides the kee tool with everything that it needs to authenticate with the Squirro server and download individual items to create fixtures. Please see Connecting to Squirro for more information on how to obtain these values.
...
The above command creates a fixture for each of the two Squirro items indicated by the unique IDs present in the list. The unique ID of a Squirro item can be found at the end of the URL that appears in the browser when the item is selected on the search page (see screenshot below).
...
After the KEE project is uploaded to a Squirro server, the KEE will be available as an enrichment under the Enrich tab of the Squirro frontend. Each uploaded KEE project requires a unique name, which can be customized within the kee section of the config.json file, as shown below.
    {
        "kee": {
            "pipelet": "Salesperson Extraction"
        },
        "squirro": {
            ...
...
In this example, the KEE project will be available as a pipelet with the title "Salesperson Extraction" once it is uploaded.
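Because config.json grows in several steps, a quick local sanity check before uploading can catch simple slips. The sketch below is not the kee tool's own validation: the `check_config` helper is hypothetical, and treating all four sections as required is an assumption based only on the snippets shown in this tutorial. It parses the assembled configuration with Python's `json` module and verifies that every source refers to a strategy that exists.

```python
import json

# Configuration assembled from the snippets shown in this tutorial.
CONFIG_TEXT = """
{
    "kee": {"pipelet": "Salesperson Extraction"},
    "squirro": {
        "cluster": "http://www.example.com/squirro/",
        "token": "abcabcexampletoken123123123",
        "project_id": "example_project"
    },
    "sources": {
        "salespeople": {
            "dsn": "csv:///salespeople.csv",
            "hierarchy": "manager -> name",
            "strategy": "salesperson_strategy"
        }
    },
    "strategies": {
        "salesperson_strategy": {
            "min_score": 0.9,
            "keywords": ["name", "position"],
            "parent_keywords": ["name -> manager name", "position -> manager position"]
        }
    }
}
"""


def check_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Assumption: these are the four sections this tutorial introduces.
    for section in ("kee", "squirro", "sources", "strategies"):
        if section not in config:
            problems.append(f"missing section: {section}")
    strategies = config.get("strategies", {})
    for name, source in config.get("sources", {}).items():
        strategy = source.get("strategy")
        if strategy is not None and strategy not in strategies:
            problems.append(f"source {name!r} uses unknown strategy {strategy!r}")
    return problems


config = json.loads(CONFIG_TEXT)  # raises json.JSONDecodeError on syntax errors
print(check_config(config))       # prints []
```

Since `json.loads` is strict, it also rejects a missing comma between entries or a trailing comma before a closing brace, which are easy mistakes to make when editing the file by hand.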
Download this Example KEE Project
...
The full source for this example is available for download: tutorial.zip.