...

Code Block
language: xml
title: template-body.html
<table>
    <tr>
        <th>Type</th>
        <td>{{ row.InteractionType }}</td>
    </tr>
    <tr>
        <th>Attendees</th>
        <td>{{ item.keywords.Attendees | join(', ') | e}}</td>
    </tr>
    <tr>
        <th>Duration</th>
        <td>{{ row.DurationMinutes }} minutes</td>
    </tr>
</table>

<hr>

<p>{{ row.Notes | e}}</p>  

...

When this command is executed, the items are loaded into Squirro. The duration, however, is now a float value expressed in hours instead of minutes.

Custom Data Source

The data loader can easily be extended with a custom data source, as described in Writing a custom data source. In this example, a quick loader is implemented that handles PubMed data in the Medline format. PubMed is a database of scientific publications in the biomedical field. The Medline format can be retrieved from the site using a simple export.

For this example you can use a list of 106 articles that have been manually extracted. Download the file pubmed.zip and extract it into the tutorial folder. This should create a folder called "pubmed".

A sample file in this folder looks as follows:

Code Block
language: text
title: 26785463.txt (PubMed Medline example)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<pre>
PMID- 26785463
OWN - NLM
STAT- Publisher
DA  - 20160119
LR  - 20160119
IS  - 1095-9203 (Electronic)
IS  - 0036-8075 (Linking)
VI  - 350
IP  - 6265
DP  - 2015 Dec 4
TI  - Teamwork: The tumor cell edition.
PG  - 1174-1175
FAU - Cleary, Allison S
AU  - Cleary AS
AD  - Pennsylvania State University College of Medicine, Hershey PA 17078, USA.
      acleary@hmc.psu.edu.
LA  - ENG
PT  - JOURNAL ARTICLE
TA  - Science
JT  - Science (New York, N.Y.)
JID - 0404511
CRDT- 2016/01/20 06:00
AID - 350/6265/1174 [pii]
AID - 10.1126/science.aad7103 [doi]
PST - ppublish
SO  - Science. 2015 Dec 4;350(6265):1174-1175.
</pre>

It quickly becomes obvious that this is mostly a textual format consisting of key/value pairs.
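The key/value structure can be read with a few lines of Python. The following is a minimal sketch, not the tutorial's medline.py: the function name parse_medline and the regular expression are assumptions made for illustration. It assumes that indented lines continue the previous value and that a tag may occur more than once (as IS does in the sample), so every tag maps to a list of values.

```python
import re

# A field line looks like "PMID- 26785463" or "DA  - 20160119":
# an upper-case tag, optional padding spaces, a dash, then the value.
FIELD_RE = re.compile(r"^([A-Z]{2,4}) *- ?(.*)$")


def parse_medline(text):
    """Parse one Medline record into a dict mapping each tag to a list of values."""
    record = {}
    last_tag = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            last_tag, value = match.groups()
            record.setdefault(last_tag, []).append(value.strip())
        elif line.startswith(" ") and line.strip() and last_tag is not None:
            # Indented line: treat it as a continuation of the previous value.
            record[last_tag][-1] += " " + line.strip()
        # Anything else (XML declaration, <pre> tags, blank lines) is ignored.
    return record


# Shortened copy of the sample file shown above.
SAMPLE = """\
<pre>
PMID- 26785463
IS  - 1095-9203 (Electronic)
IS  - 0036-8075 (Linking)
TI  - Teamwork: The tumor cell edition.
AD  - Pennsylvania State University College of Medicine, Hershey PA 17078, USA.
      acleary@hmc.psu.edu.
</pre>
"""

record = parse_medline(SAMPLE)
print(record["PMID"])
print(record["IS"])
print(record["AD"])
```

Keeping a list per tag avoids losing repeated fields; a loader that needs single values can take the first entry or join the list.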

To import this format, start by specifying the data load command.

Code Block
language: powershell
title: Data load command
squirro_data_load ^
    -v ^
    --cluster %CLUSTER% ^
    --project-id %PROJECT_ID% ^
    --token %TOKEN% ^
    --source-script medline.py ^
    --source-path pubmed ^
    --map-id PMID ^
    --map-title TI ^
    --map-created-at DA ^
    --map-body AB ^
    --source-name "PubMed" ^
    --facets-file facets.json

There is one key change in this command: instead of the --source-type argument, it uses --source-script. That script, defined below, determines how Medline data is processed.

The mapping uses the keys from the sample file above (PMID, TI and DA). AB, the Medline tag for the abstract, does not appear in that particular sample.
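To illustrate what the --map-* arguments express, the sketch below applies the same key choices to a plain dictionary. The function apply_mapping is hypothetical, and the item field names are taken from the argument names; the data loader's actual implementation is not shown in this tutorial. Skipping keys that are missing from a record is likewise an assumption of this example.

```python
# Mirrors the --map-* arguments of the command above: item field -> source key.
MAPPING = {
    "id": "PMID",
    "title": "TI",
    "created_at": "DA",
    "body": "AB",
}


def apply_mapping(record, mapping):
    """Return a dict of item fields, skipping source keys missing from the record."""
    return {field: record[key] for field, key in mapping.items() if key in record}


# Values copied from the sample file; it contains no AB field.
record = {
    "PMID": "26785463",
    "TI": "Teamwork: The tumor cell edition.",
    "DA": "20160119",
    "JT": "Science (New York, N.Y.)",
}

item = apply_mapping(record, MAPPING)
print(item)
```

Keys that are not mapped, such as JT here, are left out of the mapped fields; in the tutorial they are handled through the facets file instead.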

The facets file is straightforward: it makes sure that some of those keys are indexed as item keywords.

Code Block
language: js
title: facets.json
{
    "DA": {
        "data_type": "datetime",
        "input_format_string": "%Y%m%d"
    },
    "JT": {
        "name": "Journal"
    },
    "PT": {
        "name": "Publication Type"
    },
    "PST": {
        "name": "Publication Status"
    },
    "OWN": {
        "name": "Owner"
    },
    "STAT": {
        "name": "Status"
    },
    "FAU": {
        "name": "Author",
        "delimiter": "|"
    }
}
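Two of these settings are easier to follow with concrete values. The sketch below assumes that input_format_string follows the same directives as Python's datetime.strptime and that the delimiter splits a single string into several keyword values; both are assumptions about the loader's behaviour, not a description of its code. The second author name is a made-up placeholder.

```python
from datetime import datetime

# "DA" value from the sample record, parsed with the configured format string.
created = datetime.strptime("20160119", "%Y%m%d")
print(created.isoformat())

# Hypothetical multi-valued "FAU" string joined with the configured delimiter.
# "Doe, Jane" is a placeholder and does not come from the sample data.
authors = "Cleary, Allison S|Doe, Jane".split("|")
print(authors)
```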

The last step is to create the actual data source, which is a bit more involved. The main blocks are commented below. The goal of this data source is to go through all the Medline files on disk (as specified with the --source-path argument) and return one dictionary for each file. The data loader then processes that dictionary through the mappings, facet configurations, templates, etc. in exactly the same way as if it had come straight from a CSV file or a SQL database.
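The "one dictionary per file" idea can be sketched independently of the plugin interface. The following is not the tutorial's medline.py and deliberately does not use any Squirro classes; the names iter_records and parse_pmid are hypothetical, and the toy parser only extracts the PMID so that the example stays short.

```python
import os
import tempfile


def iter_records(source_path, parse):
    """Yield one dictionary per file found under source_path."""
    for dirpath, _dirnames, filenames in os.walk(source_path):
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            with open(full_path, encoding="utf-8") as handle:
                yield parse(handle.read())


def parse_pmid(text):
    """Toy parser (placeholder): keep only the PMID field of a record."""
    for line in text.splitlines():
        if line.startswith("PMID- "):
            return {"PMID": line[len("PMID- "):].strip()}
    return {}


# Demonstrate on a throwaway folder standing in for the "pubmed" folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "26785463.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("<pre>\nPMID- 26785463\nOWN - NLM\n</pre>\n")
    records = list(iter_records(folder, parse_pmid))

print(records)
```

Writing the walker as a generator means files are read one at a time, which keeps memory use flat regardless of how many articles the folder contains.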

...

language: py
title: medline.py

...

is covered in Example data loader plugin.

Conclusion

This concludes the tutorial - a whirlwind tour through the main features of the Squirro data loader.

...