...
Code Block
<table>
<tr>
<th>Type</th>
<td>{{ row.InteractionType }}</td>
</tr>
<tr>
<th>Attendees</th>
<td>{{ item.keywords.Attendees | join(', ') | e}}</td>
</tr>
<tr>
<th>Duration</th>
<td>{{ row.DurationMinutes }} minutes</td>
</tr>
</table>
<hr>
<p>{{ row.Notes | e }}</p>
...
When this command is executed, the items are loaded into Squirro, but the duration is now stored as a float value representing hours instead of minutes.
Custom Data Source
The data loader can easily be extended with a custom data source (see Writing a Custom Data Source). In this example, a quick loader is implemented that can handle PubMed data in the Medline format. PubMed is a database of scientific publications for biomedical literature, and the Medline format can be retrieved from the site using a simple export.
For this example you can use a list of 106 articles that have been manually extracted. Download the file pubmed.zip and extract it into the tutorial folder. This should create a folder called "pubmed".
A sample file in this folder looks as follows:
Code Block
<pre>
PMID- 26785463
OWN - NLM
STAT- Publisher
DA - 20160119
LR - 20160119
IS - 1095-9203 (Electronic)
IS - 0036-8075 (Linking)
VI - 350
IP - 6265
DP - 2015 Dec 4
TI - Teamwork: The tumor cell edition.
PG - 1174-1175
FAU - Cleary, Allison S
AU - Cleary AS
AD - Pennsylvania State University College of Medicine, Hershey PA 17078, USA.
acleary@hmc.psu.edu.
LA - ENG
PT - JOURNAL ARTICLE
TA - Science
JT - Science (New York, N.Y.)
JID - 0404511
CRDT- 2016/01/20 06:00
AID - 350/6265/1174 [pii]
AID - 10.1126/science.aad7103 [doi]
PST - ppublish
SO - Science. 2015 Dec 4;350(6265):1174-1175.
</pre>
It quickly becomes obvious that this is mostly a textual format consisting of key/value pairs.
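As a rough illustration (a hypothetical helper, not part of the loader script yet), such a record can be parsed into a dictionary of tag/value lists: each line starts with a tag of up to four characters followed by a dash, repeated tags such as IS or AU yield multiple values, and indented continuation lines belong to the previous tag.

```python
import re

# A tag of 1-4 uppercase letters, optional padding, a dash, then the value.
TAG_RE = re.compile(r'^([A-Z]{1,4})\s*-\s*(.*)$')

def parse_medline(text):
    """Parse one Medline record into a dict mapping tags to lists of values."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = TAG_RE.match(line)
        if m:
            tag = m.group(1)
            record.setdefault(tag, []).append(m.group(2).strip())
        elif tag:
            # Indented continuation line: append to the previous value.
            record[tag][-1] += ' ' + line.strip()
    return record
```

Repeated tags end up as lists (two IS values in the sample above), which is what makes the delimiter setting in the facets file below useful.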
To import this format, start by specifying the data load command.
Code Block
squirro_data_load ^
-v ^
--cluster %CLUSTER% ^
--project-id %PROJECT_ID% ^
--token %TOKEN% ^
--source-script medline.py ^
--source-path pubmed ^
--map-id PMID ^
--map-title TI ^
--map-created-at DA ^
--map-body AB ^
--source-name "PubMed" ^
--facets-file facets.json
There is one key change in this command: instead of the --source-type argument, it uses --source-script. That script, which is defined below, determines how the Medline data is processed. The mappings use the field keys shown in the example record above.
The facets file is also quite straightforward; it ensures that some of these fields are indexed as item keywords.
Code Block
{
    "DA": {
        "data_type": "datetime",
        "input_format_string": "%Y%m%d"
    },
    "JT": {
        "name": "Journal"
    },
    "PT": {
        "name": "Publication Type"
    },
    "PST": {
        "name": "Publication Status"
    },
    "OWN": {
        "name": "Owner"
    },
    "STAT": {
        "name": "Status"
    },
    "FAU": {
        "name": "Author",
        "delimiter": "|"
    }
}
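As a rough illustration of what two of these settings do (a simplified re-implementation, not Squirro's own code; the second author name is invented for the example): the input_format_string controls how the DA value is parsed into a datetime, and the delimiter splits a single joined string into multiple facet values.

```python
from datetime import datetime

# DA facet: parsed with the format string from facets.json.
created_at = datetime.strptime('20160119', '%Y%m%d')

# FAU facet: the '|' delimiter turns one joined string into several values.
# 'Smith, Jane' is an illustrative second author, not from the sample record.
authors = 'Cleary, Allison S|Smith, Jane'.split('|')
```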
The last step is to create the actual data source. That is a bit more involved, so the main blocks are commented below. The goal of this data source is to go through all the Medline files on disk (as specified with the --source-path argument) and return one dictionary for each file. That dictionary is then processed by the data loader through the mappings, facet configuration, templates, etc. in exactly the same way as if it had come straight from a CSV file or a SQL database.
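The per-file logic can be sketched as follows (a simplified, hypothetical outline rather than the full medline.py: the Squirro DataSource plugin interface is omitted, and the .txt extension filter and the '|' join for multi-value tags are assumptions, the latter chosen to match the FAU delimiter in facets.json):

```python
import os
import re

TAG_RE = re.compile(r'^([A-Z]{1,4})\s*-\s*(.*)$')

def read_record(path):
    """Turn one Medline file into a flat dict of tag -> string.

    Repeated tags (e.g. FAU) are joined with '|' so the delimiter
    setting in facets.json can split them again at load time.
    """
    values = {}
    tag = None
    with open(path, encoding='utf-8') as f:
        for line in f:
            m = TAG_RE.match(line)
            if m:
                tag = m.group(1)
                values.setdefault(tag, []).append(m.group(2).strip())
            elif tag and line.strip():
                # Indented continuation line belongs to the previous tag.
                values[tag][-1] += ' ' + line.strip()
    return {t: '|'.join(v) for t, v in values.items()}

def iter_records(source_path):
    """Yield one dict per Medline file in the source directory."""
    for name in sorted(os.listdir(source_path)):
        if name.endswith('.txt'):  # assumed extension of the extracted files
            yield read_record(os.path.join(source_path, name))
```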
...
Code Block (medline.py)
...
is covered in Example data loader plugin.
Conclusion
This concludes the tutorial - a whirlwind tour through the main features of the Squirro data loader.
...