Introduction
The data loader can easily be extended with a custom data source. For an introduction, see Writing a custom data loader and Data Loader Plugins.
This example implements a simple loader that handles PubMed data in the Medline format. PubMed is a database of scientific publications for biomedical literature. The Medline format can be retrieved from the site using a simple export.
Data
For this example you can use a list of 106 articles that have been manually extracted. Download the file pubmed.zip and extract it into the tutorial folder. This should create a folder called "pubmed".
A sample file in this folder is a plain-text Medline record.
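The exact file contents are not reproduced here, but a schematic record with placeholder values illustrates the shape of the format (the tags and layout follow the Medline convention; the values are invented):

PMID- 12345678
OWN - NLM
STAT- MEDLINE
DA  - 20151008
TI  - Placeholder article title.
AB  - Placeholder abstract describing the study.
FAU - Doe, Jane
FAU - Smith, John
JT  - Journal of Placeholder Studies
PT  - Journal Article
PST - ppublish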
It quickly becomes obvious that this is a textual format consisting mostly of key/value pairs. Note that some keys, such as FAU for the full author names, can occur multiple times in one record.
Data loader command
To import this format, start by specifying the data load command:
squirro_data_load ^
-v ^
--cluster %CLUSTER% ^
--project-id %PROJECT_ID% ^
--token %TOKEN% ^
--source-script medline.py ^
--source-path pubmed ^
--map-id PMID ^
--map-title TI ^
--map-created-at DA ^
--map-body AB ^
--source-name "PubMed" ^
    --facets-file facets.json
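The ^ line continuations and %VAR% placeholders are Windows batch syntax. On Linux or Mac OS X the same command would look like this, assuming the cluster, project id, and token are available as shell variables:

squirro_data_load \
    -v \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --token $TOKEN \
    --source-script medline.py \
    --source-path pubmed \
    --map-id PMID \
    --map-title TI \
    --map-created-at DA \
    --map-body AB \
    --source-name "PubMed" \
    --facets-file facets.json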
There is one key change in this command: instead of the --source-type argument, it uses --source-script. That script, defined below, determines how the Medline data is processed. The mapping arguments use the keys that appear in the example record above.
The facets file is also quite straightforward: it makes sure that some of those keys are indexed as item keywords. Use this facets.json file:
{
    "DA": {
        "data_type": "datetime",
        "input_format_string": "%Y%m%d"
    },
    "JT": {
        "name": "Journal"
    },
    "PT": {
        "name": "Publication Type"
    },
    "PST": {
        "name": "Publication Status"
    },
    "OWN": {
        "name": "Owner"
    },
    "STAT": {
        "name": "Status"
    },
    "FAU": {
        "name": "Author",
        "delimiter": "|"
    }
}
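The datetime configuration for DA parses values such as 20151008 using the %Y%m%d format. The delimiter on FAU accounts for records containing several authors: assuming the loader script joins the repeated FAU values into one pipe-separated string (as the sketch in the next section does), the facets configuration splits that string back into individual Author values. For example:

Value returned by the script:  "Doe, Jane|Smith, John"
Resulting Author facet values: "Doe, Jane", "Smith, John"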
Plugin file
The last step is to create the actual data source. This is a bit more involved; the main blocks are commented below. The goal of this data source is to go through all the Medline files on disk (as specified with the --source-path argument) and return one dictionary for each of those files. That dictionary is then processed by the data loader through the mappings, facet configuration, templates, etc. in exactly the same way as if it had come straight from a CSV file or a SQL database.
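The original plugin code is not reproduced here. The following is a minimal sketch of what medline.py could look like. It assumes the DataSource plugin interface described in Writing a custom data loader (connect, disconnect, getDataBatch, getSchema, getJobId); the method names and the mechanism for obtaining the source path should be verified against that documentation. For simplicity, this sketch hardcodes the folder name instead of reading the --source-path argument.

"""Sketch of a Medline data source for the Squirro data loader.

Assumes the DataSource interface from the data loader plugin
documentation; verify the method names against that page.
"""
import glob
import os

from squirro.dataloader.data_source import DataSource

# The real plugin would read this from the --source-path argument.
SOURCE_PATH = 'pubmed'


class MedlineSource(DataSource):
    """Returns one dictionary per Medline file on disk."""

    def connect(self, inc_column=None, max_inc_value=None):
        # Local files need no connection setup.
        pass

    def disconnect(self):
        pass

    def getJobId(self):
        # Identifier used by the loader to track this load job.
        return 'medline-%s' % SOURCE_PATH

    def getSchema(self):
        # The keys referenced by the mappings and the facets file.
        return ['PMID', 'TI', 'DA', 'AB', 'JT', 'PT',
                'PST', 'OWN', 'STAT', 'FAU']

    def getDataBatch(self, batch_size):
        """Yield lists of dictionaries, one dictionary per file."""
        rows = []
        for file_name in glob.glob(os.path.join(SOURCE_PATH, '*')):
            rows.append(self._parse_file(file_name))
            if len(rows) >= batch_size:
                yield rows
                rows = []
        if rows:
            yield rows

    def _parse_file(self, file_name):
        """Parse the key/value pairs of one Medline file.

        Repeated keys (such as FAU) are joined with a pipe, matching
        the delimiter configured in facets.json. Indented lines
        continue the previous value.
        """
        row = {}
        key = None
        with open(file_name) as f:
            for line in f:
                line = line.rstrip('\n')
                if not line.strip():
                    continue
                if line.startswith(' ') and key:
                    # Continuation line of the previous value.
                    row[key] += ' ' + line.strip()
                elif '- ' in line:
                    key, value = line.split('- ', 1)
                    key = key.strip()
                    value = value.strip()
                    if key in row:
                        row[key] += '|' + value
                    else:
                        row[key] = value
        return row

With this in place, the load command above picks up medline.py through --source-script, and each returned dictionary flows through the mappings and facet configuration like any other row.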
This page can now be found at Example Data Loader Plugin on the Squirro Docs site.