Introduction
The data loader can easily be extended with a custom data source. For an introduction, see Writing a custom data loader and Data Loader Plugins.
This example implements a simple loader that handles PubMed data in the Medline format. PubMed is a database of scientific publications for biomedical literature. The Medline format can be retrieved from the site using a simple export.
Data
For this example you can use a list of 106 articles that have been manually extracted. Download the file pubmed.zip and extract it into the tutorial folder. This should create a folder called "pubmed".
A sample file in this folder is a plain-text Medline record.
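The exact file contents are not reproduced here, but a schematic record with placeholder values illustrates the shape of the format (the tags and layout follow the Medline convention; the values are invented):

PMID- 12345678
OWN - NLM
STAT- MEDLINE
DA  - 20151008
TI  - Placeholder article title.
AB  - Placeholder abstract describing the study.
FAU - Doe, Jane
FAU - Smith, John
JT  - Journal of Placeholder Studies
PT  - Journal Article
PST - ppublish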
It quickly becomes obvious that this is a textual format consisting mostly of key/value pairs. Note that some keys, such as FAU for the full author names, can occur multiple times in one record.
Data loader command
To import this format, start by specifying the data load command:
squirro_data_load ^
-v ^
--cluster %CLUSTER% ^
--project-id %PROJECT_ID% ^
--token %TOKEN% ^
--source-script medline.py ^
--source-path pubmed ^
--map-id PMID ^
--map-title TI ^
--map-created-at DA ^
--map-body AB ^
--source-name "PubMed" ^
    --facets-file facets.json
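The ^ line continuations and %VAR% placeholders are Windows batch syntax. On Linux or Mac OS X the same command would look like this, assuming the cluster, project id, and token are available as shell variables:

squirro_data_load \
    -v \
    --cluster $CLUSTER \
    --project-id $PROJECT_ID \
    --token $TOKEN \
    --source-script medline.py \
    --source-path pubmed \
    --map-id PMID \
    --map-title TI \
    --map-created-at DA \
    --map-body AB \
    --source-name "PubMed" \
    --facets-file facets.json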
There is one key change in this command: instead of the --source-type argument, it uses --source-script. That script, defined below, determines how the Medline data is processed. The mapping arguments use the keys that appear in the example record above.
The facets file is also quite straightforward: it makes sure that some of those keys are indexed as item keywords. Use this facets.json file:
{
    "DA": {
        "data_type": "datetime",
        "input_format_string": "%Y%m%d"
    },
    "JT": {
        "name": "Journal"
    },
    "PT": {
        "name": "Publication Type"
    },
    "PST": {
        "name": "Publication Status"
    },
    "OWN": {
        "name": "Owner"
    },
    "STAT": {
        "name": "Status"
    },
    "FAU": {
        "name": "Author",
        "delimiter": "|"
    }
}
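The datetime configuration for DA parses values such as 20151008 using the %Y%m%d format. The delimiter on FAU accounts for records containing several authors: assuming the loader script joins the repeated FAU values into one pipe-separated string (as the sketch in the next section does), the facets configuration splits that string back into individual Author values. For example:

Value returned by the script:  "Doe, Jane|Smith, John"
Resulting Author facet values: "Doe, Jane", "Smith, John"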
Plugin file
The last step is to create the actual data source. This is a bit more involved; the main blocks are commented below. The goal of this data source is to go through all the Medline files on disk (as specified with the --source-path argument) and return one dictionary for each of those files. That dictionary is then processed by the data loader through the mappings, facet configuration, templates, etc. in exactly the same way as if it had come straight from a CSV file or a SQL database.
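The original plugin code is not reproduced here. The following is a minimal sketch of what medline.py could look like. It assumes the DataSource plugin interface described in Writing a custom data loader (connect, disconnect, getDataBatch, getSchema, getJobId); the method names and the mechanism for obtaining the source path should be verified against that documentation. For simplicity, this sketch hardcodes the folder name instead of reading the --source-path argument.

"""Sketch of a Medline data source for the Squirro data loader.

Assumes the DataSource interface from the data loader plugin
documentation; verify the method names against that page.
"""
import glob
import os

from squirro.dataloader.data_source import DataSource

# The real plugin would read this from the --source-path argument.
SOURCE_PATH = 'pubmed'


class MedlineSource(DataSource):
    """Returns one dictionary per Medline file on disk."""

    def connect(self, inc_column=None, max_inc_value=None):
        # Local files need no connection setup.
        pass

    def disconnect(self):
        pass

    def getJobId(self):
        # Identifier used by the loader to track this load job.
        return 'medline-%s' % SOURCE_PATH

    def getSchema(self):
        # The keys referenced by the mappings and the facets file.
        return ['PMID', 'TI', 'DA', 'AB', 'JT', 'PT',
                'PST', 'OWN', 'STAT', 'FAU']

    def getDataBatch(self, batch_size):
        """Yield lists of dictionaries, one dictionary per file."""
        rows = []
        for file_name in glob.glob(os.path.join(SOURCE_PATH, '*')):
            rows.append(self._parse_file(file_name))
            if len(rows) >= batch_size:
                yield rows
                rows = []
        if rows:
            yield rows

    def _parse_file(self, file_name):
        """Parse the key/value pairs of one Medline file.

        Repeated keys (such as FAU) are joined with a pipe, matching
        the delimiter configured in facets.json. Indented lines
        continue the previous value.
        """
        row = {}
        key = None
        with open(file_name) as f:
            for line in f:
                line = line.rstrip('\n')
                if not line.strip():
                    continue
                if line.startswith(' ') and key:
                    # Continuation line of the previous value.
                    row[key] += ' ' + line.strip()
                elif '- ' in line:
                    key, value = line.split('- ', 1)
                    key = key.strip()
                    value = value.strip()
                    if key in row:
                        row[key] += '|' + value
                    else:
                        row[key] = value
        return row

With this in place, the load command above picks up medline.py through --source-script, and each returned dictionary flows through the mappings and facet configuration like any other row.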
This page can now be found at Example Data Loader Plugin on the Squirro Docs site.