DataSource Class

This is the parent class for all source classes. All source modules must inherit from this class and override all of its methods.

Mandatory methods:

connect(inc_column=None, max_inc_value=None). Creates the connection to the source and selects the desired data. Full load must always be implemented; incremental load only where it can be applied. If the inc_column parameter is not None, an incremental load is performed. For incremental loads, use the incremental module to store and select the control metadata, and use that metadata to filter the source data.
disconnect(). Disconnects from the source if needed: for a database, close the connection; for a file, close the file; and so on.
getDataBatch(batch_size). Returns data from the source in batches. It is a generator and uses yield to return data. When the source has no more data, it returns None.
The batch size value is set using the command line argument --source-batch-size.
The returned data must be a list of dictionaries. Each dictionary represents one line from the source, and each key is a column name. The Data Loader tool takes this list, transforms it into a list of Squirro items, and then uses the ItemUploader to index the data.

Example of required source output:

Input CSV file:

Date,Team1,Team2,FT,HT
2012-08-18,Arsenal,Sunderland,0-0,0-0
2012-08-18,Fulham,,5-0,2-0

This is transformed to:

[{'Date': '2012-08-18',
  'FT': '0-0',
  'HT': '0-0',
  'Team1': 'Arsenal',
  'Team2': 'Sunderland'},
 {'Date': '2012-08-18',
  'FT': '5-0',
  'HT': '2-0',
  'Team1': 'Fulham',
  'Team2': ''}]
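
The batching behaviour described for getDataBatch can be sketched with the standard csv module. This is a minimal illustration, not the tool's own code: the function name get_data_batch and the in-memory sample file are assumptions made for the example, and the generator simply ends when the source is exhausted.

```python
import csv
import io


def get_data_batch(csv_file, batch_size):
    """Yield lists of at most batch_size dictionaries, one dictionary per CSV line."""
    reader = csv.DictReader(csv_file)  # the first line supplies the column names
    batch = []
    for row in reader:
        batch.append(dict(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# Hypothetical in-memory copy of the input CSV file shown above.
sample = io.StringIO(
    "Date,Team1,Team2,FT,HT\n"
    "2012-08-18,Arsenal,Sunderland,0-0,0-0\n"
    "2012-08-18,Fulham,,5-0,2-0\n"
)

batches = list(get_data_batch(sample, batch_size=1))
for batch in batches:
    print(batch)
```

With batch_size=1, each yielded list holds a single dictionary. The empty Team2 field of the second line comes through as an empty string, matching the required output above.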

getJobId(). Used for job locking. Returns a unique identifier for the load. If the source is a database, it returns a hash of the select statement; if the source is a CSV file, it returns the file name; and so on. For incremental loads, the identifier must be the same for all related loads.
getSchema(). Returns the header of the data source (a list containing the names of the source columns). It is used to expand the wildcards inside the facets configuration file and to check that the mapped columns exist in the source.
getArguments(). Used to add source-related parameters to the tool's argparse parser.
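
To show how the mandatory methods fit together, here is a rough sketch of a full-load-only CSV source. Several parts are assumptions rather than the tool's real API: the DataSource class below is a local stand-in for the actual parent class, CsvSource and the --source-file flag are hypothetical, and getArguments is given a parser parameter only because this section does not show how the method reaches the argparse parser.

```python
import argparse
import csv
import os
import tempfile


class DataSource:
    """Local stand-in for the real parent class (assumption for this sketch)."""


class CsvSource(DataSource):
    """Hypothetical source that reads one CSV file; full load only."""

    def __init__(self, path):
        self.path = path
        self._file = None
        self._reader = None

    def connect(self, inc_column=None, max_inc_value=None):
        if inc_column is not None:
            raise ValueError("incremental load is not implemented in this sketch")
        self._file = open(self.path, newline="")
        self._reader = csv.DictReader(self._file)

    def disconnect(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def getDataBatch(self, batch_size):
        batch = []
        for row in self._reader:
            batch.append(dict(row))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    def getJobId(self):
        # For a CSV source the identifier is the file name.
        return os.path.basename(self.path)

    def getSchema(self):
        # Column names taken from the CSV header line.
        return list(self._reader.fieldnames)

    def getArguments(self, parser):
        # Assumed signature: the parser is passed in by the caller.
        parser.add_argument("--source-file", help="path of the CSV file")
        return parser


# Usage example with a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Date,Team1,Team2,FT,HT\n")
    tmp.write("2012-08-18,Arsenal,Sunderland,0-0,0-0\n")
    tmp.write("2012-08-18,Fulham,,5-0,2-0\n")

source = CsvSource(tmp.name)
source.connect()
schema = source.getSchema()
rows = [row for batch in source.getDataBatch(batch_size=1) for row in batch]
job_id = source.getJobId()
source.disconnect()
os.remove(tmp.name)

args = CsvSource("unused.csv").getArguments(argparse.ArgumentParser()).parse_args(
    ["--source-file", "matches.csv"]
)
```

The sketch rejects incremental loads outright to stay short. A source that supports them would instead use inc_column and the stored control metadata to filter the rows it reads.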