This is the parent class for all source classes. Every source module must inherit from this class and override all of its methods.

Mandatory methods:

  • connect(inc_column=None, max_inc_value=None). This method creates the connection to the source and selects the desired data. Full load must always be implemented; incremental load only where it can be applied. If the inc_column parameter is not None, an incremental load is performed. For incremental load, use the incremental module to store and select the control metadata, then use that metadata to filter the source data.
  • disconnect(). Disconnects from the source if needed: for a database, close the connection; for a file, close the file; and so on.
  • getDataBatch(batch_size). This method returns data from the source in batches. It is a generator and uses yield to return data. When the source has no more data, it returns None.
    The batch size value is set using the command line argument --source-batch-size.
    The returned data must be a list of dictionaries. Each dictionary represents one line from the source, and each key is a column name. The Data Loader tool takes this list, transforms it into a list of Squirro items, and then uses the ItemUploader to index the data.
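
The three mandatory methods can be sketched as follows. This is a minimal illustration, not the Data Loader's own code: the DataSource class here is a local stand-in so the example runs on its own, InMemorySource is a hypothetical source, and the incremental filter is passed max_inc_value directly rather than going through the incremental module.

```python
from itertools import islice


class DataSource:
    """Local stand-in for the real parent class (assumption, for illustration)."""

    def connect(self, inc_column=None, max_inc_value=None):
        raise NotImplementedError

    def disconnect(self):
        raise NotImplementedError

    def getDataBatch(self, batch_size):
        raise NotImplementedError


class InMemorySource(DataSource):
    """Hypothetical source that serves rows from a Python list."""

    def __init__(self, rows):
        self._rows = rows
        self._cursor = None

    def connect(self, inc_column=None, max_inc_value=None):
        rows = self._rows
        if inc_column is not None:
            # Incremental load: keep only rows above the stored control value.
            # With no stored value yet, every row is kept.
            rows = [
                row for row in rows
                if max_inc_value is None or row[inc_column] > max_inc_value
            ]
        self._cursor = iter(rows)

    def disconnect(self):
        # Nothing to close for an in-memory list; a file or database
        # source would release its handle or connection here.
        self._cursor = None

    def getDataBatch(self, batch_size):
        # Generator: yield lists of at most batch_size dictionaries,
        # then finish once the source is exhausted.
        while True:
            batch = list(islice(self._cursor, batch_size))
            if not batch:
                return
            yield batch
```

A caller would invoke connect(), iterate over getDataBatch(batch_size), and then call disconnect(). Slicing the iterator with islice keeps only one batch in memory at a time, which matters for sources larger than this toy list.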

Example of required source output:

Input CSV file:

Date,Team1,Team2,FT,HT
2012-08-18,Arsenal,Sunderland,0-0,0-0
2012-08-18,Fulham,,5-0,2-0

Is transformed to:

[{'Date': '2012-08-18',
  'FT': '0-0',
  'HT': '0-0',
  'Team1': 'Arsenal',
  'Team2': 'Sunderland'},
 {'Date': '2012-08-18',
  'FT': '5-0',
  'HT': '2-0',
  'Team1': 'Fulham',
  'Team2': ''}]
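
One way to produce this shape from a CSV source is Python's standard csv.DictReader, which uses the header row as keys and returns empty fields as empty strings. The sketch below assumes the CSV text is already in memory; the csv_to_rows helper is a hypothetical name used only for illustration.

```python
import csv
import io

CSV_TEXT = """Date,Team1,Team2,FT,HT
2012-08-18,Arsenal,Sunderland,0-0,0-0
2012-08-18,Fulham,,5-0,2-0
"""


def csv_to_rows(text):
    """Parse CSV text into a list of dictionaries, one per data line."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = csv_to_rows(CSV_TEXT)
```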

...

This page can now be found at DataSource Class on the Squirro Docs site.