class squirro_client.document_uploader.DocumentUploader
class squirro_client.document_uploader.DocumentUploader(metadata_mapping=None, batch_size=10, batch_size_mb=150, default_mime_type_keyword=True, timeout_secs=300, **kwargs)
Document uploader class which simplifies the indexing of office documents. Default parameters are loaded from the .squirrorc file in your home directory. See the documentation of ItemUploader for a complete list of options regarding project selection, source selection, configuration, etc.
Typical usage:
>>> from squirro_client import DocumentUploader
>>> import os
>>> uploader = DocumentUploader(
...     project_title='My Project', token='<your token>',
...     cluster='https://demo.squirro.net/')
>>> uploader.upload(os.path.expanduser('~/Documents/test.pdf'))
>>> uploader.flush()
Meta-data mapping usage:
By default (i.e. for all document mime-types), map the original document size to a keyword field named “Doc Size” and the document mime-type to a keyword field named “Mime Type”:
>>> mapping = {'default': {'sq:size_orig': 'Doc Size',
...                        'sq:content-mime-type': 'Mime Type'}}
>>> uploader = DocumentUploader(metadata_mapping=mapping)
For a specific mime-type (e.g. ‘application/vnd.oasis.opendocument.text’), map the “meta:word-count” meta-data field value to a keyword field named “Word Count”:
>>> mapping = {'application/vnd.oasis.opendocument.text': {
...     'meta:word-count': 'Word Count'}}
>>> uploader = DocumentUploader(metadata_mapping=mapping)
Default meta-data fields available for mapping usage:
- sq:doc_size: Converted document file size.
- sq:doc_size_orig: Original uploaded document file size.
- sq:content-mime-type: Document mime-type specified during upload operation.
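To make the two mapping levels concrete, here is a minimal sketch of how a 'default' mapping and a mime-type-specific mapping might be combined into keyword fields. The helper apply_mapping, the sample meta-data values, and the merge order (mime-type entries layered on top of 'default') are assumptions made for illustration; they are not the library's documented behaviour.

```python
def apply_mapping(mapping, mime_type, metadata):
    """Hypothetical helper: turn extracted meta-data into keyword fields."""
    # Start with the rules that apply to every mime-type, then layer the
    # mime-type-specific rules on top (assumed precedence).
    rules = dict(mapping.get('default', {}))
    rules.update(mapping.get(mime_type, {}))

    keywords = {}
    for field, keyword_name in rules.items():
        if field in metadata:
            # Keyword values are stored as lists, matching the shape of the
            # keywords argument accepted by upload().
            keywords[keyword_name] = [metadata[field]]
    return keywords


odt = 'application/vnd.oasis.opendocument.text'
mapping = {
    'default': {'sq:content-mime-type': 'Mime Type'},
    odt: {'meta:word-count': 'Word Count'},
}
# Sample meta-data values invented for this sketch.
metadata = {'sq:content-mime-type': odt, 'meta:word-count': 1250}

keywords = apply_mapping(mapping, odt, metadata)
print(keywords)
```

A document with a different mime-type would only receive the “Mime Type” keyword in this sketch, because no type-specific rules match it.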
upload
upload(filename, mime_type=None, title=None, doc_id=None, keywords=None, link=None, created_at=None, filename_encoding=None, content_url=None)
Creates a Squirro item from the provided filename and queues it for upload. Items are buffered internally and uploaded according to the specified batch size. If mime_type is not provided, a simple lookup based on the filename extension is performed.
Example:
>>> filename = 'test.pdf'
>>> mime_type = 'application/pdf'
>>> title = 'My Test Document'
>>> doc_id = 'doc01'
>>> keywords = {'Author': ['John Smith'], 'Tags': ['sales',
...                                                'marketing']}
>>> link = 'http://example.com/test.pdf'
>>> created_at = '2014-07-10T21:26:15'
>>> uploader.upload(filename, mime_type, title, doc_id, keywords,
...                 link, created_at)
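When mime_type is omitted, the lookup falls back to the filename extension. The sketch below approximates such a lookup with Python's standard mimetypes module. The function name guess_mime_type and the 'application/octet-stream' fallback are assumptions for illustration; the uploader's actual lookup may behave differently.

```python
import mimetypes


def guess_mime_type(filename, fallback='application/octet-stream'):
    """Hypothetical stand-in for an extension-based mime-type lookup."""
    # guess_type() returns a (type, encoding) tuple; type is None when the
    # extension is not recognised.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or fallback


print(guess_mime_type('test.pdf'))
print(guess_mime_type('report.qqqxyz'))  # unknown extension, uses the fallback
```

Passing mime_type explicitly, as in the example above, avoids relying on the extension when a file is named unconventionally.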
flush
flush()
Flush the internal buffer by uploading all documents.
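The reason a final flush() call matters can be shown with a small stand-in class. This is a sketch of the general buffering pattern only: BufferedUploader, its size accounting, and the reading of batch_size_mb as a size limit in megabytes are assumptions, not the DocumentUploader implementation.

```python
class BufferedUploader:
    """Hypothetical stand-in that illustrates batch buffering."""

    def __init__(self, batch_size=10, batch_size_mb=150):
        self.batch_size = batch_size
        self.batch_size_bytes = batch_size_mb * 1024 * 1024
        self._buffer = []
        self._buffered_bytes = 0
        self.sent_batches = []  # records what would have been uploaded

    def upload(self, name, size_bytes):
        self._buffer.append(name)
        self._buffered_bytes += size_bytes
        # Send a batch as soon as either limit is reached.
        if (len(self._buffer) >= self.batch_size
                or self._buffered_bytes >= self.batch_size_bytes):
            self.flush()

    def flush(self):
        # Send whatever is buffered, even if the batch is not full.
        if self._buffer:
            self.sent_batches.append(list(self._buffer))
        self._buffer = []
        self._buffered_bytes = 0


uploader = BufferedUploader(batch_size=2)
for name in ['a.pdf', 'b.pdf', 'c.pdf']:
    uploader.upload(name, size_bytes=1024)

# The third document has not filled a batch, so it is still buffered.
pending_before_flush = list(uploader._buffer)
uploader.flush()
print(pending_before_flush, uploader.sent_batches)
```

In this sketch the last document would never be sent without the explicit flush(), which is why the typical usage ends with uploader.flush().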