
class squirro_client.document_uploader.DocumentUploader

class squirro_client.document_uploader.DocumentUploader(metadata_mapping=None, batch_size=10, batch_size_mb=150, default_mime_type_keyword=True, timeout_secs=300, **kwargs)

Document uploader class which simplifies the indexing of office documents. Default parameters are loaded from the .squirrorc file in your home directory. See the documentation of ItemUploader for a complete list of options regarding project selection, source selection, configuration, etc.

Parameters:
  • batch_size – Number of items to send in one request (default 10).
  • batch_size_mb – Total size of documents, in megabytes, to send in one request (default 150). Once the buffered documents reach this size, the client uploads them.
  • metadata_mapping – A dictionary which contains the meta-data mapping (default None).
  • default_mime_type_keyword – If True (the default), a keyword containing the mime-type is added to each document.
  • timeout_secs – How many seconds to wait for data before giving up (default 300).
  • kwargs – Any additional keyword arguments are passed on to the ItemUploader. See the documentation of that class for details.
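
The interaction between batch_size and batch_size_mb can be pictured with a small stand-alone sketch. This is an illustration only, not the squirro_client implementation: the BatchBuffer class and its attributes are hypothetical, one megabyte is assumed to be 1024 * 1024 bytes, and the size check is assumed to run after each document is added.

```python
class BatchBuffer:
    """Toy buffer that sends a batch when either limit is reached.

    Hypothetical illustration; the real uploader may behave differently.
    """

    def __init__(self, batch_size=10, batch_size_mb=150):
        self.batch_size = batch_size
        # Assumption: 1 MB = 1024 * 1024 bytes.
        self.batch_size_bytes = batch_size_mb * 1024 * 1024
        self.items = []
        self.total_bytes = 0
        self.sent_batches = []  # stands in for the HTTP requests

    def add(self, name, size_bytes):
        """Buffer one document and send the batch if a limit is hit."""
        self.items.append(name)
        self.total_bytes += size_bytes
        too_many = len(self.items) >= self.batch_size
        too_big = self.total_bytes >= self.batch_size_bytes
        if too_many or too_big:
            self.flush()

    def flush(self):
        """Send whatever is buffered and reset the counters."""
        if self.items:
            self.sent_batches.append(list(self.items))
        self.items = []
        self.total_bytes = 0


buf = BatchBuffer(batch_size=3, batch_size_mb=1)
for name in ("a.pdf", "b.pdf", "c.pdf"):
    buf.add(name, 10)                    # third item triggers the count limit
buf.add("big.pdf", 2 * 1024 * 1024)      # a single large file triggers the size limit
```

Under these assumptions, whichever limit is reached first triggers an upload, which is why a final flush() is still needed for the documents left in a partially filled batch.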

Typical usage:

>>> from squirro_client import DocumentUploader
>>> import os
>>> uploader = DocumentUploader(project_title='My Project', token='<your token>', cluster='https://next.squirro.net/')
>>> uploader.upload(os.path.expanduser('~/Documents/test.pdf'))
>>> uploader.flush()

Meta-data mapping usage:

  • By default (i.e. for all document mime-types) map the original document size to a keyword field named “Doc Size” and the document mime-type to a keyword field named “Mime Type”:


    >>> mapping = {'default': {'sq:doc_size_orig': 'Doc Size', 'sq:content-mime-type': 'Mime Type'}}
    >>> uploader = DocumentUploader(metadata_mapping=mapping)


  • For a specific mime-type (e.g. ‘application/vnd.oasis.opendocument.text’) map the “meta:word-count” meta-data field value to a keyword field named “Word Count”:


    >>> mapping = {'application/vnd.oasis.opendocument.text': {'meta:word-count': 'Word Count'}}
    >>> uploader = DocumentUploader(metadata_mapping=mapping)


Default meta-data fields available for mapping usage:

  • sq:doc_size: Converted document file size.
  • sq:doc_size_orig: Original uploaded document file size.
  • sq:content-mime-type: Document mime-type specified during upload operation.
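
To make the effect of a mapping concrete, the following sketch shows one way the rules could be turned into keywords. It is not the library's code: apply_metadata_mapping is a hypothetical helper, the sample meta-data values are invented, and it assumes that rules for a specific mime-type are merged on top of the 'default' rules.

```python
def apply_metadata_mapping(mapping, mime_type, metadata):
    """Translate raw meta-data fields into keyword fields.

    Hypothetical helper for illustration only.
    """
    # Start from the rules that apply to every mime-type.
    rules = dict(mapping.get('default', {}))
    # Assumption: mime-type specific rules extend/override the defaults.
    rules.update(mapping.get(mime_type, {}))

    keywords = {}
    for field, keyword_name in rules.items():
        if field in metadata:
            # Keyword values are lists of strings.
            keywords[keyword_name] = [str(metadata[field])]
    return keywords


mapping = {
    'default': {'sq:doc_size_orig': 'Doc Size',
                'sq:content-mime-type': 'Mime Type'},
    'application/vnd.oasis.opendocument.text': {'meta:word-count': 'Word Count'},
}
# Invented sample values for a single OpenDocument text file.
metadata = {
    'sq:doc_size_orig': 20480,
    'sq:content-mime-type': 'application/vnd.oasis.opendocument.text',
    'meta:word-count': 512,
}
keywords = apply_metadata_mapping(
    mapping, 'application/vnd.oasis.opendocument.text', metadata)
```

With this sample input the OpenDocument file receives all three keywords, while a document of any other mime-type would receive only the two default ones.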

upload

upload(filename, mime_type=None, title=None, doc_id=None, keywords=None, link=None, created_at=None, filename_encoding=None, content_url=None)

Creates a Squirro item from the file at the provided filename and queues it for upload. Items are buffered internally and uploaded according to the specified batch size. If mime_type is not provided, the mime-type is looked up from the filename extension.

Parameters:
  • filename – Read content from the provided filename.
  • mime_type – Optional mime-type for the provided filename.
  • title – Optional title for the uploaded document.
  • doc_id – Optional external document identifier.
  • keywords – Optional dictionary of document meta-data keywords. All values must be lists of strings.
  • link – Optional URL which points to the origin document.
  • created_at – Optional document creation date and time.
  • filename_encoding – Encoding of the filename.
  • content_url – Storage URL of this file. If this is set, the Squirro cluster will not copy the file.
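
The extension-based lookup used when mime_type is omitted can be approximated with Python's standard mimetypes module. This is a sketch of the general technique, not necessarily what the uploader does internally; guess_mime_type and its fallback value are hypothetical.

```python
import mimetypes


def guess_mime_type(filename, fallback='application/octet-stream'):
    """Guess a mime-type from the filename extension alone.

    Hypothetical helper; the file content is never inspected.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for unknown or missing extensions.
    return mime_type or fallback


pdf_type = guess_mime_type('test.pdf')
unknown_type = guess_mime_type('notes.zzzunknown')
```

Because such a lookup relies only on the extension, passing mime_type explicitly is the safer choice for files with missing or misleading extensions.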

Example:

>>> filename = 'test.pdf'
>>> mime_type = 'application/pdf'
>>> title = 'My Test Document'
>>> doc_id = 'doc01'
>>> keywords = {'Author': ['John Smith'], 'Tags': ['sales', 'marketing']}
>>> link = 'http://example.com/test.pdf'
>>> created_at = '2014-07-10T21:26:15'
>>> uploader.upload(filename, mime_type, title, doc_id, keywords, link, created_at)
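
Since every value in keywords must be a list of strings, it can help to normalize loosely typed meta-data before calling upload. The helper below is a hypothetical convenience, not part of squirro_client, and its handling of None values (dropping them) is an assumption.

```python
def normalize_keywords(raw):
    """Return a copy of raw in which every value is a list of strings.

    Hypothetical helper for illustration only.
    """
    keywords = {}
    for name, value in raw.items():
        if value is None:
            # Assumption: keywords without a value are simply dropped.
            continue
        if isinstance(value, (list, tuple)):
            keywords[name] = [str(v) for v in value]
        else:
            # Wrap a single scalar value in a list.
            keywords[name] = [str(value)]
    return keywords


keywords = normalize_keywords({
    'Author': 'John Smith',
    'Tags': ['sales', 'marketing'],
    'Pages': 12,
    'Reviewer': None,
})
```

The resulting dictionary can then be passed as the keywords argument of upload.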

flush

flush()

Flush the internal buffer by uploading all documents.
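
Because upload only buffers documents, a bulk import has to end with a call to flush. The sketch below walks a directory tree and relies only on the upload(filename) and flush() methods documented above; upload_directory and its extensions parameter are hypothetical, and uploader can be any object that provides those two methods.

```python
import os


def upload_directory(uploader, root, extensions=('.pdf', '.docx', '.odt')):
    """Upload every matching file below root, then flush the buffer.

    Hypothetical helper; returns the number of files handed to the uploader.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                uploader.upload(os.path.join(dirpath, name))
                count += 1
    # Send whatever is still waiting in a partially filled batch.
    uploader.flush()
    return count
```

A call such as upload_directory(uploader, os.path.expanduser('~/Documents')) would then send all matching documents with a single final flush.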