The PDF Conversion step converts office suite documents (such as Microsoft Word, Excel, and PowerPoint documents) to PDF documents.
Overview
The PDF Conversion step converts non-plain-text file formats to PDF.
Documents from popular office suites, such as Microsoft Office, LibreOffice, or Google Docs, can be used as input and converted into PDFs.
The file format is detected through its MIME type. The PDF Conversion step can convert documents with the following MIME types into PDFs:
application/msword
application/rtf
application/vnd.lotus-1-2-3
application/vnd.ms-excel
application/vnd.ms-excel.sheet.macroEnabled.12
application/vnd.ms-excel.template.macroEnabled.12
application/vnd.ms-powerpoint
application/vnd.ms-powerpoint.presentation.macroEnabled.12
application/vnd.ms-powerpoint.slideshow.macroEnabled.12
application/vnd.ms-powerpoint.template.macroEnabled.12
application/vnd.ms-word.document.macroEnabled.12
application/vnd.ms-word.template.macroEnabled.12
application/vnd.ms-works
application/vnd.oasis.opendocument.chart
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.text-web
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.sun.xml.calc
application/vnd.sun.xml.calc.template
application/vnd.sun.xml.draw
application/vnd.sun.xml.draw.template
application/vnd.sun.xml.impress
application/vnd.sun.xml.impress.template
application/vnd.sun.xml.math
application/vnd.sun.xml.writer
application/vnd.sun.xml.writer.global
application/vnd.sun.xml.writer.template
application/vnd.wordperfect
application/x-dbf
application/x-extension-txt
application/x-quattropro
application/x-t602
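To make the MIME type check concrete, the following is a minimal sketch of how a caller might test in advance whether a document qualifies for conversion. It is not the step's actual implementation: the function name is_convertible and the constant SUPPORTED_MIME_TYPES are hypothetical, and the set holds only a small subset of the MIME types listed above.

```python
# Hypothetical helper, not part of the product: checks a MIME type
# against a subset of the supported types listed in this section.
SUPPORTED_MIME_TYPES = {
    "application/msword",
    "application/rtf",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.ms-word.document.macroEnabled.12",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# MIME types are case-insensitive, so compare lowercase forms.
_SUPPORTED_LOWER = {m.lower() for m in SUPPORTED_MIME_TYPES}


def is_convertible(mime_type: str) -> bool:
    """Return True if mime_type is in the (subset of the) supported set.

    Parameters such as "; charset=utf-8" are stripped before comparing.
    """
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in _SUPPORTED_LOWER
```

For example, is_convertible("application/msword") returns True, while is_convertible("text/plain") returns False because plain text is not in the list.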
The PDF Conversion step belongs to the Enrich section and is part of the Binary Documents pipeline preset.
When the PDF Conversion step is used together with the Content Augmentation and Content Extraction steps, the Pipeline Editor automatically places it before those two steps. This lets subsequent enrichments act on the generated PDF representation, which improves the processing and display of the item in the Squirro UI.
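The ordering rule can be illustrated with a short sketch. This is only a model of the behavior described above, not the Pipeline Editor's code: the function place_pdf_conversion_first is hypothetical, and the pipeline is represented here simply as a list of step names.

```python
# Illustrative model only: a pipeline is a plain list of step names.
PDF_CONVERSION = "PDF Conversion"
DEPENDENT_STEPS = ("Content Augmentation", "Content Extraction")


def place_pdf_conversion_first(steps):
    """Return a copy of steps with PDF Conversion ahead of the dependent steps."""
    if PDF_CONVERSION not in steps:
        return list(steps)
    positions = [steps.index(s) for s in DEPENDENT_STEPS if s in steps]
    # Nothing to do if no dependent step is present or the order is already right.
    if not positions or steps.index(PDF_CONVERSION) < min(positions):
        return list(steps)
    # Take PDF Conversion out, then reinsert it just before the earliest
    # dependent step.
    reordered = [s for s in steps if s != PDF_CONVERSION]
    target = min(reordered.index(s) for s in DEPENDENT_STEPS if s in reordered)
    reordered.insert(target, PDF_CONVERSION)
    return reordered
```

Steps that do not depend on the PDF representation keep their relative order; only the PDF Conversion step moves.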
Configuration
The PDF Conversion pipeline step does not have any configuration options.
However, one related configuration option is worth noting: pdfconversion.pdf-cache-ttl. This option is visible to, and can be configured by, server admins only.
This option controls the number of seconds that a generated PDF remains in the pdf_conversion cache. The default is 1 day (86400 seconds). Each time an item with a PDF representation is accessed, its TTL is refreshed. If the TTL expires without the item being accessed, its PDF representation is removed from the cache. The next time the item is requested, its PDF representation is generated again and displayed in the Squirro UI. This is transparent to the end user.
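The refresh-on-access behavior described above is often called a sliding TTL. The sketch below models it in a few lines, assuming an in-memory dictionary; the class SlidingTTLCache and its method names are hypothetical and say nothing about how the pdf_conversion cache is actually implemented.

```python
import time

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # 1 day, matching the documented default


class SlidingTTLCache:
    """Toy cache whose entries expire ttl_seconds after their last access."""

    def __init__(self, ttl_seconds=DEFAULT_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._entries = {}   # key -> (value, expires_at)

    def get_or_generate(self, key, generate):
        """Return the cached value, regenerating it if missing or expired.

        Every call pushes the expiry forward by the TTL. Expired entries
        are not evicted proactively; they are replaced on the next request.
        """
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            value = entry[0]    # still fresh: reuse the cached value
        else:
            value = generate()  # missing or expired: generate again
        self._entries[key] = (value, now + self._ttl)
        return value
```

In this model, an item that is requested at least once per TTL window is never regenerated, while an item left untouched for longer than the TTL is regenerated on its next request, which the caller does not notice.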