Note |
---|
WORK IN PROGRESS |
The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.
Overview
The PDF OCR step is one of the Enrich steps. It extracts text from files of MIME type application/pdf that do not contain machine-readable text.
The Content Extraction (formerly Content Conversion) step must run after the PDF OCR step.
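The ordering requirement can be checked with a small helper. This is only a minimal sketch: the step identifiers `pdf-ocr` and `content-extraction` and the function name are hypothetical placeholders chosen for illustration, not names taken from the product.

```python
# Hypothetical step identifiers, used only for this illustration.
PDF_OCR = "pdf-ocr"
CONTENT_EXTRACTION = "content-extraction"


def has_valid_ocr_order(step_types):
    """Return True if the pipeline satisfies the ordering rule.

    The rule checked here: when a PDF OCR step is present, a Content
    Extraction step must appear somewhere after it in the list.
    A pipeline without a PDF OCR step passes trivially.
    """
    if PDF_OCR not in step_types:
        return True
    ocr_index = step_types.index(PDF_OCR)
    # Look only at the steps that come after the first PDF OCR step.
    return CONTENT_EXTRACTION in step_types[ocr_index + 1:]
```

For example, `has_valid_ocr_order(["pdf-ocr", "content-extraction"])` passes, while the reversed list does not.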
Configuration
Field | Default | Description | UI Setting |
---|---|---|---|
replace_file | True | Replace original PDF file | |
ocr_timeout | 60 | Maximum time in seconds spent on OCR per document | |
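The two settings above, with their documented defaults, could be assembled and validated as follows. This is a sketch under assumptions: the helper `build_ocr_config` and the validation rules are illustrative and not part of the product; only the field names and default values come from the configuration listing.

```python
# Defaults taken from the configuration listing.
DEFAULTS = {
    "replace_file": True,  # replace the original PDF file
    "ocr_timeout": 60,     # maximum seconds spent on OCR per document
}


def build_ocr_config(**overrides):
    """Merge overrides into the defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")

    config = {**DEFAULTS, **overrides}

    if not isinstance(config["replace_file"], bool):
        raise TypeError("replace_file must be a boolean")

    timeout = config["ocr_timeout"]
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        raise TypeError("ocr_timeout must be a number of seconds")
    if timeout <= 0:
        raise ValueError("ocr_timeout must be positive")

    return config
```

Calling `build_ocr_config(ocr_timeout=120)` keeps `replace_file` at its default and raises the per-document time limit to 120 seconds.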
This page can now be found at PDF OCR on the Squirro Docs site.