...
The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.
Overview
The PDF OCR step belongs to the Enrich steps. Using OCRmyPDF, it extracts text from files of MIME type application/pdf
that don’t contain machine-readable text.
The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.
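The ordering requirement can be expressed as a small check. The following is only an illustrative sketch: the function name `pdf_ocr_order_ok` and the representation of a pipeline as a list of step names are assumptions, not part of the product's API.

```python
from typing import List


def pdf_ocr_order_ok(steps: List[str]) -> bool:
    """Return True if the step order satisfies the PDF OCR requirements.

    Hypothetical helper: a pipeline is modeled as a plain list of step names.
    """
    if "PDF OCR" not in steps:
        # Nothing to check when the PDF OCR step is not used.
        return True
    try:
        augmentation = steps.index("Content Augmentation")
        ocr = steps.index("PDF OCR")
        extraction = steps.index("Content Extraction")
    except ValueError:
        # One of the required neighbouring steps is missing.
        return False
    # Content Augmentation before PDF OCR, Content Extraction after it.
    return augmentation < ocr < extraction
```

For example, `pdf_ocr_order_ok(["Content Augmentation", "PDF OCR", "Content Extraction"])` passes, while a list that places Content Extraction before PDF OCR does not.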
Configuration
| Field | Default | Description | UI Setting |
|---|---|---|---|
| | True | Replace the original PDF file with a PDF file containing the extracted text overlay. | |
| | 60 | Maximum time in seconds spent on OCR per document. | |
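To make the two settings more concrete, here is a minimal sketch of how an OCR call with a per-document time limit and optional in-place replacement might look when OCRmyPDF is invoked through its command-line interface. How the step actually calls OCRmyPDF is not documented here; the function names, the parameter names `replace_original` and `timeout_s`, the `.ocr.pdf` suffix, and the use of the `--skip-text` flag are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(src: Path, dst: Path) -> List[str]:
    # --skip-text asks OCRmyPDF to leave pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_pdf(src: Path, replace_original: bool = True, timeout_s: int = 60) -> Path:
    """Run OCR on one PDF and return the path of the resulting file.

    Hypothetical helper: the defaults mirror the table above (True, 60 seconds).
    """
    dst = src.with_name(src.stem + ".ocr.pdf")
    try:
        # subprocess.run stops the child process and raises TimeoutExpired
        # once the time limit is exceeded.
        subprocess.run(build_ocr_command(src, dst), check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Give up on this document and keep the original file unchanged.
        dst.unlink(missing_ok=True)
        return src
    if replace_original:
        # Overwrite the original PDF with the version carrying the text overlay.
        dst.replace(src)
        return src
    return dst
```

In this sketch a timeout leaves the original file untouched; whether the actual step skips, fails, or retries such documents is not stated in this section.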
...