WORK IN PROGRESS

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step belongs to the Enrich steps. Using OCRmyPDF, it extracts text from files of MIME type application/pdf that don’t contain machine-readable text.

The Content Augmentation step must run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step must run after it.
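
The ordering requirement can be illustrated with a small check. This is only a sketch: the step names below are illustrative labels chosen for this example, not identifiers taken from the product, and the function is hypothetical.

```python
# Illustrative labels for the three steps; not actual product identifiers.
REQUIRED_ORDER = ["content_augmentation", "pdf_ocr", "content_extraction"]


def order_is_valid(pipeline: list[str]) -> bool:
    """Return True if the steps from REQUIRED_ORDER that are present
    in the pipeline appear in the required relative order."""
    positions = [pipeline.index(step) for step in REQUIRED_ORDER if step in pipeline]
    return positions == sorted(positions)
```

A pipeline that lists Content Extraction ahead of PDF OCR would fail this check, because the extraction step would then see the PDF before any text layer has been added.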

Configuration

| Field | Default | Description | UI Setting |
|---|---|---|---|
| replace_file | True | Replace original PDF file with a PDF file containing the extracted text overlay. | |
| ocr_timeout | 60 | Maximum time in seconds spent on OCR per document. | |
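
To make the two fields listed above concrete, here is a minimal sketch of how they might translate into an OCRmyPDF command-line call. It is not the step's actual implementation: the class `PdfOcrConfig`, the helper functions, the `.ocr.pdf` naming used when `replace_file` is disabled, and the use of `--skip-text` are all assumptions made for illustration. The timeout is enforced on the whole process, matching the "per document" wording.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PdfOcrConfig:
    # Field names and defaults follow the configuration fields documented above.
    replace_file: bool = True
    ocr_timeout: int = 60  # seconds per document


def output_path(source: Path, config: PdfOcrConfig) -> Path:
    # Assumption: with replace_file disabled, the result is written
    # next to the original under a hypothetical ".ocr.pdf" name.
    if config.replace_file:
        return source
    return source.with_name(source.stem + ".ocr.pdf")


def build_command(source: Path, config: PdfOcrConfig) -> list[str]:
    # --skip-text tells OCRmyPDF to leave pages that already contain text alone.
    return ["ocrmypdf", "--skip-text", str(source), str(output_path(source, config))]


def run_ocr(source: Path, config: PdfOcrConfig) -> bool:
    """Return True on success, False if OCR failed or exceeded ocr_timeout."""
    try:
        subprocess.run(build_command(source, config), check=True, timeout=config.ocr_timeout)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
    return True
```

With the defaults, `build_command(Path("scan.pdf"), PdfOcrConfig())` passes the same path as input and output, which OCRmyPDF treats as an in-place update.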
