The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step belongs to the Enrich steps. Using OCRmyPDF, it extracts text from files of MIME type application/pdf that don’t contain any machine-readable text.
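The selection rule above can be illustrated with a minimal Python sketch. The function name `needs_ocr` and its two arguments are hypothetical and are not part of the step's actual interface; the sketch also assumes that any text already extracted from the file is available as a string.

```python
def needs_ocr(mime_type: str, extracted_text: str) -> bool:
    """Return True for PDF files that carry no machine-readable text.

    Hypothetical helper: the real step may detect existing text differently.
    """
    is_pdf = mime_type.strip().lower() == "application/pdf"
    has_text = bool(extracted_text.strip())  # whitespace alone does not count as text
    return is_pdf and not has_text
```

Under these assumptions, a scanned PDF with no text layer is selected for OCR, while a PDF that already contains text, or a file of any other MIME type, is skipped.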

The Content Augmentation step needs to run before the PDF OCR step, and the Content Extraction (formerly Content Conversion) step needs to run after it.

...

Configuration

Field	Default	Description	UI Setting
`replace_file`	True	Replace original PDF file with a PDF file containing the extracted text overlay.
`ocr_timeout`	60	Maximum time in seconds spent on OCR per document.
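As a rough illustration of how the two settings could interact, the following Python sketch wraps the `ocrmypdf` command-line tool. Only the field names and defaults come from the table above; the class `PdfOcrConfig`, the functions `build_command` and `run_ocr`, the temporary output name, and the fallback behavior on timeout are assumptions and do not describe the step's actual implementation.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class PdfOcrConfig:
    # Field names and defaults taken from the configuration table.
    replace_file: bool = True
    ocr_timeout: int = 60  # seconds per document


def build_command(src: Path, dst: Path) -> List[str]:
    # --skip-text makes OCRmyPDF leave pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def run_ocr(src: Path, config: PdfOcrConfig) -> Path:
    """Run OCR on src and return the path of the file to use afterwards."""
    dst = src.with_name(src.stem + ".ocr.pdf")  # hypothetical output name
    try:
        # The timeout applies to the whole document, matching ocr_timeout.
        subprocess.run(build_command(src, dst), check=True, timeout=config.ocr_timeout)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        dst.unlink(missing_ok=True)  # assumption: keep the original on failure
        return src
    if config.replace_file:
        dst.replace(src)  # overwrite the original with the OCR result
        return src
    return dst
```

Calling `run_ocr(Path("scan.pdf"), PdfOcrConfig())` would, under these assumptions, overwrite `scan.pdf` with a version that contains the text overlay, or leave it unchanged if OCR fails or takes longer than 60 seconds.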

...
