WORK IN PROGRESS

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step is one of the Enrich steps. It extracts text from files of MIME type application/pdf that don’t contain machine-readable text.
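The selection rule above can be sketched as a small predicate. This is only an illustration, not the step's actual implementation: the function name needs_ocr is hypothetical, the MIME type is guessed from the file name rather than from the file content, and the text already extracted from the PDF is assumed to be passed in by the caller.

```python
import mimetypes

PDF_MIME_TYPE = "application/pdf"


def needs_ocr(filename: str, extracted_text: str) -> bool:
    """Return True if a file looks like a PDF without machine-readable text.

    Hypothetical helper: the real step may detect the MIME type and the
    presence of text in a different way.
    """
    # Assumption: guess the MIME type from the file extension.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type != PDF_MIME_TYPE:
        return False
    # A PDF whose extracted text is empty or whitespace-only is a candidate.
    return not extracted_text.strip()
```

Under these assumptions, a scanned PDF with no text layer would be selected, while a PDF that already carries text, or a file of another type, would be skipped.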

The Content Extraction (formerly Content Conversion) step must run after the PDF OCR step.

Configuration

Field         Default  Description                                        UI Setting
replace_file  True     Replace original PDF file
ocr_timeout   60       Maximum time in seconds spent on OCR per document
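As a rough illustration of the two fields and their defaults, the settings could be modelled as follows. The class PdfOcrConfig and the helper config_from_dict are hypothetical names used only for this sketch; the page does not specify how the step is actually configured, and the positive-timeout check is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfOcrConfig:
    """Hypothetical container for the PDF OCR step settings."""

    replace_file: bool = True  # replace the original PDF file
    ocr_timeout: int = 60      # maximum seconds spent on OCR per document

    def __post_init__(self):
        # Assumption: a timeout of zero or less is not meaningful.
        if self.ocr_timeout <= 0:
            raise ValueError("ocr_timeout must be a positive number of seconds")


def config_from_dict(raw: dict) -> PdfOcrConfig:
    """Build a config from a plain dict, rejecting fields not listed above."""
    unknown = set(raw) - {"replace_file", "ocr_timeout"}
    if unknown:
        raise KeyError(f"unknown PDF OCR fields: {sorted(unknown)}")
    return PdfOcrConfig(**raw)
```

For example, config_from_dict({"ocr_timeout": 120}) would keep replace_file at its default and allow up to 120 seconds per document.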
