Note

WORK IN PROGRESS

The PDF OCR (optical character recognition) step converts images of text embedded in PDF files into machine-encoded text. Images of text are typically found in PDFs of scanned documents.

Overview

The PDF OCR step is one of the Enrich steps. It extracts text from files of MIME type application/pdf that don't contain machine-readable text.

The Content Extraction (formerly Content Conversion) step must run after the PDF OCR step.
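The ordering requirement can be illustrated with a small check. This is a sketch only: the step identifiers `pdf-ocr` and `content-extraction` and the list-of-dicts pipeline layout are hypothetical names chosen for illustration, not the product's actual workflow format.

```python
# Hypothetical step identifiers, for illustration only.
PDF_OCR = "pdf-ocr"
CONTENT_EXTRACTION = "content-extraction"


def extraction_runs_after_ocr(steps):
    """Return True if a Content Extraction step follows the PDF OCR step.

    Pipelines without a PDF OCR step are not affected by the constraint.
    """
    types = [step["type"] for step in steps]
    if PDF_OCR not in types:
        return True
    ocr_index = types.index(PDF_OCR)
    # Only steps listed after the OCR step can pick up the recognized text.
    return CONTENT_EXTRACTION in types[ocr_index + 1:]
```

A pipeline that lists Content Extraction before PDF OCR fails this check, because the extraction step would run before any OCR text is available.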

Configuration

Field          Default   Description                                          UI Setting
replace_file   True      Replace original PDF file                            ...
ocr_timeout    60        Maximum time in seconds spent on OCR per document    ...
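As a rough illustration of how the two fields and their defaults fit together, the sketch below merges user overrides into the documented defaults and rejects obviously invalid values. The helper `build_pdf_ocr_config` is hypothetical and not part of the product; only the field names and default values come from this page.

```python
# Field names and defaults as documented on this page.
DEFAULTS = {"replace_file": True, "ocr_timeout": 60}


def build_pdf_ocr_config(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown PDF OCR fields: {sorted(unknown)}")

    config = {**DEFAULTS, **overrides}

    if not isinstance(config["replace_file"], bool):
        raise TypeError("replace_file must be a boolean")

    timeout = config["ocr_timeout"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        raise TypeError("ocr_timeout must be a number of seconds")
    if timeout <= 0:
        raise ValueError("ocr_timeout must be positive")

    return config
```

For example, `build_pdf_ocr_config(ocr_timeout=120)` keeps `replace_file` at its default and allows up to 120 seconds of OCR per document.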

This page can now be found at PDF OCR on the Squirro Docs site.