Scaling Pipelet Execution

The Ingester Service process forwards items and the configured pipelet-configuration to the Plumber Service and waits for it’s response

Bottleneck Plumber

  • Priorly, per default, one plumber worker process gets spawned

  • That may lead to inefficient pipeline processing because of one slow step

    • Ingester process may fail with a timeout if the plumber doesn’t manage to respond in time (TimeoutError)

    • That happens usually for batches (default N=1000) that contain mostly large PDFs combined with a Pipelet-Step that performs computational heavy CPU-bound tasks (like the NLP-Tagger)

Configuration to Increase Throughput

Ingester Service

  • The Ingester service can spawn multiple worker processes to parallelise the processing of batched steps like pipelet, language-detection, ml-workflow, etc.

  • One Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelisation and increase throughput (since Release 3.3.4).

    • Those mini-batches are handled and sent concurrently to the Plumber Service, using a ThreadPool maintaining step_plumber_mini_batch_threads threads.

1 2 3 4 5 6 $ /etc/squirro/ingester.ini [ingester] processors = 2 [pipeline] step_plumber_mini_batch_threads = 2

Note

  • ingester.ini has a related configuration processor.workers

    • That setting is used only for pipeline-steps that get executed in parallel using a thread-pool (like webshot-step).

Plumber Service

  • With the example configuration above, the Plumber Service should spawn 4 workers to have always enough resources ready to handle incoming mini-batches served by Ingester processes at any time (ingester.processors x ingester.pipeline.step_plumber_mini_batch_threads
    = plumber.server.max_spare = 4)

1 2 3 4 $ /etc/squirro/plumber.ini [server] fork = true max_spare = 4