The Ingester Service process forwards items and the configured pipelet configuration to the Plumber Service and waits for its response.

Bottleneck Plumber

Previously, by default, only one Plumber worker process was spawned.
A single slow step could then make pipeline processing inefficient:
- The Ingester process may fail with a TimeoutError if the Plumber does not respond in time.
- This usually happens for batches (default N=1000) that contain mostly large PDFs, combined with a pipelet step that performs computationally heavy, CPU-bound tasks (such as the NLP Tagger).

The Ingester service can spawn multiple worker processes to parallelise the processing of batched steps like pipelet, language-detection, ml-workflow, etc.
Since Release 3.3.4, each Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelisation and increase throughput.
- These mini-batches are sent to the Plumber Service concurrently by a thread pool that maintains step_plumber_mini_batch_threads threads.

$ /etc/squirro/ingester.ini
[ingester]
processors = 2

[pipeline] 
step_plumber_mini_batch_threads = 2
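
The mini-batching described above can be illustrated with a minimal Python sketch. This is not the Ingester's actual code: the function names, the rounding of √len(batch_items) and the send_to_plumber stand-in are all assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_mini_batches(items):
    """Split items into roughly sqrt(len(items)) mini-batches.

    Assumption: the count is rounded down with math.isqrt; the real
    Ingester may round differently.
    """
    if not items:
        return []
    n_batches = max(1, math.isqrt(len(items)))
    size = math.ceil(len(items) / n_batches)
    return [items[i:i + size] for i in range(0, len(items), size)]


def send_to_plumber(mini_batch):
    # Hypothetical stand-in for the request to the Plumber Service.
    return [item.upper() for item in mini_batch]


def process_batch(items, threads=2):
    """Send mini-batches concurrently; threads plays the role of
    step_plumber_mini_batch_threads."""
    mini_batches = split_into_mini_batches(items)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the order of the mini-batches.
        results = list(pool.map(send_to_plumber, mini_batches))
    return [item for batch in results for item in batch]
```

Under these assumptions, a default batch of 1000 items yields 31 mini-batches, of which at most step_plumber_mini_batch_threads are in flight per Ingester process at any time.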

Note

ingester.ini also has a related setting, processor.workers.
- That setting is used only for pipeline steps that are executed in parallel using a thread pool (such as the webshot step).

With the example configuration above, the Plumber Service should spawn 4 workers so that enough resources are always ready to handle the mini-batches sent by the Ingester processes (ingester.processors × ingester.pipeline.step_plumber_mini_batch_threads = plumber.server.max_spare = 4).

$ /etc/squirro/plumber.ini
[server]
fork = true
max_spare = 4
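
The sizing rule can be checked with a short Python sketch that reads the two Ingester settings and multiplies them. The helper name recommended_max_spare is hypothetical, and the configuration is inlined as a string rather than read from /etc/squirro/ingester.ini so that the example is self-contained.

```python
import configparser

# Same values as the ingester.ini example in this section.
INGESTER_INI = """
[ingester]
processors = 2

[pipeline]
step_plumber_mini_batch_threads = 2
"""


def recommended_max_spare(ingester_ini_text):
    """Return processors x step_plumber_mini_batch_threads."""
    config = configparser.ConfigParser()
    config.read_string(ingester_ini_text)
    processors = config.getint("ingester", "processors")
    threads = config.getint("pipeline", "step_plumber_mini_batch_threads")
    return processors * threads


print(recommended_max_spare(INGESTER_INI))  # prints 4
```

If either Ingester setting is changed, the same product gives the value to consider for max_spare in the [server] section of plumber.ini.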