The Ingester Service process forwards items and the configured
pipelet-configuration to the Plumber Service and waits for it’s response
Priorly, per default, one plumber worker process gets spawned
That may lead to inefficient pipeline processing because of one slow step
Ingester process may fail with a timeout if the plumber doesn’t manage to respond in time (TimeoutError)
That happens usually for batches (default N=1000) that contain mostly large PDFs combined with a Pipelet-Step that performs computational heavy CPU-bound tasks (like the NLP-Tagger)
The Ingester service can spawn multiple worker processes to parallelise the processing of batched steps like
pipelet, language-detection, ml-workflow, etc.
One Ingester worker process consumes one batch and splits it into
√len(batch_items) mini-batches to allow further parallelisation and increase throughput (since Release 3.3.4).
Those mini-batches are handled and sent concurrently to the Plumber Service, using a ThreadPool maintaining
$ /etc/squirro/ingester.ini [ingester] processors = 2 [pipeline] step_plumber_mini_batch_threads = 2
ingester.ini has a related configuration
That setting is used only for pipeline-steps that get executed in parallel using a thread-pool (like webshot-step).
With the example configuration above, the Plumber Service should spawn 4 workers to have always enough resources ready to handle incoming mini-batches served by Ingester processes at any time (
ingester.processors x ingester.pipeline.step_plumber_mini_batch_threads
= plumber.server.max_spare = 4)
$ /etc/squirro/plumber.ini [server] fork = true max_spare = 4