The Ingester Service process forwards items and the configured pipelet configuration
to the Plumber Service and waits for its response.
Bottleneck Plumber
Previously, by default, only one Plumber worker process was spawned.
That can lead to inefficient pipeline processing, because one slow step can hold up the whole pipeline.
The Ingester process may fail with a TimeoutError if the Plumber does not respond in time.
This usually happens for batches (default N=1000) that contain mostly large PDFs and run through a pipelet step that performs computationally heavy, CPU-bound tasks (such as the NLP Tagger).
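The failure mode described above can be illustrated with a small, self-contained sketch. It is not Squirro code: `slow_step` and `call_plumber` are hypothetical stand-ins, the sleep durations are arbitrary, and a one-thread pool is assumed here to play the role of the single Plumber worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_step(seconds):
    """Hypothetical stand-in for a CPU-heavy pipelet step."""
    time.sleep(seconds)
    return "done"


def call_plumber(pool, seconds, timeout):
    """Submit one batch and wait at most `timeout` seconds for the reply."""
    future = pool.submit(slow_step, seconds)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The caller gives up, although the worker is still busy.
        return "timeout"


# A pool with one worker models the single Plumber worker process.
with ThreadPoolExecutor(max_workers=1) as single_worker:
    fast = call_plumber(single_worker, seconds=0.01, timeout=1.0)
    slow = call_plumber(single_worker, seconds=0.3, timeout=0.05)

print(fast, slow)
```

In this sketch the second call gives up before the worker finishes, which is the same shape of problem as an Ingester process waiting on an overloaded Plumber.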
Configuration to Increase Throughput
Ingester Service
The Ingester service can spawn multiple worker processes to parallelise the processing of batched steps like pipelet, language-detection, ml-workflow, etc.
One Ingester worker process consumes one batch and splits it into √len(batch_items) mini-batches to allow further parallelisation and increase throughput (since Release 3.3.4). Those mini-batches are handled and sent concurrently to the Plumber Service, using a ThreadPool maintaining step_plumber_mini_batch_threads threads.
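The splitting and concurrent dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Ingester's actual implementation: the square root is assumed to be the number of mini-batches (rounded down), and `split_into_mini_batches`, `send_to_plumber` and `process_batch` are hypothetical names, with `send_to_plumber` standing in for the real request to the Plumber Service.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_mini_batches(batch_items):
    """Split a batch into roughly sqrt(len(batch_items)) mini-batches.

    Assumption: the square root gives the number of mini-batches.
    """
    if not batch_items:
        return []
    count = max(1, math.isqrt(len(batch_items)))
    size = math.ceil(len(batch_items) / count)
    return [batch_items[i:i + size] for i in range(0, len(batch_items), size)]


def send_to_plumber(mini_batch):
    """Hypothetical stand-in for the call to the Plumber Service."""
    return [item.upper() for item in mini_batch]


def process_batch(batch_items, threads=2):
    """Send all mini-batches concurrently, using `threads` pool threads."""
    mini_batches = split_into_mini_batches(batch_items)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in submission order, so item order is kept.
        results = pool.map(send_to_plumber, mini_batches)
        return [item for mini_batch in results for item in mini_batch]


items = [f"doc-{i}" for i in range(1000)]
print(len(split_into_mini_batches(items)))  # 31 mini-batches for N=1000
print(len(process_batch(items, threads=2)))
```

Here `threads` plays the role of step_plumber_mini_batch_threads: it limits how many mini-batches of one batch are in flight at the same time.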
```ini
# /etc/squirro/ingester.ini
[ingester]
processors = 2

[pipeline]
step_plumber_mini_batch_threads = 2
```
Note
ingester.ini has a related configuration, processor.workers.
That setting is used only for pipeline steps that are executed in parallel using a thread pool (like the webshot step).
Plumber Service
With the example configuration above, the Plumber Service should spawn 4 workers so that it always has enough resources ready to handle the mini-batches sent by the Ingester processes (ingester.processors x ingester.pipeline.step_plumber_mini_batch_threads = plumber.server.max_spare = 4).
```ini
# /etc/squirro/plumber.ini
[server]
fork = true
max_spare = 4
```
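As a quick check of the sizing rule, the following sketch derives the Plumber max_spare value from an ingester.ini fragment. It is only an illustration: `recommended_max_spare` is a hypothetical helper, not part of Squirro, and the configuration is inlined as a string rather than read from /etc/squirro/ingester.ini.

```python
import configparser

# Same values as the example ingester.ini above, inlined for the sketch.
INGESTER_INI = """
[ingester]
processors = 2

[pipeline]
step_plumber_mini_batch_threads = 2
"""


def recommended_max_spare(ingester_ini_text):
    """Return processors x step_plumber_mini_batch_threads."""
    parser = configparser.ConfigParser()
    parser.read_string(ingester_ini_text)
    processors = parser.getint("ingester", "processors")
    threads = parser.getint("pipeline", "step_plumber_mini_batch_threads")
    return processors * threads


max_spare = recommended_max_spare(INGESTER_INI)
print(f"[server]\nfork = true\nmax_spare = {max_spare}")
```

If either Ingester setting changes, recomputing the product gives the matching max_spare for plumber.ini.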
...
This page can now be found at Scaling Pipelet Execution on the Squirro Docs site.