Location: /etc/squirro/ingester.ini
Description: ingester configuration

Code Block
[ingester]
# number of new pipeline processors
processors = 5

# if there are no items to process, how long to wait until checking again
sleep_in_secs_if_no_batch = 5

# how often the ingester file reaper should check for orphaned files in the content stream
clean_up_interval_in_hours = 1

[processor]
# number of workers to use on non-batched steps
workers = 10

# maximum number of times to retry an item that failed a processing step
max_retries = 10


Location: /etc/squirro/common.ini
Description: data retention configuration

Code Block
[content_filesystem_stream]

# root directories for ingester content stream (space separated if more than one)
data_directories = /var/lib/squirro/inputstream

# number of days and hours we keep around item batches that failed to be ingested
# total time is days + hours
days_to_retain_failed_batches = 30
hours_to_retain_failed_batches = 0


Location: /var/log/squirro/ingester/
Description: ingester log folder with the following files:

ingester.log[.123]: log file of the main ingester process. It mainly contains information about the subprocesses (of which there are 5 by default) that perform the actual work of processing data.

processor_?/processor.log[.123]: log files of the stateless processors that perform the actual work. At the default logging level, these files contain information about each item and each batch of items.

Location: /var/lib/squirro/inputstream/
Description: ingester data folder where input data is stored temporarily and processed in the order it arrived.

Subdirectories have the form YEAR-MONTH-DAY-HOUR in UTC, e.g. 2018-04-21-10 for 10am UTC on April 21, 2018, and hold the data files that arrived during that hour. Once such an "hourly data bucket" has been drained, the directory is automatically removed unless it corresponds to the current hour.

The "failed" subdirectory holds failed items if there are permanently failed data batched (by default data batches are retried 10 times - if a batch fails that many times). By default failed data files are removed after 30 days.

Monitoring the backlog

The Squirro REST API provides the following methods.


Endpoint: https://topic_api_url/v0/ingester/status (GET)
Purpose: exposes metrics about the internal state of the ingester
Parameter: sections, which accepts the following values:

backlog_total

Returns the number of data batches and items, e.g.:

Code Block
# curl http://localhost:81/api/topic/v0/ingester/status.json?sections=backlog_total 2>/dev/null | python -m json.tool
{
    "backlog_total": {
        "batches": 20,
        "items": 20000
    }
}
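
To watch the backlog drain, you can poll this endpoint periodically; a minimal sketch:

Code Block
# refresh the total backlog every 10 seconds
watch -n 10 "curl -s 'http://localhost:81/api/topic/v0/ingester/status.json?sections=backlog_total'"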


backlog_detailed

Breaks down the backlog by subscription_id (or "source" id in the Squirro GUI)

Code Block
# curl http://localhost:81/api/topic/v0/ingester/status.json?sections=backlog_detailed 2>/dev/null | python -m json.tool
{
    "backlog_detailed": {
        "91gEcdudThKClMtGPIP7Yw": {
            "batches": 1,
            "items": 10
        }
    }
}


errors_total

Shows failed data batches by subscription_id:

Code Block
# curl http://localhost:81/api/topic/v0/ingester/status.json?sections=errors_total 2>/dev/null | python -m json.tool
{
    "errors_total": {
        "91gEcdudThKClMtGPIP7Yw": 4,
        "LXX80w79QU2o0heCsP1QOw": 3,
        "S3o_x5cUSIaC88Y0vaItYw": 1,
        "WsE9c6VvT_OCPuYioVpdIQ": 3,
        "etIQ4onVR-6xScUINxY8tA": 88,
        "fCa9uuLjRpODz3wUCVFYwQ": 4,
        "qVBkjSCrSyqcSwyEoL0VHA": 3
    }
}


errors

Shows batch- and item-level error information; in this example, a read timeout to Elasticsearch in the "deduplication" step:

Code Block
# curl http://localhost:81/api/topic/v0/ingester/status.json?sections=errors 2>/dev/null | python -m json.tool
{
    "errors": {
        "91gEcdudThKClMtGPIP7Yw": [
            {
                "item_history": {
                    "-UHu4pPOpECc5yIyVERGDA": "--> unshorten-link (403) --> Batch hKgzx6P8RPm4a3w4ps5ytA [ deduplication [('TIMEOUT', ConnectionTimeout('TIMEOUT', \"HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=10)\", ReadTimeoutError(\"HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=10)\",)))] (10013) ] --error 1",
                    "7T9LZNsMJElPHcwD8GUEUg": "--> unshorten-link (60) --> Batch hKgzx6P8RPm4a3w4ps5ytA [ deduplication [('TIMEOUT', ConnectionTimeout('TIMEOUT', \"HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=10)\", ReadTimeoutError(\"HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=10)\",)))] (10013) ] --error 1"
                },
                "timestamp": 1524132286.6668465
            }
        ]
    }
}
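
The timestamp field is a Unix epoch timestamp; to turn it into a readable UTC date:

Code Block
# convert the epoch timestamp from the errors output to UTC
python -c "import datetime; print(datetime.datetime.utcfromtimestamp(1524132286.6668465))"
# -> 2018-04-19 10:04:46 UTC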


Endpoint: https://topic_api_url/v0/ingester/action (POST)
Purpose: modifies the internal state and processing of the ingester
Parameter: action, which accepts the following values:

reset

DANGEROUS, USE WITH CARE: this short command is irreversible. It deletes all data held in the ingester's input stream.

Code Block
# curl -XPOST http://localhost:81/api/topic/v0/ingester/action.json -d '{"action": "reset"}'
{"status": "OK"}


reset_subscription

Deletes all data held on behalf of the specified subscription_id

Code Block
# curl -XPOST http://localhost:81/api/topic/v0/ingester/action.json -d '{"action": "reset_subscription", "id": "1234567890123456789012"}'


reset_all_failed

Deletes all data held in the inputstream/failed directory

Code Block
# curl -XPOST http://localhost:81/api/topic/v0/ingester/action.json -d '{"action": "reset_all_failed"}'
{"status": "OK"}


reset_failed_subscription

Deletes data held in the inputstream/failed directory belonging to the specified subscription_id

Code Block
# curl -XPOST http://localhost:81/api/topic/v0/ingester/action.json -d '{"action": "reset_failed_subscription", "id": "1234567890123456789012"}'
{"status": "OK"}


Under the Hood

With SSH access to the /var/lib/squirro/inputstream/ file system and the /var/lib/squirro/ingester/processor_*/ directories, you can perform more detailed analysis and take actions such as retrying failed items even after their maximum number of retries has been exhausted.

Often you will find it useful to inspect the contents of the data directories (/var/lib/squirro/inputstream/) and the details in the log files (/var/log/squirro/ingester/processor_*) side by side, so we recommend opening two command line windows.

Internal formats

1. Input Stream

The input stream is represented by one level of directories that place data files into hourly buckets:

Code Block
2018-04-20-07/
    data_subscription_fYFpE1YIR1Syqpwn-9h7eA_batch_DtQv1x5nRpqH69IMRgTiKg_processor_LYhEYu6HTQi9heSNfLy5ZQ.json
    data_subscription_Q2ZTMosXSw23-3S-rwU45A_batch_foC5xOBBRqqESiHtTLkptA.json
2018-04-20-08/
    data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_batch_B5ANcYm3SWWxzQ76tyxhlg.json
    data_subscription_KnPiaiPiS1OxgkuxQ7NzzQ_batch_UPM-7cT_Q5a2y6X444wROA_processor_0gv18hZ8RRCNNLQCcIkuqQ.json
failed/
    data_subscription_fVTNoLbKTzO6nPRq1EmRwQ_batch_rTPJYGtRTRizp5rLh9BiUg.json
    data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_batch_w5MfDB1TRxSGZ4LmzP6YWg.json

Directories represent the UTC date and hour during which the data arrived and have the format YEAR-MONTH-DAY-HOUR. In addition, an optional "failed" directory holds data that failed to be enriched and inserted into Squirro. You can move failed batches back into any of the date folders if you believe the underlying error has been resolved, as sketched below.
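A minimal sketch of such a retry, using the failed batch file name from the listing above. This assumes that moving the file is sufficient; depending on the failure, you may also need to lower the "retries" counter stored in the file's first line (see the file format below):

Code Block
# move a failed batch back into the current hourly bucket for reprocessing
HOUR_DIR=/var/lib/squirro/inputstream/$(date -u +%Y-%m-%d-%H)
mkdir -p "${HOUR_DIR}"   # if created manually, keep ownership sqingest:squirro
mv /var/lib/squirro/inputstream/failed/data_subscription_fVTNoLbKTzO6nPRq1EmRwQ_batch_rTPJYGtRTRizp5rLh9BiUg.json "${HOUR_DIR}/"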

The data file names contain the subscription id, the batch id (a unique identifier generated and handed out at upload time), and, if a particular data file is currently being processed, a processor id that maps to one of the n processors (5 by default). These identifiers are useful for connecting data batches to the specific Squirro project and sources/subscriptions, and to more detailed error information in case something has gone wrong.

For example, to find the project, pipeline workflow, and source title, run the following:

Code Block
# SUBSCRIPTION_ID=I_1ZaeAwQdeSylnGwLuA8Q
# mysql topic -e "select concat((select title from projects where id in (select project_id from objects where id = object_id)), ':', (select name from workflows where id = workflow_id), ':', (select title from sources where id = source_id)) as source from subscriptions where id = \"${SUBSCRIPTION_ID}\";"
+-----------------------------------------------------+
| source                                              |
+-----------------------------------------------------+
| Investment Portfolio:Full:Latest Articles in Stocks |
+-----------------------------------------------------+

Each data file is a text file containing one JSON document per line. The first line of each file holds the "batch_history", which is only populated in the case of failures and retries. Each subsequent line contains a dictionary in JSON format representing a future Squirro item:

Code Block
{"batch_history":["G1CyGkMtRsWmmmSO_rzM3w"],"retries":10,"wait_until":15242743
31,"item_history":{"tdyNwCqj67lDox22M2PcgQ":"--> unshorten-link (220) --> Batch G1CyGkMtRsWmmmSO_rzM3w [ deduplication (14) ] --> content-augmentation (18) --> content-conversion (270) --> Batch G1CyGkMtRsWmmmSO_rzM3w [ language-detection (22) ] --> boilerplate-removal (612) --> nearduplicate-detection (893) --> webshot (308) --> cleanup (123) --> Batch G1CyGkMtRsWmmmSO_rzM3w [ index (260) --> filtering [HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=120)] (120165) ] --error 1"}}
{"body":"<html><body>...<\/body><\/html>","new_assoc_subscriptions":["ZYq7KRLHRW6c839VFWe45Q"],"title":"title 1","created_at":"2018-04-15T05:28:22","providers":["bing"],"item_created_at":"2018-04-15T00:47:00","provider":"bing","source_id":"dlQ1BJowSHyFF6stfRsYoQ","id":"Gff11vmq85KNp9tvkGGchQ"}
{"body":"<html><body>...<\/body><\/html>","new_assoc_subscriptions":["ZYq7KRLHRW6c839VFWe45Q"],"title":"title 2","created_at":"2018-04-15T05:28:22","providers":["bing"],"item_created_at":"2018-04-15T00:47:00","link":"http:\/\/something.com\/","provider":"bing","source_id":"dlQ1BJowSHyFF6stfRsYoQ","id":"Gff2345q85KNp9tvkGGchQ"}

2. Log Files

To locate activity on behalf of subscription "KnPiaiPiS1OxgkuxQ7NzzQ" and batch "UPM-7cT_Q5a2y6X444wROA" from above, you might run:

Code Block
# grep UPM-7cT_Q5a2y6X444wROA /var/log/squirro/ingester/processor_?/processor.log*
/var/log/squirro/ingester/processor_3/processor.log:PV:- MainThread squirro.service.ingester.processor 2018-04-20 10:58:39,833 INFO     Batch UPM-7cT_Q5a2y6X444wROA [ deduplication (104) ] processed 1000 items
/var/log/squirro/ingester/processor_3/processor.log:PV:- MainThread squirro.service.ingester.processor 2018-04-20 10:59:05,913 INFO     Batch UPM-7cT_Q5a2y6X444wROA [ language-detection (146) ] processed 1000 items
/var/log/squirro/ingester/processor_3/processor.log:PV:- MainThread squirro.service.ingester.processor 2018-04-20 10:59:42,932 INFO     Batch UPM-7cT_Q5a2y6X444wROA [ index (855) --> filtering (14695) --> cache (33) ] processed 1000 items
/var/log/squirro/ingester/processor_3/processor.log:PV:- MainThread squirro.service.ingester.processor 2018-04-20 10:59:42,934 INFO     Processed u'EJNAMLR_-_7vE72Tsv2D2w' ('KnPiaiPiS1OxgkuxQ7NzzQ') --> unshorten-link (1) --> Batch UPM-7cT_Q5a2y6X444wROA [ deduplication (104) ] --> content-augmentation (1) --> content-conversion (275) --> Batch UPM-7cT_Q5a2y6X444wROA [ language-detection (146) ] --> nearduplicate-detection (285) --> webshot (0) --> cleanup (0) --> Batch UPM-7cT_Q5a2y6X444wROA [ index (855) --> filtering (14695) --> cache (33) ]
/var/log/squirro/ingester/processor_3/processor.log:PV:- MainThread squirro.service.ingester.processor 2018-04-20 10:59:42,934 INFO     Processed u'ymBL3UGBP87Kyv9OJKSMsQ' ('KnPiaiPiS1OxgkuxQ7NzzQ') --> unshorten-link (1) --> Batch UPM-7cT_Q5a2y6X444wROA [ deduplication (104) ] --> content-augmentation (2) --> content-conversion (225) --> Batch UPM-7cT_Q5a2y6X444wROA [ language-detection (146) ] --> nearduplicate-detection (181) --> webshot (4) --> cleanup (1) --> Batch UPM-7cT_Q5a2y6X444wROA [ index (855) --> filtering (14695) --> cache (33) ]

The above log lines show how a data batch has executed some steps batched (deduplication, language-detection, index, filtering, and cache) and others in parallel (last two lines above): unshorten-link, content-augmentation, content-conversion, nearduplicate-detection, webshot, and cleanup. The numbers in parentheses after each step indicate the number of milliseconds that step took.

Note that large durations in non-batched steps (in INFO log lines containing the word "Processed") have a larger impact, particularly if the batch size is large (as in the example above, with a batch size of 1000). Batched steps and their durations are logged inside square brackets [ ], e.g. Batch UPM-7cT_Q5a2y6X444wROA [ language-detection (146) ], while non-batched parallel steps are not, e.g. nearduplicate-detection (181). Because parallel steps share a pool of 10 threads (by default), a duration of 181 ms per item for nearduplicate-detection in reality adds up to 181 ms * 1000 items / 10 threads, i.e. over 18 seconds, while language-detection takes 146 milliseconds for all 1000 items in the batch together.
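
The same arithmetic on the command line, using the numbers from the example above:

Code Block
# wall-clock estimate for a non-batched step: per-item ms * items / threads
echo $((181 * 1000 / 10))   # 18100 ms, i.e. ~18 seconds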

Also note that the logged steps use internal names which, as of Squirro release 2.6.0, tend to differ from the external names in the user interface:


Internal Name              Enrichment Type
boilerplate-removal        Noise Removal
cache                      Cache Cleaning
cleanup                    Content Standardization
content-augmentation       Content Augmentation
content-conversion         Content Extraction
deduplication              Deduplication
filtering                  Search Tagging and Alerting
language-detection         Language Detection
nearduplicate-detection    Near-Duplicate Detection
pipelet                    Pipelet (with custom name)
unshorten-link             Unshorten Link
webshot                    Thumbnail Extraction
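
Since the logs use the internal names, grep for the internal name when looking for a given enrichment in the processor logs, e.g. for "Search Tagging and Alerting":

Code Block
# recent activity of the filtering (Search Tagging and Alerting) step
grep "filtering" /var/log/squirro/ingester/processor_?/processor.log | tail -20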

Debugging Scenarios

These are some of the scenarios that we have worked through repeatedly and that might be useful to you.

1. Why is data failing to arrive?

Suppose the "api/topic/v0/ingester/status.json?sections=errors_total" endpoint returns that there are errors. One way to proceed is to look for the batch_history entry in /var/lib/squirro/inputstream/failed/ files or in files that are still active in date-time folders /var/lib/squirro/inputstream/????-??-??-??/ perhaps filtered by a subscription_id, e.g:

Code Block
# ll -tr /var/lib/squirro/inputstream/*/data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_*
-rw-r--r--. 1 sqingest squirro 508092 Apr 16 11:22 /var/lib/squirro/inputstream/failed/data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_batch_BZPZXCWBQYaZcBMtwLo7gA.json
-rw-r--r--. 1 sqingest squirro 222809 Apr 17 16:12 /var/lib/squirro/inputstream/failed/data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_batch_w5MfDB1TRxSGZ4LmzP6YWg.json
-rw-r--r--. 1 sqingest squirro 555573 Apr 20 14:14 /var/lib/squirro/inputstream/2018-04-20-12/data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_batch_6urzTZ3DTOS-v_l9Uv8njg.json

The list above provides batch ids to investigate further in the logs, but first it usually makes sense to look at the tail end of the batch history. As the batch history tends to be long, we suggest looking only at, say, the last 160 characters:

Code Block
# grep "batch_history" /var/lib/squirro/inputstream/*/data_subscription_I_1ZaeAwQdeSylnGwLuA8Q_* | sed -e "s/^.*\(.\{160\}\)/\1/"
 B5ANcYm3SWWxzQ76tyxhlg [ index (275) --> filtering [HTTPConnectionPool(host='127.0.0.1', port=81): Read timed out. (read timeout=120)] (120205) ] --error 10"}}
duplicate-detection (96) --> webshot (135) --> cleanup (7) --> Batch BZPZXCWBQYaZcBMtwLo7gA [ index (158) --> filtering (30382) ] --error NO MORE RETRIES: 10"}}
al (268) --> nearduplicate-detection (526) --> webshot (182) --> cleanup (13) --> Batch qZSekOqcRT-2CaETDvo5dQ [ index (93) --> filtering (82358) ] --error 1"}}

In the example scenario above, it appears that all three occurrences were due to problems with the "filtering" step, externally "Search Tagging and Alerting". The first instance points to a timeout of 2 minutes (120,000 milliseconds), while the second and third also failed in "filtering" but with less detail. The first case suggests increasing the filtering service timeout or perhaps submitting smaller batches, but how do we find more details on the latter two occurrences?

Here the two batch ids "BZPZXCWBQYaZcBMtwLo7gA" and "qZSekOqcRT-2CaETDvo5dQ" help to drill down further using the logs under /var/log/squirro/ingester/processor_?/processor.log*. Pulling the date and time to the front of each log line lets you look at all activity on that particular batch in chronological order, which is useful because more than one processor has likely worked on the batch across different retries.

Code Block
# grep "qZSekOqcRT-2CaETDvo5dQ" /var/log/squirro/ingester/processor_?/processor.log* | sed -e "s/^\(.*\)\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] \)/\2\1/" | sort
2018-04-17 16:12:55,225 /var/log/squirro/ingester/processor_3/processor.log.2:PV:- MainThread squirro.service.ingester.processor INFO     Processed u'mNkU6HiN12HQLv7L0Dhs3Q' ('I_1ZaeAwQdeSylnGwLuA8Q') --> unshorten-link (216) --> Batch qZSekOqcRT-2CaETDvo5dQ [ deduplication (21) ] --> content-augmentation (16) --> content-conversion (7) --> Batch qZSekOqcRT-2CaETDvo5dQ [ language-detection (258) ] --> boilerplate-removal (704) --> nearduplicate-detection (273) --> webshot (154) --> cleanup (156) --> Batch qZSekOqcRT-2CaETDvo5dQ [ index (93) --> filtering (82358) ] --error 1

This shows that processor_3 worked on the batch most recently, which lets us inspect that log file more closely to reveal:

Code Block
PV:- MainThread squirro.service.ingester.processor 2018-04-17 16:12:55,205 INFO     Batch qZSekOqcRT-2CaETDvo5dQ [ index (93) --> filtering (82358) no more items ] processed 0 items
PV:- MainThread squirro.service.ingester.processor 2018-04-17 16:12:55,205 INFO     Item_ids to retry are [u'3UQ7q17Q0pq3lohXhrAX1A', u'CWJ3IOyiH5cAybyqYUWG1A', u'mNkU6HiN12HQLv7L0Dhs3Q', u'6hE9yD1vWriHvXiaNeq0Gw', u'ZfFnPSCv45Xm2qWcVSvVpA', u'raHy2ljSNrVvkf_7uE4_ZQ', u'Yu0jsEGvbDjoK0-Xu2BOfw', u'-jsVk2IE1bmGN1EqvSww4A', u'o8ibQNpaaJRXcvkpGvLOAA', u'Ri1niXuX-H6Z-zOucGXzEQ', u'LFi3FzO2IuI-3J0eMTyI8Q', u'OyF7KiMcQ4RH5-P2ilQJaQ', u'4m1XGg1JNpSiJ9UAAbhCsg', u'205C9xY0PovrRkuktqDqjA', u'ucGylIz8ovtqPEnZKhHBaw']

The filtering step calls the filtering service. Searching the filtering service's logs around the rough time of the call, "2018-04-17 16:12", reveals the following:

Code Block
PV:- Thread-749 squirro.lib.topicproxy 2018-04-17 16:12:55,200 WARNING  unable to get locator for project u'NM_Ee9nRQqyodzC9MIXssw'
PV:- Thread-749 squirro.service.filtering.main 2018-04-17 16:12:55,200 ERROR    Failed to filter items
Traceback (most recent call last):
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/main.py", line 241, in POST
    self.filter_processor.process_batch(items, global_config_dict)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/processor.py", line 76, in process_batch
    self._perform_actions(batch_context, item_context)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/processor.py", line 138, in _perform_actions
    filter_obj, item_context.item_dict, batch_context)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/processor.py", line 232, in _get_filter_if_matches_item
    return runner.check_match(item_dict)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/query.py", line 129, in check_match
    items = self.retrieve_items(query)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/query.py", line 99, in retrieve_items
    res = self._read_items(query)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/query.py", line 104, in _read_items
    reader = self._get_reader()
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/service/filtering/query.py", line 111, in _get_reader
    locator_dict = self.topicproxy.get_project_locator(project_id)
  File "/opt/squirro/virtualenv/lib/python2.7/site-packages/squirro/lib/topicproxy/__init__.py", line 216, in get_project_locator
    raise Exception()
Exception

The above happens to correspond to a bug that has since been fixed and patched with Squirro release 2.6.0-104.