Managing Elasticsearch

Squirro uses Elasticsearch on its storage nodes, where the indexed data is persisted. This guide outlines Elasticsearch configuration changes that are commonly made in Squirro deployments.

Please also refer to the official Elasticsearch documentation for help.

Where to find files and folders

For the default Squirro setup, these are the most important Elasticsearch folders and files on the Squirro storage node:

Location	Description
/etc/sysconfig/elasticsearch	Elasticsearch sysconfig file: environment settings such as `ES_HOME, CONF_DIR, DATA_DIR, LOG_DIR, ES_JAVA_OPTS`...
/etc/elasticsearch/elasticsearch.yml	Elasticsearch configuration file, e.g. cluster name, HTTP port...
/etc/elasticsearch/	Elasticsearch configuration directory: elasticsearch.yml, jvm.options, templates, scripts, synonyms (symlinks not allowed)
/var/log/elasticsearch/	Elasticsearch log folder (symlinks allowed)
/var/lib/elasticsearch/	Elasticsearch data folder, where the indices are stored (symlinks allowed)
/usr/share/elasticsearch/	Elasticsearch installation folder, containing /bin, /lib, /plugins

Tools to support Elasticsearch operations

Since version 5.x, Elasticsearch no longer supports site plugins such as head or kopf. Instead, you can use a standalone tool to view the status of the Elasticsearch cluster, indices and shards during an upgrade or any other operation, for example cerebro:

Installation:

Download from https://github.com/lmenezes/cerebro/releases
Extract files
Run bin/cerebro
Access on http://localhost:9000

To reach the Elasticsearch HTTP port of a storage node through local port 9500, open an SSH tunnel, e.g. ssh -N -L 9500:localhost:9200 {user}@{storagehost} (here localhost is resolved on the storage host). Then connect cerebro to http://localhost:9500 to see the status of the Elasticsearch cluster.

Monitoring

Check the size of indices in Elasticsearch

curl -XGET localhost:9200/_cat/indices?v

You should see a list of the Squirro indices with their sizes; each index name ends with the (lowercased) project ID.

Check that the cluster and nodes are healthy

curl http://localhost:9200/_cluster/health?pretty=true

You should see JSON containing "status" : "green"
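For scripted monitoring, the check above can be wrapped in a small shell helper. This is a minimal sketch, assuming bash and curl on the storage node; the function names and the Nagios-style exit codes (0 OK, 1 warning, 2 critical, 3 unknown) are illustrative choices, not part of Squirro. It relies on the _cat/health API with h=status, which prints only the status word.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of Squirro: map the Elasticsearch cluster
# status to Nagios-style exit codes for a monitoring check.

health_exit_code() {
  case "$1" in
    green)  echo 0 ;;  # all primary and replica shards assigned
    yellow) echo 1 ;;  # all primaries assigned, some replicas not
    red)    echo 2 ;;  # at least one primary shard unassigned
    *)      echo 3 ;;  # empty or unexpected answer, e.g. node unreachable
  esac
}

es_health_status() {
  # _cat/health?h=status prints only the status column, e.g. "green"
  curl -s "${ES_URL:-http://localhost:9200}/_cat/health?h=status" | tr -d '[:space:]'
}

# Example (hypothetical monitoring check):
#   exit "$(health_exit_code "$(es_health_status)")"
```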

Check Elasticsearch access from a cluster node

The cluster node accesses the storage node through the nginx service. To check this access, run on the cluster node:

curl -L http://127.0.0.1:81/ext/elastic/_cluster/health?pretty=true

You should see JSON containing "status" : "green"

Check that the Elasticsearch service is running

ps aux | grep '[e]lasticsearch'

If the service is running, this command also shows many of its settings, e.g. heap size (-Xms/-Xmx), default paths and the process ID:

496 122224 1 11 10:20 ? 00:02:14 /usr/bin/java -Xms6g -Xmx6g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Djava.security.policy=file:///etc/elasticsearch/squirro.java.policy -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid -d -Edefault.path.logs=/var/log/elasticsearch -Edefault.path.data=/var/lib/elasticsearch -Edefault.path.conf=/etc/elasticsearch

To start, stop or restart the Elasticsearch service:

systemctl {start|stop|restart} elasticsearch

View indices, status and size

By default, each Squirro project corresponds to one index named squirro_{template-version}_{lowercase-project-id}, e.g. squirro_v9_36syha3_ss-zwvn9gyk1ww

curl http://localhost:9200/_cat/indices?v
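To look up the index of one project, the naming pattern above can be applied in a shell helper. A minimal sketch, assuming the v9 template version used in the examples on this page; squirro_index_name and the mixed-case project ID used to test it are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the default index name of a Squirro project,
# following the pattern squirro_{template-version}_{lowercase-project-id}.
squirro_index_name() {
  local template_version="$1" project_id="$2"
  # Elasticsearch index names must be lowercase, hence the tr call
  printf 'squirro_%s_%s\n' "$template_version" \
    "$(printf '%s' "$project_id" | tr '[:upper:]' '[:lower:]')"
}

# Example: show size and status of one project's index only
#   curl "http://localhost:9200/_cat/indices/$(squirro_index_name v9 "$PROJECT_ID")?v"
```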

View shards, status and size

An Elasticsearch index usually consists of several shards. Their number is defined by the number_of_shards index setting.

curl http://localhost:9200/_cat/shards?v

Tip: there are many other useful _cat commands for investigating Elasticsearch; run curl 'http://localhost:9200/_cat' to list them.
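On clusters with many indices, counting shards per state can be quicker than reading the whole list. A minimal sketch, assuming bash and awk; count_shards_in_state is a hypothetical helper. The h= parameter of the _cat API selects and orders the columns, so the state is always the fourth field.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count shards in one state (STARTED, INITIALIZING,
# RELOCATING or UNASSIGNED). Reads `_cat/shards?h=index,shard,prirep,state`
# output on stdin, where the state is the fourth column.
count_shards_in_state() {
  awk -v want="$1" '$4 == want { n++ } END { print n + 0 }'
}

# Example:
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state' | count_shards_in_state UNASSIGNED
```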

View Squirro templates

Use this to make sure the Squirro templates are installed on the Squirro storage node:

# see detail of all templates 
curl -s http://localhost:9200/_template?pretty
# see only squirro template names
curl -s http://localhost:9200/_template?pretty | grep squirro_v

View number of shards and replicas

Use this to make sure the number of shards and replicas is set correctly on the Squirro storage node:

# in template
curl -s http://localhost:9200/_template?pretty | grep -e number_of -e squirro_v
# in setting of given index
curl -s http://localhost:9200/{index_name}/_settings?pretty | grep -e number_of -e squirro_v

Tip: you can use the wildcard * in {index_name}, e.g. /squirro_v9_*/

View mapping, setting of given index

Use this to make sure a newly created index uses the correct Squirro template:

curl http://localhost:9200/{index_name}/_mappings?pretty
curl http://localhost:9200/{index_name}/_settings?pretty

View elasticsearch stats

curl http://localhost:9200/_stats?pretty
# stats of given index
curl http://localhost:9200/{index_name}/_stats?pretty
# stats of nodes
curl http://localhost:9200/_nodes/stats?pretty
# stats of cpu and memory
curl http://localhost:9200/_nodes/stats/os?pretty
# stats of file system
curl http://localhost:9200/_nodes/stats/fs?pretty
# stats of jvm
curl http://localhost:9200/_nodes/stats/jvm?pretty
# stats of fielddata (field values loaded into memory for aggregations, sorting or scripts)
curl http://localhost:9200/_nodes/stats/indices/fielddata?pretty

Set replicas in a multi-node cluster

When you have multiple storage nodes, we suggest configuring at least 1 replica for the indices, so that the storage cluster keeps working if one node is shut down.

curl -XPUT http://localhost:9200/squirro_v9/_settings -H "Content-Type: application/json" -d '{"index": {"number_of_replicas": 1}}'
curl -XPUT http://localhost:9200/squirro_v9_*/_settings -H "Content-Type: application/json" -d '{"index": {"number_of_replicas": 1}}'
curl -XPUT http://localhost:9200/.configsync/_settings -H "Content-Type: application/json" -d '{"index": {"number_of_replicas": 1}}'
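To find indices whose replica count still differs from the desired value, for example after adding a storage node, the _cat/indices API can print just the index name and its replica count (h=index,rep). The helper below is a hypothetical sketch that filters that output with awk; review the printed list before changing anything, since it may include system indices.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the indices whose replica count differs from
# the expected one. Reads `_cat/indices?h=index,rep` output on stdin.
indices_with_wrong_replicas() {
  # $1: expected number of replicas
  awk -v want="$1" 'NF >= 2 && $2 != want { print $1 }'
}

# Example: list indices without exactly 1 replica, then fix them one by one
#   curl -s 'http://localhost:9200/_cat/indices?h=index,rep' | indices_with_wrong_replicas 1 |
#     while read -r idx; do
#       curl -XPUT "http://localhost:9200/$idx/_settings" -H "Content-Type: application/json" \
#         -d '{"index": {"number_of_replicas": 1}}'
#     done
```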

Investigate index content

View the number of documents and one sample document in the index

# query the whole index (quote the URL, otherwise the shell treats & as a background operator)
curl 'http://localhost:9200/{index_name}/_search?pretty&size=1'

The response contains the number of documents in hits.total, for example:

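{
  "hits" : {
    "total" : 12345,
    ...
    "hits" : [...list of items...]
  }
}

On Elasticsearch 7.x, hits.total is an object such as {"value": 10000, "relation": "gte"}, and totals above 10,000 are only counted exactly if the request sets "track_total_hits": true.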

Get an item by id

 curl 'http://localhost:9200/{index_name}/_doc/{item_id}?pretty'

Query number of items modified before a given time

curl 'http://localhost:9200/{index_name}/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "query": {
        "range" : {
            "modified_at" : {
                "lte" : "2018-02-13T18:01:44"
            }
        }
    },
    "size": 1
}'

Delete documents by query

Before deleting by query, always make sure the query returns only the documents you want to delete: run it as a search and review some of the results:

curl 'http://localhost:9200/{index_name}/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "query": {
        "term" : {
            "assoc:sources" : "123456789abcdef"
        }
    }
}'

If the index contains unwanted documents, e.g. all documents belonging to the source with ID "123456789abcdef", you can delete them with the same query:

curl -XPOST 'http://localhost:9200/{index_name}/_delete_by_query' -H 'Content-Type: application/json' -d '
{
    "query": {
        "term" : {
            "assoc:sources" : "123456789abcdef"
        }
    }
}'
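The check-then-delete routine above can be combined in one helper that first prints how many documents match and asks for confirmation before calling _delete_by_query. A minimal sketch, assuming bash, curl, and source IDs without double quotes; the function names are hypothetical. It uses the _count API, which accepts the same query body as _search.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: count the documents of one source, then delete them
# with _delete_by_query only after an explicit confirmation.

source_term_query() {
  # $1: source ID (assumed to contain no double quotes)
  printf '{"query": {"term": {"assoc:sources": "%s"}}}\n' "$1"
}

delete_source_documents() {
  local index="$1" source_id="$2" es="${ES_URL:-http://localhost:9200}" body answer
  body="$(source_term_query "$source_id")"
  # _count takes the same query body as _search and returns {"count": N, ...}
  curl -s -H 'Content-Type: application/json' "$es/$index/_count?pretty" -d "$body"
  read -r -p "Delete these documents from $index? [y/N] " answer
  [ "$answer" = "y" ] || return 1
  curl -s -XPOST -H 'Content-Type: application/json' \
    "$es/$index/_delete_by_query?pretty" -d "$body"
}

# Example:
#   delete_source_documents {index_name} 123456789abcdef
```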

Deleting by query does not immediately reduce the index size on disk, because deleted documents are only marked as deleted in their segments, not removed. To free disk space afterwards, run:

curl -XPOST http://localhost:9200/{index_name}/_forcemerge?only_expunge_deletes=true

Troubleshooting

Shards are UNASSIGNED

If the Elasticsearch status is not green (yellow or red) because of UNASSIGNED shards, see https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/ for possible solutions. However, we suggest contacting our engineers for advanced operations.

There are some simple operations you can do yourself:

Disk free space

Check the disk space and free some up if necessary: with default settings, Elasticsearch does not allocate shards to a node whose disk usage is above the low watermark (85%), so keep about 20% of the disk free. Use commands such as "df -h ..." and "du -h ..." to find out how much space is free and which unimportant files can be deleted (e.g. files in /var/log/squirro or /var/log/elasticsearch).
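Besides df on each machine, the _cat/allocation API reports the disk usage of every data node as Elasticsearch sees it (column disk.percent). A minimal sketch, assuming bash and awk; the helper name and the 80% default threshold (taken from the 20% free-space guideline above) are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print data nodes whose disk usage is at or above a
# threshold. Reads `_cat/allocation?h=node,disk.percent` output on stdin.
nodes_over_disk_threshold() {
  # $1: threshold in percent, default 80 (about 20% free space)
  # NF >= 2 skips lines without a disk figure, such as the UNASSIGNED row
  awk -v max="${1:-80}" 'NF >= 2 && $2 + 0 >= max { print $1, $2 "%" }'
}

# Example:
#   curl -s 'http://localhost:9200/_cat/allocation?h=node,disk.percent' | nodes_over_disk_threshold 80
```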

Number of replicas

If you have only one Elasticsearch instance running but number_of_replicas in the index settings is bigger than 0, that index also has "yellow" status with some unassigned shards. View them with:

curl -s localhost:9200/_cat/shards | grep UNASSIGNED

Check the number of replicas of an index:

curl -XGET http://localhost:9200/{index_name}/_settings?pretty | grep number_of_replicas

If number_of_replicas > 0, set it to 0:

curl -XPUT http://localhost:9200/{index_name}/_settings -H "Content-Type: application/json" -d '{"number_of_replicas":0}'

Also check the number of replicas in the template, so that future indices do not get a wrong number_of_replicas setting:

curl -XGET http://localhost:9200/_template/squirro_v9?pretty | grep number_of_replicas

If number_of_replicas > 0, edit squirro_v9.json and upload the template again:

vi /etc/elasticsearch/templates/squirro_v9.json 
# edit the line "number_of_replicas": ...,
bash /etc/elasticsearch/templates/ensure_templates.sh

Retry failed allocation

Check the explanation for the unassigned shard:

curl -XGET localhost:9200/_cluster/allocation/explain?pretty

Retry failed allocations:

curl -XPOST localhost:9200/_cluster/reroute?retry_failed

Indices are moved

Usually Elasticsearch indices are stored in /var/lib/elasticsearch/. However, for various reasons (the server went down and could not be restored to its previous state, a symlink was lost, an old Elasticsearch version stored the indices under the cluster name, ...), /var/lib/elasticsearch/ may be missing and the indices may be located on another mount point. To fix this issue:

1. Check content, permission and owner of /var/lib/elasticsearch folder

sudo ls -l /var/lib/elasticsearch/

You should see a folder named nodes, owned by user elasticsearch and group elasticsearch.

2. If /var/lib/elasticsearch is missing or its content is not the index you expect, then

# stop elasticsearch
systemctl stop elasticsearch

choose one of two solutions:

a. create a symlink to the new index location (if /var/lib/elasticsearch still exists as a directory, move it away first, otherwise ln creates the link inside it)

ln -s {your_new_index_location} /var/lib/elasticsearch

b. or point the Elasticsearch data directory to the new location in the config file:

vi /etc/elasticsearch/elasticsearch.yml
# edit the line path.data: {your_new_index_location}

3. Set the owner and restart the service

# set owner and group for data folder
chown -R elasticsearch:elasticsearch {your_index_location}
# start elasticsearch again
systemctl start elasticsearch
# check indices status
curl http://localhost:9200/_cat/indices?v
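Before creating symlinks or changing ownership, it helps to know which data directory Elasticsearch is configured to use. A minimal sketch, assuming path.data is written as a single-line value without a trailing comment in elasticsearch.yml (a YAML list is not handled); es_data_path is a hypothetical helper that falls back to the default /var/lib/elasticsearch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the data directory configured in elasticsearch.yml,
# or the default /var/lib/elasticsearch if path.data is not set.
# Only a single-line "path.data: /some/dir" entry is recognised.
es_data_path() {
  local conf="${1:-/etc/elasticsearch/elasticsearch.yml}" path
  # Commented-out lines ("#path.data: ...") do not match the pattern
  path="$(sed -n 's/^[[:space:]]*path\.data:[[:space:]]*//p' "$conf" 2>/dev/null | tail -n 1)"
  printf '%s\n' "${path:-/var/lib/elasticsearch}"
}

# Example: check owner and group of the configured data folder
#   stat -c '%U:%G %n' "$(es_data_path)"
```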

Too many scroll contexts

If you encounter this type of exception:

Trying to create too many scroll contexts. Must be less than or equal to: [500]

You can increase the limit of Elasticsearch by running this command on one of the nodes:

curl -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'{
    "persistent" : {
        "search.max_open_scroll_context": 1000
    },
    "transient": {
        "search.max_open_scroll_context": 1000
    }
}'

A better long-term alternative is to reduce the scroll argument of any squirro_client.scan() call from the default 5m to something like 1m. We have also filed an official improvement request to close contexts automatically in a future release (reference: SQ-13364).
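To see how close a node is to the limit, the node stats API reports the number of currently open scroll contexts per node (scroll_current under indices.search). Because the limit applies per node, the sketch below prints the highest per-node value. It is a minimal sketch using grep and awk; max_scroll_current is a hypothetical helper, and the grep pattern assumes the "key" : value spacing of ?pretty output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the highest number of open scroll contexts on
# any node. Reads the output of `_nodes/stats/indices/search?pretty` on stdin
# and relies on the '"key" : value' spacing of pretty-printed responses.
max_scroll_current() {
  grep -o '"scroll_current" : [0-9]*' | awk '{ if ($3 > m) m = $3 } END { print m + 0 }'
}

# Example:
#   curl -s 'http://localhost:9200/_nodes/stats/indices/search?pretty' | max_scroll_current
#
# Last resort: clear all scroll contexts. This also aborts every scan that is
# still running, e.g. in data loading jobs.
#   curl -XDELETE 'http://localhost:9200/_search/scroll/_all'
```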

Elasticsearch fails to start with “Unable to load JNA native support library, native methods will be disabled” error message in the log

This happens when Elasticsearch uses the /tmp/ folder and that folder is mounted with the noexec flag, or when another temporary folder is used in which the Elasticsearch service user has no execute rights.
The noexec flag is usually set on /tmp as part of OS hardening: bad actors can use the tmp folder to store and execute files, which is not acceptable on a highly hardened system.

The workaround is to edit /etc/sysconfig/elasticsearch and add this line:

ES_TMPDIR=/usr/share/elasticsearch/tmp

This is a sensible default, but any location works as long as the folder is owned by the elasticsearch user and group. Restart Elasticsearch afterwards.
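The diagnosis and the workaround can be scripted. A minimal sketch, assuming bash and util-linux findmnt (its -T option finds the file system containing a path); has_noexec and prepare_es_tmpdir are hypothetical helpers, and the second one needs root.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the JNA / noexec issue described above.

has_noexec() {
  # $1: comma-separated mount options, e.g. the output of
  #     findmnt -no OPTIONS -T /tmp
  case ",$1," in
    *,noexec,*) return 0 ;;
    *)          return 1 ;;
  esac
}

prepare_es_tmpdir() {
  # Create the temporary folder and give it to the elasticsearch user and group (needs root)
  local dir="${1:-/usr/share/elasticsearch/tmp}"
  mkdir -p "$dir" && chown elasticsearch:elasticsearch "$dir"
}

# Example:
#   if has_noexec "$(findmnt -no OPTIONS -T /tmp)"; then
#     prepare_es_tmpdir
#     echo 'ES_TMPDIR=/usr/share/elasticsearch/tmp' >> /etc/sysconfig/elasticsearch
#     systemctl restart elasticsearch
#   fi
```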

Memory setting

You can set the Elasticsearch heap size in the file /etc/elasticsearch/jvm.options.d/squirro.options. The minimum heap size (Xms) and the maximum heap size (Xmx) must be equal, for example:

-Xms8g
-Xmx8g

Elasticsearch may crash if the server does not have enough memory. Set Xmx to no more than 50% of the physical RAM, so that enough RAM is left for the kernel file system cache (https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html)

If the cluster node and the storage node run on the same machine, set the Elasticsearch heap to no more than 30% of the physical RAM.

Also keep the heap below 32GB: above roughly that size the JVM can no longer use compressed object pointers, which makes memory use less efficient.
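The sizing rules above can be turned into a small calculator. A minimal sketch, assuming bash; recommended_heap_gb and jvm_heap_options are hypothetical helpers, results are rounded down to whole gigabytes, and the 30GB cap is a conservative assumption for staying under the compressed-pointer threshold.

```bash
#!/usr/bin/env bash
# Hypothetical helpers turning the sizing rules above into numbers.

recommended_heap_gb() {
  # $1: physical RAM in GB
  # $2: "shared" if the Squirro cluster node runs on the same machine
  local ram_gb="$1" percent=50 heap
  if [ "${2:-}" = "shared" ]; then percent=30; fi
  heap=$(( ram_gb * percent / 100 ))       # integer division rounds down
  if [ "$heap" -gt 30 ]; then heap=30; fi  # assumed cap below the ~32GB limit
  if [ "$heap" -lt 1 ]; then heap=1; fi
  echo "$heap"
}

jvm_heap_options() {
  # Prints equal -Xms and -Xmx lines for squirro.options
  local gb
  gb="$(recommended_heap_gb "$@")"
  printf -- '-Xms%sg\n-Xmx%sg\n' "$gb" "$gb"
}

# Example: size the heap from /proc/meminfo (MemTotal is in kB), then copy the
# two printed lines into /etc/elasticsearch/jvm.options.d/squirro.options
#   ram_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
#   jvm_heap_options "$ram_gb"
```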

Warning

In older installations of Squirro, the amount of memory available to Elasticsearch was set in the file /etc/elasticsearch/jvm.options.

When both files are present in the system, /etc/elasticsearch/jvm.options.d/squirro.options overrides the options in /etc/elasticsearch/jvm.options.

Test an elasticsearch query

Sometimes the Elasticsearch log file contains an exception together with the request body that caused it, which can be quite long. To reproduce the request for investigation:

# save the query in JSON format, e.g. to /tmp/es_request.json
# send it to Elasticsearch, using the file as the request body:
curl 'http://localhost:9200/{index_name}/_search?pretty' -H 'Content-Type: application/json' -d @/tmp/es_request.json
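Queries copied from a log are easily truncated, which produces confusing parse errors. A minimal sketch of a helper that validates the file as JSON with python3 -m json.tool before sending it; es_replay_query is hypothetical and assumes python3 is installed on the storage node.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replay a query copied from the log, but only if the
# file is valid JSON (python3 -m json.tool exits non-zero otherwise).
es_replay_query() {
  local index="$1" body_file="$2"
  if ! python3 -m json.tool "$body_file" > /dev/null; then
    echo "not valid JSON: $body_file" >&2
    return 1
  fi
  curl -s -H 'Content-Type: application/json' \
    "${ES_URL:-http://localhost:9200}/$index/_search?pretty" -d @"$body_file"
}

# Example:
#   es_replay_query {index_name} /tmp/es_request.json
```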

Cluster Block Exception in ES logs

If the disk usage on an Elasticsearch node exceeds the flood-stage watermark (95% by default), Elasticsearch marks the indices that have shards on that node as read-only and only allows the deletion of indices or documents, so that disk space can be recovered. To make the indices writable again, first make sure that more than 20% of the disk is free, by removing old logs or unnecessary files, adding disk space or any other means, and then run the following command, which removes the read-only block from all indices:

curl -XPUT http://localhost:9200/*/_settings -H "Content-Type: application/json" -d '{"index.blocks.read_only_allow_delete": null}'

Recover from a corrupted index

Symptoms:

Elasticsearch cluster state is red
One or multiple shards are not allocated

Output of

curl -XGET localhost:9200/_cluster/allocation/explain?pretty

Looks like this:

{
  "index" : "squirro_v9_spzhmtdsrrc78oodbnolza",
  "shard" : 5,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-02-14T19:20:02.789Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "7IDt77EJR-uh5PknzF26_Q",
      "node_name" : "squirro-node-2a679356-9373-58ec-bad1-d812fbed0cad",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "xLLfLDCbTgaePDz_oivuQA",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/usr/share/elasticsearch/data/nodes/0/indices/7Md9uEZOTk-nAlZvRUvNmg/5/index/_oon.cfs\") [slice=_oon_Lucene50_0.tim]))"
            }
          }
        }
      }
    }
  ]
}

At this point, if you have replicas, snapshots or backups, the only right thing to do is to restore from those.
The steps below will get your index back up and running in green state, but you will most likely lose some documents.

We are going to use the Lucene CheckIndex utility to validate and fix the corrupted index.

Stop elasticsearch
Note the affected shards in the message above. In this example it is shard 5.
Locate the data folder of this shard; in my example this is /usr/share/elasticsearch/data/nodes/0/indices/7Md9uEZOTk-nAlZvRUvNmg/5/index (also printed in the above message)

Back up the affected shard folder, e.g.

tar czvf /tmp/shard5.tar.gz /usr/share/elasticsearch/data/nodes/0/indices/7Md9uEZOTk-nAlZvRUvNmg/5
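On a node with many indices, copying the shard path by hand is error-prone. A minimal sketch, assuming the explain output contains the file path in the path=\"...\" form shown above; both helper names are hypothetical. It extracts the first corrupt file path and strips the trailing /index/<file> part to get the shard folder. Run the explain call before stopping Elasticsearch.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for locating the corrupted shard folder.

corrupt_file_from_explain() {
  # stdin: output of _cluster/allocation/explain?pretty
  # prints the first file path found in a path=\"...\" fragment
  grep -o 'path=\\"[^\\]*' | head -n 1 | sed 's/^path=\\"//'
}

shard_dir_from_file() {
  # $1: .../indices/<index-uuid>/<shard>/index/<file>; strips "/index/<file>"
  printf '%s\n' "${1%/index/*}"
}

# Example: save the path while Elasticsearch runs, back up after stopping it
#   f=$(curl -s 'localhost:9200/_cluster/allocation/explain?pretty' | corrupt_file_from_explain)
#   tar czvf /tmp/shard-backup.tar.gz "$(shard_dir_from_file "$f")"
```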

Enter the lib folder of elasticsearch: cd /usr/share/elasticsearch/lib

Run the following command (adjust the folder to your situation):

java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/7Md9uEZOTk-nAlZvRUvNmg/5/index/ -verbose -exorcise

Check the summary output of the tool, in my case it was:

WARNING: 1 broken segments (containing 179 documents) detected
Took 545.101 sec total.
WARNING: 179 documents will be lost

NOTE: will write new segments file in 5 seconds; this will remove 179 docs from the index. YOU WILL LOSE DATA. THIS IS YOUR LAST CHANCE TO CTRL+C!
  5...
  4...
  3...
  2...
  1...
Writing...
OK
Wrote new segments file "segments_er0"

This is good news: the tool was able to fix the corrupted segment. But we lost 179 documents, and we don't know which ones!

Enter the index folder, in my case cd /usr/share/elasticsearch/data/nodes/0/indices/7Md9uEZOTk-nAlZvRUvNmg/5/index/
Remove any files that start with 'corrupted', in my case: rm corrupted_Qyj-NdANTo2vr-aUDR6l_g
Start elasticsearch
Check elasticsearch status

Synonyms File Missing

If text search stops working in your Squirro project, the cause may be a missing synonym analyzer and filter in the index configuration. The error is visible in topic.log, and the analyzer and filter are also absent from the ES index configuration, which can be retrieved with the following command:

curl -XGET "localhost:9200/$INDEXID" | python -m json.tool

In order to restore the search functionality, the following steps should be taken:

1. Stop any dataloading jobs
2. Stop the ingester service
3. Close the index of the affected project with the following command, replacing $INDEXID with the ES index that has the issue. For more info see https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-close.html

curl -X POST "localhost:9200/$INDEXID/_close?pretty"

4. Once the index is closed, update its settings with the curl request below. For more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html.

It is important that the following values are set correctly, e.g. as shell variables before running the request:

$INDEXID - ES index
$PROJECTID - Squirro project ID of the affected index
$SYNONYMNAME - name of the synonym file that cannot be found (shown in topic.log, e.g. 'title_body_summary')
$SYNONYMID - ID of the synonym file that cannot be found (visible in the Squirro URL under Explore Dashboard → Load → Synonyms → EDIT $SYNONYMNAME)

curl -XPUT "localhost:9200/$INDEXID/_settings" -H "Content-Type: application/json" -d @- <<EOF
{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonyms_${PROJECTID}_${SYNONYMNAME}_${SYNONYMID}": {
                        "type": "custom",
                        "tokenizer": "icu_tokenizer",
                        "filter": ["icu_folding", "icu_normalizer", "synonyms_${PROJECTID}_${SYNONYMNAME}_${SYNONYMID}"],
                        "char_filter": ["html_strip", "quotation_char_filter"]
                    }
                },
                "filter": {
                    "synonyms_${PROJECTID}_${SYNONYMNAME}_${SYNONYMID}": {
                        "type": "synonym_graph",
                        "synonyms_path": "/etc/elasticsearch/synonyms/${PROJECTID}/${SYNONYMID}.txt",
                        "updateable": true
                    }
                }
            }
        }
    }
}
EOF

5. Once the settings are updated, open the index again with the curl command below. For more information please visit https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html

curl -X POST "localhost:9200/$INDEXID/_open?pretty"

6. Resume Squirro ingester services and data loading jobs

7. Test full-text search and ensure results are returned as normal
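To confirm that the restored analyzer can be resolved, the _analyze API can run it against a sample text. A minimal sketch, assuming bash and curl; the helper names are hypothetical, and the analyzer name follows the synonyms_{project}_{name}_{id} pattern used in step 4. An error naming the analyzer suggests the settings were not applied.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking the restored synonym analyzer.

synonym_analyzer_name() {
  # $1: project ID, $2: synonym name, $3: synonym ID
  printf 'synonyms_%s_%s_%s\n' "$1" "$2" "$3"
}

es_analyze() {
  # $1: index, $2: analyzer name, $3: sample text (assumed to contain no double quotes)
  curl -s -H 'Content-Type: application/json' \
    "${ES_URL:-http://localhost:9200}/$1/_analyze?pretty" \
    -d "$(printf '{"analyzer": "%s", "text": "%s"}' "$2" "$3")"
}

# Example: the response should list tokens, including synonyms of the sample word
#   es_analyze "$INDEXID" "$(synonym_analyzer_name "$PROJECTID" "$SYNONYMNAME" "$SYNONYMID")" "some text"
```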

Install/Remove Elasticsearch plugins

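Use the following command to install an Elasticsearch plugin (elasticsearch-plugin is located in /usr/share/elasticsearch/bin/ and may not be on the PATH):

/usr/share/elasticsearch/bin/elasticsearch-plugin install <plugin name>

Use the following command to remove an Elasticsearch plugin:

/usr/share/elasticsearch/bin/elasticsearch-plugin remove <plugin name>

Restart Elasticsearch after installing or removing a plugin so that the change takes effect.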