Introduction

Squirro can be run as a clustered application across multiple hosts to provide high availability and horizontal scaling.

While it is technically possible to scale a single Squirro cluster across multiple datacenters, it is not recommended. Both the Squirro cluster service and the Elasticsearch cluster require stable, low-latency network connections, which usually cannot be guaranteed across multiple locations.

The recommended scenario for BCP is to set up two fully independent Squirro clusters, operated in an Active-Standby configuration. All incoming data and query traffic is directed to the active cluster, and all data is replicated to the standby at a regular interval (e.g. every 5, 15, or 60 minutes).

The replication between the two clusters is done using a CLI utility provided by Squirro. We plan to fully integrate BCP replication support into the Squirro cluster service and UI in a future release of Squirro.

Overview

 

Core Concepts

Technology used

Requirements

The replication is triggered and run from a single host. This is usually the primary application server in the production environment, but it can also run from a dedicated host that is not part of either cluster.
For added resilience the script and configuration are deployed to all Squirro nodes, but actively run only on the leader node.

The Fabric framework is used to communicate with the various host roles across both datacenters. (Fabric relies on key-based SSH connections.)

The host running the replication has the following requirements:

An SSH connection to all production nodes is not mandatory. Alternatively, access on TCP port 443 (HTTPS) and TCP port 3306 (MySQL) is sufficient.
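The alternative connectivity requirement above can be probed before a replication run starts. A minimal sketch, assuming the listed ports; the hostnames in the commented example are placeholders, not actual Squirro node names:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical pre-flight check against the production nodes:
# for host in ("prod-app01", "prod-app02", "prod-app03"):
#     assert port_reachable(host, 443), f"{host}: HTTPS unreachable"
#     assert port_reachable(host, 3306), f"{host}: MySQL unreachable"
```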

 If SSH connections from the production to the BCP datacenter are not possible, the solution is to run the replication in three independent stages:

  • Stage 1: Backup Production to NFS
  • Stage 2: Replicate the NFS folder to BCP
  • Stage 3: Restore BCP from NFS

The main disadvantage of this approach is that there is no longer a single script that is aware of the success or failure of the entire replication process. On the BCP side it can also be challenging to determine whether the Prod -> BCP replication of the NFS folder has completed and the data is consistent.

Replication Workflow

These are the stages which the replication script executes:

Stage 1: Testing

Before the replication commences, Fabric is used to connect to the production cluster nodes and validate that the cluster is fully operational and reachable.


If any of these checks reveals an issue, the replication job is aborted with verbose debug output.

Stage 2: Elasticsearch Snapshot Creation
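The snapshot is taken through the official Elasticsearch snapshot API, against a filesystem repository registered on the NFS mount. A hedged sketch of the call this stage performs; the repository and snapshot names are illustrative, and the HTTP request itself (e.g. via urllib.request) is omitted:

```python
import json

ES = "http://localhost:9200"  # Elasticsearch endpoint on a cluster node

def snapshot_create_request(repo, name):
    """Build the URL and body for a snapshot-creation call.

    `repo` is assumed to be an fs-type repository whose location is the
    shared NFS mount; `wait_for_completion` makes the call synchronous.
    """
    url = f"{ES}/_snapshot/{repo}/{name}?wait_for_completion=true"
    body = json.dumps({"indices": "*", "include_global_state": True})
    return url, body
```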

Stage 3: MySQL Backup
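The MySQL stage produces a full logical dump onto the NFS mount. A minimal sketch of the command the script might run; the actual Squirro utility may differ, and the user name and path are placeholders:

```python
def mysqldump_command(user, backup_path):
    """Build a full-dump mysqldump invocation writing to the NFS mount.

    --single-transaction yields a consistent snapshot for InnoDB tables
    without locking; the password is expected via ~/.my.cnf, not argv.
    """
    return [
        "mysqldump",
        "--user", user,
        "--single-transaction",
        "--all-databases",
        f"--result-file={backup_path}",
    ]

# Executed on the production leader, e.g.:
# subprocess.run(mysqldump_command("backup", "/mnt/nfs/mysql/dump.sql"), check=True)
```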

Stage 4: Config and Assets Backup

Stage 5: NFS Replication

With all data stored on the NFS mount, the contents of the entire mount are replicated to the BCP datacenter.
This can be done using Rsync via SSH or a storage-vendor replication technology (e.g. NetApp SnapMirror).
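For the Rsync-over-SSH variant, the transfer can be sketched as follows; host and path names are placeholders:

```python
def rsync_command(source, dest_host, dest_path):
    """Build an incremental rsync-over-SSH invocation for the NFS mount.

    --archive preserves permissions and timestamps, --delete keeps the
    BCP copy an exact mirror, --partial allows resuming large transfers.
    """
    return [
        "rsync",
        "--archive",
        "--delete",
        "--partial",
        "--compress",
        f"{source}/",                 # trailing slash: copy the contents
        f"{dest_host}:{dest_path}/",
    ]

# e.g. subprocess.run(rsync_command("/mnt/nfs", "bcp-app01", "/mnt/nfs"), check=True)
```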

While the initial replication volume can be large, subsequent replication runs should be small since, with the exception of the MySQL export, all methods are incremental. The higher the replication frequency, the lower the replicated data volume per run.

Stage 6: Elasticsearch Snapshot Restore

From the BCP NFS mount, the latest Elasticsearch snapshot is restored into the ES cluster using the official snapshot module. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html. There will be a service interruption during the restore.
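The restore side of the snapshot API mirrors the creation call. A hedged sketch with illustrative names; note that indices being restored must not be open in the target cluster, so the script has to close or delete them beforehand (not shown):

```python
import json

ES = "http://localhost:9200"  # Elasticsearch endpoint on a BCP node

def snapshot_restore_request(repo, name):
    """Build the URL and body for restoring a snapshot via the snapshot API.

    `repo` is the fs-type repository on the BCP NFS mount; `name` is the
    latest snapshot found there. The HTTP POST itself is omitted.
    """
    url = f"{ES}/_snapshot/{repo}/{name}/_restore?wait_for_completion=true"
    body = json.dumps({"indices": "*", "include_global_state": True})
    return url, body
```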

Stage 7: MySQL Restore

From the BCP NFS mount, the latest MySQL backup is restored to the Squirro leader. The followers will replicate to the same state immediately.
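The restore is a plain replay of the logical dump through the mysql client on the leader; the followers then catch up via normal MySQL replication. A minimal sketch, with user name and dump path as placeholders:

```python
def mysql_restore_command(user="root"):
    """Build the mysql client invocation that replays the logical dump.

    The dump is fed on stdin, e.g. on the BCP leader:
      mysql --user root < /mnt/nfs/mysql/dump.sql   (placeholder path)
    """
    return ["mysql", "--user", user]
```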

Stage 8: Config and Assets Restore

From the BCP NFS mount, the contents of the cluster filesystem are synced to the Squirro cluster leader. Optional: if both clusters are set up identically, the configuration files under /etc/squirro are also synced.

Stage 9: Flushing Redis Cache
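Flushing the cache can be sketched with redis-cli; this assumes Redis holds only cache and session data on these nodes (persistent state lives in MySQL and Elasticsearch), and host/port are placeholders:

```python
def redis_flush_command(host="localhost", port=6379):
    """Build a redis-cli invocation that clears all cached data on a node.

    FLUSHALL removes every key; stale cache entries would otherwise
    reference the pre-restore state of the cluster.
    """
    return ["redis-cli", "-h", host, "-p", str(port), "FLUSHALL"]

# e.g. subprocess.run(redis_flush_command(), check=True) on each BCP node
```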

Stage 10: Restart Squirro

Stage 11: Testing II

Changing the Replication Direction

The same mechanism is used to replicate from BCP to production. The best-practice approach is to set up and test this scenario, but not to execute the script automatically (e.g. via cron).

Once BCP becomes active, the replication cron job on production is stopped and the script on BCP is enabled.

For maximum safety we recommend separating Production -> BCP and BCP -> Production replication into dedicated folders on the NFS mount. This way an accidental reversal of the replication direction cannot lead to unwanted and permanent data loss.

Reduced number of nodes in BCP

The ideal configuration is to run production and BCP with identical setups. This way the user experience does not degrade when a failover to BCP occurs.
It is, however, possible to run a reduced setup in BCP, e.g. a single node instead of three.

Note that you should never run an even number of Squirro application and Elasticsearch nodes, since both systems benefit from the ability to form quorums to detect and handle network-partition events.

Backup the NFS mount

It is highly recommended that the NFS mount is regularly backed up or protected by a vendor-specific snapshotting technology.
The NFS mount can easily be used to restore previous cluster states and is ideal for disaster recovery.

Known Limitations

Session reset during failover

If a user logs into production and then moves (via the LB or GLB) to the replicated BCP installation, the user will be logged out.
This is unavoidable, as the user session stored in the production cluster's MySQL server is most likely not (yet) replicated to BCP.

The issue is minor if SSO integration is active, since the user will be logged in again automatically.