Introduction & Basic Concepts

This document outlines how Squirro can scale from a single, small development host to a massive multi-node cluster that ingests millions of documents daily and serves thousands of users.

Basic Concepts

Before going further we should get familiar with the Squirro Architecture. This diagram gives an overview of the key components and technologies Squirro is built with.

...

Squirro is built for both cloud and on-premise deployments. To support both, we chose a service-oriented approach in which functionality is provided not by a few monolithic processes but by a fleet of small, narrowly focused services.

Info

In computing, microservices is a software architecture style in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task, facilitating a modular approach to system-building.

https://en.wikipedia.org/wiki/Microservices
 

...

See BCP - Business Continuity Planning for more details on how to run Squirro across multiple data centers.

Squirro Components

Elasticsearch

Info
titleWhat is Elasticsearch?

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search and analytics engine with a HTTP web interface and schema-free JSON documents. It is designed from the ground up to support massively distributed deployments with high-availability on commodity hardware.

https://en.wikipedia.org/wiki/Elasticsearch

Squirro uses Elasticsearch as its primary means to store and query its documents. When you ingest data into Squirro, Elasticsearch is the component that stores the bulk of the data. Unless you index binary documents, it is the only persistence component that grows with the data volume you ingest.

Thanks to Elasticsearch, Squirro can scale to massive sizes with relative ease, but there are still a few things to consider:

...

While Squirro takes away some of the main concerns, e.g. with its intelligent caching layer, the basic logic of Elasticsearch still applies: start small in the development environment, load real data, test and measure the performance under conditions that are as realistic as possible, and then adjust and scale accordingly.

Time Based Indexes

Another important aspect is that Elasticsearch queries can be run against one or multiple indexes. For the application executing the query this makes no difference. This allows for some really powerful patterns if the data can be expired after a given time or if it is queried less frequently by end users.

...

This way, less data is queried, so query performance is far better and the memory overhead of concurrent queries is far lower.
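
As a rough illustration of the time-based pattern, the sketch below works out which monthly indexes a query has to touch for a given date range. The "docs-YYYY.MM" naming scheme and the indexes_for_range helper are assumptions made for this example; they are not Squirro's actual index layout.

```python
from datetime import date


def indexes_for_range(start: date, end: date, prefix: str = "docs") -> list[str]:
    """Return the monthly index names covering start..end (inclusive).

    The "<prefix>-YYYY.MM" naming scheme is hypothetical.
    """
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        # Advance by one month, rolling over into the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# A query over the last few weeks only needs the most recent indexes.
recent = indexes_for_range(date(2015, 12, 20), date(2016, 1, 10))
print(",".join(recent))
```

Elasticsearch accepts a comma-separated list of index names as the target of a search request, so a list like this can be joined and used directly. Older indexes are simply left out of the query.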

Further reading

See this excellent document from Elasticsearch about how to design for scaling: Elasticsearch: Designing for Scale 

MySQL

Info
titleWhat is MySQL?

MySQL is an open-source relational database management system (RDBMS); in July 2013, it was the world's second most widely used RDBMS, and the most widely used open-source client–server model RDBMS. ... MySQL is a popular choice of database for use in web applications, and is a central component of the widely used LAMP open-source web application software stack (and other "AMP" stacks). 

https://en.wikipedia.org/wiki/MySQL

The MySQL RDBMS is used by Squirro to store configuration data.

No documents are stored in MySQL, hence it does not have to scale with the data volume ingested into Squirro. We have seen deployments with millions of documents in which the MySQL database was smaller than 10 MB.

The main data stored in MySQL is:

  • Users
  • Groups
  • Roles
  • User Sessions / Auth Tokens
  • Dashboards
  • Smart Filters
  • Saved Searches
  • Scheduled Tasks / Alerts

...

Scaling

To achieve high availability and scalability, the Squirro cluster service configures MySQL in a Master → Slave replication setup. Each node in the Squirro cluster can contribute a MySQL service to the cluster, and the cluster service uses Zookeeper to automatically elect a leader. All remaining nodes become followers and replicate the entire database.

Read operations are handled by all nodes; writes go to the leader and are replicated out to the followers.
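
The read/write split can be sketched as follows. This is a minimal illustration only: the ReplicatedRouter class, the node names and the round-robin read strategy are assumptions for this example, and the text above does not specify how Squirro actually distributes reads.

```python
import itertools


class ReplicatedRouter:
    """Route writes to the leader and spread reads over all nodes."""

    def __init__(self, leader: str, followers: list[str]):
        self.leader = leader
        self.nodes = [leader, *followers]
        # Round-robin over every node, leader included (an assumption).
        self._read_cycle = itertools.cycle(self.nodes)

    def node_for(self, operation: str) -> str:
        if operation == "write":
            return self.leader
        if operation == "read":
            return next(self._read_cycle)
        raise ValueError(f"unknown operation: {operation!r}")


router = ReplicatedRouter("node-1", ["node-2", "node-3"])
writes = [router.node_for("write") for _ in range(3)]
reads = [router.node_for("read") for _ in range(4)]
print("writes:", writes)
print("reads: ", reads)
```

Every write lands on the leader, which is why the number of write operations is the first scaling concern, while reads spread across the cluster as nodes are added.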

There are two main concerns when scaling:

  • Number of write operations to the Master node
  • Concurrent open connections to any of the MySQL nodes (each connection requires RAM)

In general, pressure on MySQL is low, and the caching layer and connection pooling improve its scaling capability further.

Redis

Info
titleWhat is Redis?

Redis is a data structure server. It is open-source, networked, in-memory, and stores keys with optional durability.  According to the monthly ranking by DB-Engines.com, Redis is the most popular key-value database.

https://en.wikipedia.org/wiki/Redis
 

Redis is used by Squirro for queuing, temporary data storage and caching.

Temporary Data Storage

Documents are stored in Redis only "in flight". While documents are processed by the pipeline service a copy is kept in Redis for fast access / update. Once documents exit the pipeline and are persisted in Elasticsearch they get removed from Redis.

Queuing

Redis is the primary queuing backend for the pipeline service. Only document references enter the queue, hence the size of that data is minimal. The queuing is abstracted using the Kombu library. Other backends such as in-memory, Amazon SQS, or RabbitMQ can be supported too.
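
The idea of queuing references rather than documents can be sketched with plain Python data structures. This is not the Kombu or Redis API: the dict and deque below are in-memory stand-ins, and the function names are hypothetical.

```python
from collections import deque

# In-memory stand-ins (assumptions for this sketch): a dict plays the role
# of the temporary document store and a deque plays the role of the queue.
document_store = {}
queue = deque()
persisted = {}


def enqueue(doc_id: str, document: dict) -> None:
    """Keep the full document in the store and queue only its ID."""
    document_store[doc_id] = document
    queue.append(doc_id)


def persist(doc_id: str, document: dict) -> None:
    """Stand-in for writing the processed document to the search index."""
    persisted[doc_id] = document


def process_next() -> str:
    """Take the next reference, persist the document, then drop the copy."""
    doc_id = queue.popleft()
    persist(doc_id, document_store[doc_id])
    # Once persisted, the in-flight copy is no longer needed.
    del document_store[doc_id]
    return doc_id


enqueue("doc-1", {"title": "Quarterly report", "body": "x" * 1000})
processed = process_next()
print(processed, len(queue), len(document_store))
```

The queue only ever holds short identifiers, so its size stays small regardless of how large the documents are, and the temporary store is emptied as documents leave the pipeline.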

Caching

Redis is used by Squirro to provide an intelligent caching layer, both for the queries that are executed against Elasticsearch as well as for all HTTP requests between the various Squirro services. Cache entries can be stored long term, and get evicted intelligently when changes to the underlying data or the configuration occur.
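
One common way to combine long-lived cache entries with eviction on change is to group entries by the data they depend on and drop the whole group when that data changes. The sketch below illustrates this with an in-memory dict; the QueryCache class and the "project" grouping key are assumptions for this example and do not describe Squirro's actual eviction logic.

```python
class QueryCache:
    """Cache query results and evict them when their project changes."""

    def __init__(self):
        self._entries = {}
        self.backend_calls = 0

    def get(self, project: str, query: str, run_query):
        key = (project, query)
        if key not in self._entries:
            # Cache miss: run the expensive query once and remember it.
            self.backend_calls += 1
            self._entries[key] = run_query(project, query)
        return self._entries[key]

    def invalidate(self, project: str) -> None:
        """Evict every entry of a project, e.g. after new data arrives."""
        for key in [k for k in self._entries if k[0] == project]:
            del self._entries[key]


def fake_search(project: str, query: str) -> str:
    # Stand-in for a query against the search backend.
    return f"results for {query!r} in {project}"


cache = QueryCache()
cache.get("p1", "revenue", fake_search)  # miss, hits the backend
cache.get("p1", "revenue", fake_search)  # served from the cache
cache.invalidate("p1")                   # underlying data changed
cache.get("p1", "revenue", fake_search)  # miss again after eviction
print(cache.backend_calls)
```

Because entries are only evicted when something they depend on changes, they can otherwise be kept for a long time without serving stale results.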

Scaling

To achieve high availability and scalability, the Squirro cluster service configures Redis in a Master → Slave replication setup. Each node in the Squirro cluster can contribute a Redis service to the cluster, and the cluster service uses Zookeeper to automatically elect a leader. All remaining nodes become followers and replicate the entire database.

Read operations will be handled by all nodes, writes will go to the leader and will be replicated out to the followers.

There are three main concerns when scaling:

  • Number of write operations to the Master node (but Redis is extremely fast)
  • The latency between the master node and the slaves.
  • The amount of RAM needed (Redis keeps all data in memory)

GlusterFS

Info
titleWhat is GlusterFS?

The GlusterFS architecture aggregates compute, storage, and I/O resources into a global namespace. ... Capacity is scaled by adding additional nodes or adding additional storage to each node. Performance is increased by deploying storage among more nodes. High availability is achieved by replicating data n-way between nodes.

https://en.wikipedia.org/wiki/Gluster 

Uses

  • Used to store and replicate all filesystem-based documents between all nodes. This data only scales with the data ingested into Squirro if that data contains binary documents and those documents are also served by Squirro. Otherwise the storage requirements are minimal, as the next two points show.
  • Used to store and replicate all pipeline service plugins (pipelets). Storage need is minimal.
  • Used to store and replicate Excel / CSV files uploaded using the Web UI based data loader. Storage need is minimal. Data gets deleted after ingestion is completed.

Alternatives

GlusterFS is the default and built-in option for providing scalable and clustered file storage. It is however only one of multiple options:

  • Amazon S3 (and compatible storage solutions)
  • Network Attached Storage, e.g. NFS
  • None: If you don't ingest binary documents, don't need pipelets and don't upload Excel/CSV files using the WebUI, no shared filesystem is required.

Zookeeper

Info
titleWhat is Zookeeper?

Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right. ZooKeeper's architecture supports high availability through redundant services.

https://en.wikipedia.org/wiki/Apache_ZooKeeper

Zookeeper is used by Squirro for leader election. Its design is well suited to this purpose and helps Squirro detect and handle network segmentation / split-brain events.

Scaling Considerations

  • Only the cluster state and the data needed for the leader election are stored in Zookeeper.
  • Size is minimal and doesn't increase with the data ingested into Squirro.
  • Run an odd number of Zookeeper nodes to ensure robust leader election.
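
The reason for an odd node count follows from majority-based quorum arithmetic, sketched below. The function names are chosen for this example only.

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


for n in range(1, 8):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from three to four nodes raises the quorum from two to three without tolerating any additional failure, so the extra node adds cost but no resilience. An even node count can also be split into two equal halves by a network partition, in which case neither side holds a majority.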