In this post we are going to explore SolrCloud (version 4.X) on AWS. In my previous articles on Solr (3.x) + AWS we saw a glimpse of how complex it is to partition and scale out Solr nodes. The latest version, SolrCloud, was designed from the ground up to address these architectural needs for cloud-based implementations. SolrCloud comes with a highly available, fault-tolerant cluster architecture with distributed indexing and search capabilities built in.
Components of Apache SolrCloud
SolrCloud consists of the following architectural components:
ZooKeeper: Solr uses ZooKeeper as the cluster configuration store and coordinator. ZooKeeper acts as a distributed file system containing information about all the Solr nodes, and configuration files such as solrconfig.xml and schema.xml are stored in this repository.
- Though ZooKeeper comes embedded with Solr in the default package, it is recommended to run ZooKeeper on separate EC2 instances in production setups.
- Since the ZooKeeper EC2 instances only maintain cluster information, an m1.small or m1.medium is enough for most setups. For heavily used setups, start with m1.large for the ZooKeeper tier.
- A single ZooKeeper is not ideal for a large Solr cluster because it becomes a single point of failure (SPOF). It is recommended to configure ZooKeeper as an ensemble of at least 3 EC2 nodes to start with. Every ZooKeeper server needs to know about every other ZooKeeper server in the ensemble, and a majority of the EC2 instances (called a quorum) must be up to provide service. For example, in a ZooKeeper ensemble of 3 EC2 instances, if any one of them fails, the cluster keeps serving with the remaining 2 ZK EC2 nodes (which still constitute a majority). If more than 1 ZK node may fail at a time, then a 5-node ZooKeeper ensemble is needed to tolerate the failure of up to 2 ZK EC2 instances.
- In the event the ensemble loses its majority, reads can still be performed from the Solr cluster as usual, but writes will be restricted. The ZK EC2 instances need to be brought back up so that the majority requirement is satisfied again.
- Amazon EC2 regions are further subdivided into Availability Zones, and some regions have only 2 Availability Zones. In such regions it becomes a little more complex to design the number of ZooKeeper EC2 instances for HA. Imagine a 3:2 split across the two Availability Zones of a region: if all the ZK EC2 instances become unresponsive in zone 1a, where the 3 ZK EC2 instances are deployed, the quorum loses its majority and the cluster becomes read-only.
- Since all the ZK EC2 instances communicate with each other and exchange data constantly, it is recommended to use an S3-backed AMI with ephemeral disks for the ZooKeeper tier.
- Modifying the ZooKeeper ensemble once it has started is not easy, so depending on the HA level you want, I would recommend starting the ZK ensemble with 3/5/7/11 etc. nodes. Dynamic ZK ensemble modification is currently on the Solr roadmap; the current workaround for this problem is to do a rolling restart of the ZooKeepers. (Rolling Restart article)
- If you want to minimize the recovery time objective in the event of ZK failures, I would recommend assigning Amazon Elastic IPs to the ZooKeeper nodes and enabling ZK peer discovery + ZK discovery by the Solr nodes through these Elastic IPs. In Amazon VPC, if the Solr cluster is going to reside inside a private subnet, the VPC private IPs are enough for the ZK EC2 instances. (A minimal connection sketch follows this list.)
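As a rough illustration of how clients and Solr nodes discover the cluster through the ensemble, here is a minimal SolrJ sketch. The hostnames zk1/zk2/zk3.example.com and the collection name "collection1" are assumptions for illustration; substitute your own Elastic IPs or VPC private IPs.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK ensemble addresses; the comma-separated list is the
        // standard zkHost format used by SolrCloud clients.
        String zkHost = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";

        CloudSolrServer server = new CloudSolrServer(zkHost);
        server.setDefaultCollection("collection1"); // assumed collection name
        server.connect(); // reads the live cluster state from the ZooKeeper ensemble

        System.out.println("Connected to SolrCloud via ZooKeeper ensemble: " + zkHost);
        server.shutdown();
    }
}
```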
Shards & Replicas: Solr indexes can be partitioned across multiple Solr EC2 nodes, called shards. The index data present in every shard can be replicated to one or more replica node(s). Configuring a multi-shard + replica architecture on Amazon EC2 is easy compared to previous versions of Apache Solr: specify the number of shards required with the "-DnumShards" parameter when starting the cluster, and as you add Solr EC2 nodes they are automatically slotted into the shard architecture. For example, starting the cluster with "-DnumShards=2" means you are building a 2-shard cluster. The first Solr EC2 instance automatically becomes "shard1" and the second Solr EC2 instance you launch is automatically attached as "shard2". What happens when you add a third and a fourth Solr EC2 instance to the cluster? Since it is a 2-shard cluster, the third Solr EC2 instance automatically becomes a replica of shard1 and the fourth becomes a replica of shard2. You now have a 4-node (2-shard, 2-replica) Solr cluster running on AWS. The first Solr EC2 node to join a shard group (for example, the node that became "shard1") is assigned the role of leader for that shard group; all subsequent Solr EC2 nodes that join the same "shard1" group become replica nodes. All writes happen on the leader and are then pushed to the replicas. If the leader node goes down, Solr fails over automatically and one of the replicas is promoted as the new leader. A small sketch for inspecting this shard layout follows the reference link below.
To know more about configuring SolrCloud on Amazon VPC, refer to this article:
http://harish11g.blogspot.in/2013/03/Configuring-installing-Apache-Solrcloud-solr4.x-on-Amazon-VPC-EC2-AWS.html
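To see how the 2-shard, 2-replica layout described above is recorded in ZooKeeper, the sketch below walks the cluster state with SolrJ's ZkStateReader and prints each shard's leader and replica count. It reuses the hypothetical zkHost and "collection1" from the earlier sketch; treat it as an illustrative example rather than a production tool.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ClusterStateExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical zkHost; replace with your ensemble's addresses.
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.connect();

        ClusterState state = server.getZkStateReader().getClusterState();
        for (Slice slice : state.getSlices("collection1")) {   // e.g. shard1, shard2
            Replica leader = slice.getLeader();
            System.out.println("Shard " + slice.getName()
                    + " leader=" + (leader != null ? leader.getName() : "none")
                    + " replicas=" + slice.getReplicas().size());
        }
        server.shutdown();
    }
}
```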
How are documents indexed in SolrCloud?
The document indexing process in SolrCloud works as follows:
Step 1: If an index request is sent to a shard leader node, the leader node first checks whether the document belongs to its own shard group. If it does, the document is written to the transaction log of that shard's EC2 node. If, on the contrary, the document belongs to another shard group, it is automatically forwarded to the leader of that shard group, which in turn writes it to its transaction log.
Step 2: The shard leader EC2 node transfers the delta of the transaction log to all its replica nodes. The replica nodes in turn update their respective transaction logs and send an acknowledgement to the leader node. Once the leader node receives acknowledgements that all the replica node(s) have received the update, the original index request is answered.
Step 3: If an index request is sent to a replica node, it is forwarded to the leader node of that shard.
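All of this routing and forwarding is transparent to the client. A minimal SolrJ sketch (assuming the hypothetical zkHost, "collection1", and field names from the earlier sketches) simply adds a document and lets SolrCloud deliver it to the right shard leader and its replicas:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                 // the id determines the target shard
        doc.addField("title", "SolrCloud on AWS");   // hypothetical field in schema.xml

        server.add(doc);    // forwarded to the correct shard leader, then to its replicas
        server.commit();    // hard commit; see the soft commit discussion below
        server.shutdown();
    }
}
```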
How is search performed in SolrCloud?
Distributed search is straightforward in SolrCloud: you need not worry about the number of shards, replicas, or ZK nodes, which one to query, or what happens when one of them is down. A query request may be sent to any of the Solr EC2 nodes (either a leader or a replica), and Solr internally load balances the requests across replicas to process the response. If the "shards.tolerant" parameter is set to "true", SolrCloud can continue to serve results without any interruption as long as at least one Solr node hosts every shard's data. This parameter returns the search results with whatever documents are available and avoids 503 error responses when a node is down.
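A small query sketch showing the "shards.tolerant" parameter mentioned above (again assuming the hypothetical zkHost, "collection1", and "title" field):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrQuery query = new SolrQuery("title:solrcloud");
        query.set("shards.tolerant", true);   // return whatever is available if a shard is down

        QueryResponse response = server.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " documents");
        server.shutdown();
    }
}
```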
Some interesting features in Apache SolrCloud
Soft Commit: A hard commit calls fsync on the index files to ensure that they have been flushed to stable storage and that no data will be lost in a power failure. It is an expensive operation because it involves disk I/O. The soft commit feature, introduced in Solr 4.X to support NRT search, is much faster because it only makes the index changes visible to searchers and does not call fsync. This is accomplished by writing the changes to the transaction log. Think of it as adding documents to an in-memory writable segment. As the size of the transaction log increases over time, it is imperative that these changes are periodically written to disk using a hard commit.
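In SolrJ the difference between the two commit types can be expressed with the three-argument commit call. A minimal sketch, with the zkHost and collection assumed as before:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-2");
        server.add(doc);

        // Soft commit: waitFlush=false, waitSearcher=false, softCommit=true.
        // The change becomes visible to searchers without an fsync of the index files.
        server.commit(false, false, true);

        // A periodic hard commit is still needed to flush changes to stable storage.
        server.commit();

        server.shutdown();
    }
}
```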
Near Real Time Search (NRT): The /get NRT handler ensures that the latest version of a document is retrieved. It is NOT designed to be used for full-text search; instead, its responsibility is the guaranteed return of the latest version of a particular document. When a /get request is made by providing an id or ids (a set of ids), the lookup works as follows:
- It checks if the document is present in the transaction log.
- If yes, the document is returned from the transaction log.
- If no, the latest opened searcher is used to retrieve the document.
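The /get lookup can be exercised from SolrJ by pointing a request at that handler. A rough sketch, assuming the same connection details and an existing document with id "doc-1":

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class RealTimeGetExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        // Route the request to the /get handler instead of the default /select.
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/get");
        query.set("id", "doc-1");   // latest version is served from the transaction log if present

        System.out.println(server.query(query));
        server.shutdown();
    }
}
```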
Atomic Updates: Atomic updates are a new feature in Solr 4.X that lets you update at the field level rather than at the document level (as in previous versions of Solr). This means you can update individual fields without having to send the entire document to Solr along with the updated field values. Internally, Solr re-adds the document to the index with the updated fields. Please note that the fields in your schema.xml must be stored=true to enable atomic updates.
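An atomic update is expressed in SolrJ by sending a field value as a map whose key is the update operation (for example "set"). A small sketch, assuming the hypothetical "doc-1" exists and has a stored "title" field:

```java
import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        // Only the id and the field being changed are sent; the "set" operation
        // tells Solr to replace that field's value and re-index the document internally.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", Collections.singletonMap("set", "SolrCloud on AWS (updated)"));

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```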
Related Articles:
Introduction to Apache SolrCloud on AWS
Apache SolrCloud Implementation on Amazon VPC
Configuring Apache SolrCloud on Amazon VPC
Apache SolrCloud on AWS FAQ
Part 1: Comparison Analysis: Amazon CloudSearch vs Apache Solr