Tuesday, March 5, 2013

Apache SolrCloud Implementation on Amazon VPC

Recently I architected a 13+ EC2 node SolrCloud cluster with HA and scalability built in, inside an Amazon Virtual Private Cloud. Since the product being engineered had both on-premise and SaaS editions, we were unfortunately not able to use Amazon CloudSearch, as it would have introduced an AWS dependency into the product tech stack. The Solr architecture was secured using Amazon VPC, private subnets, access controls, AWS security groups, and IAM. The SolrCloud cluster was designed to hold index data in memory, with some search data overflowing to EBS disks: roughly 100 GB of distributed index data and another 200+ GB of search data on EBS. The web/app, business, and background tiers access the SolrCloud deployed in the private subnet. In this post I will share my experience, learnings, best practices, and tips on architecting SolrCloud on the AWS platform.

In the above architecture, SolrCloud is configured on Amazon Virtual Private Cloud (VPC) in the Amazon EC2 region US-EAST. Let us explore the diagram in detail.

To know more about the components of SolrCloud, please refer to my earlier post:

Zookeeper (ZK) for SolrCloud Best Practices:
  • Though ZooKeeper comes embedded with Solr in the default package, I recommend running ZooKeeper on separate EC2 instances in production setups.
  • Since the ZooKeeper EC2 nodes only maintain the cluster information, m1.small or m1.medium is enough for most setups. For heavily used setups, start with m1.large for the ZooKeeper EC2 nodes. We used m1.large for our ZK nodes.
  • A single ZooKeeper EC2 node is not ideal for production setups or large Solr clusters because it is a single point of failure (SPOF). It is recommended to configure ZooKeeper as an ensemble of at least 3 ZK EC2 nodes to start with. Every ZooKeeper EC2 node needs to know about every other ZooKeeper EC2 node in the ensemble, and a majority of ZK EC2 nodes (called a quorum) is needed to keep the service running. For example, in a ZooKeeper ensemble of 3 ZK EC2 nodes, if any one of them fails, the cluster keeps performing with the remaining 2 ZK EC2 nodes (which still constitute a majority). If more than 1 ZK node may fail, a 5-node ZooKeeper ensemble is needed to tolerate the failure of up to 2 ZK EC2 nodes at a time.
  • In the event the ensemble loses its majority, reads can still be performed from the Solr cluster as usual, but writes will be restricted. We need to bring enough ZK EC2 nodes back up for the majority to be satisfied.
  • Amazon EC2 regions are subdivided into Availability Zones, and some regions have only 2 AZs. In such regions it becomes a little complex to design the ensemble with the optimum number of ZooKeeper EC2 instances for HA. Imagine we split 5 ZK nodes 3:2 across the two AZs in such a region, and all ZK EC2 nodes become non-responsive in AZ 1a, where the 3 ZK EC2 nodes were deployed: the quorum loses its majority and the cluster becomes read-only. So there is a chance of intermittent search service degradation even though the cluster was designed with 5 ZK nodes. This cannot be avoided in such regions; in the event ZK nodes go down, we need to bring them up in the other available AZ. Luckily for us it was US-EAST, which has more AZs; a solution still needs to be cracked for other regions. More ZK nodes are always better.
  • In Amazon VPC, I would recommend 3 private subnets across 3 AZs. Deploy the ZK nodes in the 3 AZs inside these subnets, assign VPC internal IPs to them, and enable ZK peer discovery and ZK discovery by the Solr nodes through these VPC internal IPs.
  • If you want to minimize the recovery time objective in the event of ZK failures in non-VPC AWS, I would recommend assigning Amazon Elastic IPs to the ZooKeeper nodes and enabling ZK peer discovery and ZK discovery by the Solr nodes through these Elastic IPs.
  • Since all the ZK EC2 instances communicate with each other and exchange data constantly, it is recommended to use S3-backed AMIs with ephemeral disks for the ZooKeeper tier. EBS disks at this tier add unnecessary cost and in most cases are not needed.
  • Modifying a running ZooKeeper ensemble is not an easy task, so depending on the HA level needed I would recommend starting the ensemble with 3/5/7/11 etc. nodes. Dynamic ensemble membership is currently on the ZooKeeper roadmap; the present workaround is a rolling restart of the ZooKeeper nodes. Refer to the blog post on rolling restarts.
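The quorum arithmetic behind the 3- and 5-node recommendations above can be sketched in a few lines (a minimal illustration of the majority rule, not part of ZooKeeper itself):

```python
# Quorum arithmetic for a ZooKeeper ensemble: a majority of nodes
# must be up for the ensemble to keep serving writes.

def quorum_size(ensemble_size):
    """Smallest number of nodes that constitutes a majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 4, 5, 7):
    print("%d nodes -> quorum %d, tolerates %d failure(s)"
          % (n, quorum_size(n), tolerated_failures(n)))
# A 3-node ensemble tolerates 1 failure, 5 tolerates 2, 7 tolerates 3.
# A 4-node ensemble also tolerates only 1 failure, which is why odd
# ensemble sizes are the usual recommendation.
```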

SolrCloud Shards & Replicas Implementation Practices:
  • Solr ships with the Jetty engine; replace it with a stable Tomcat release for production setups.
  • Since the number of shards cannot be modified on a running SolrCloud cluster, it is advised to do proper capacity planning before designing the cluster. Changing the number of shards (ideally increasing it) is a pattern observed on many growing websites, but sadly it is still only on the Solr roadmap.
  • It is recommended to keep the shard and replica nodes in separate Availability Zones inside an Amazon EC2 region. Response latency and cross-AZ data transfer cost are the negatives, but this architecture provides better high availability.
  • There is no need to crisscross the shards and replicas between AZs; follow the deployment pattern illustrated in the diagram, which is easier to maintain in the long term.
  • SolrCloud is configured in the private subnets of the Amazon VPC, and access controls are opened for the web/app tier residing in the public/private subnets to access the SolrCloud search service.
  • Three Availability Zones, US-EAST-1a/1b/1c, are used for this setup, with 3 subnets in the respective AZs.
  • Access between the subnets residing in different AZs is opened.
  • ZooKeeper EC2 nodes are installed in all 3 Availability Zones inside their respective subnet ranges.
  • The ZooKeeper EC2 nodes are made aware of the other ZK nodes in the ensemble using the VPC IPs.
  • 3 Solr shards are deployed in US-EAST-1a; Replica Set 1A/1B/1C is deployed in US-EAST-1b and Replica Set 2A/2B/2C in US-EAST-1c.
  • Note: the 3rd shard and its replicas are not illustrated in the diagram, to keep it readable.
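As a small illustration of how each Solr node is pointed at the ensemble, the zkHost connection string (passed to Solr as -DzkHost=...) can be assembled from the ZK nodes' VPC internal IPs. The IPs below are hypothetical placeholders, and the /solr chroot is an optional convention:

```python
# Build the zkHost connection string passed to each Solr node
# (-DzkHost=...). The IPs are hypothetical VPC-internal addresses,
# one ZK node per subnet/AZ; 2181 is ZooKeeper's default client port.

def zk_host_string(ips, port=2181, chroot=None):
    hosts = ",".join("%s:%d" % (ip, port) for ip in ips)
    return hosts + (chroot or "")

zk_ips = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]  # placeholders
print(zk_host_string(zk_ips, chroot="/solr"))
# -> 10.0.1.10:2181,10.0.2.10:2181,10.0.3.10:2181/solr
```

Listing every ZK node here (rather than just one) is what keeps Solr connected when an individual ZooKeeper instance goes down, as discussed in the FAQ below.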

To know more about configuring SolrCloud on Amazon VPC, refer to this article:

High availability:
  • In the event the “Shard leader A” EC2 node becomes non-responsive in US-EAST-1a, Replica 1A or 2A in one of the other Availability Zones will automatically be elected as the new shard leader.
  • In the event the “Replica node 1A” EC2 instance becomes non-responsive in US-EAST-1b, the shard leader and the other replica node are still active to take requests.
  • In the event all the EC2 nodes in US-EAST-1a become non-responsive (an AZ-level outage), the replica sets in AZ 1b or 1c will automatically be elected as the new leaders and will take the requests.
  • Every Apache Solr node is made aware of all the ZK ensemble addresses. All cluster configuration and coordination is done by ZK. In the event of a Solr node failure, when new Solr nodes are brought into the cluster they contact ZK and take their respective roles (shard leader, replica, etc.).
  • Note: non-responsive EC2 nodes should be monitored actively and replaced with new healthy nodes in the event of failure to maintain optimum performance levels.
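The automatic promotion described above is driven by ZooKeeper's leader election. A heavily simplified sketch of the underlying idea, where each live node holds an ephemeral sequential znode and the lowest surviving sequence number wins, is below. This is an illustration only, not SolrCloud's actual election code:

```python
# Heavily simplified sketch of ZooKeeper-style leader election:
# each live participant holds an ephemeral sequential znode, and the
# node with the lowest sequence number becomes leader. When a node
# dies, its ephemeral znode disappears and the next-lowest takes over.

def elect_leader(znodes, live_nodes):
    """znodes: {node_name: sequence_number}; returns the elected leader."""
    candidates = {n: seq for n, seq in znodes.items() if n in live_nodes}
    if not candidates:
        return None  # no live participants, no leader
    return min(candidates, key=candidates.get)

znodes = {"shard-leader-A": 1, "replica-1A": 2, "replica-2A": 3}
print(elect_leader(znodes, {"shard-leader-A", "replica-1A", "replica-2A"}))
# If the leader's EC2 node becomes non-responsive, its ephemeral znode
# vanishes and the lowest remaining replica is promoted:
print(elect_leader(znodes, {"replica-1A", "replica-2A"}))
```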

  • Search requests are scaled out and can be handled by shard leaders or any of the replica nodes in the cluster.
  • For more read scalability, more replicas with better instance types can be added to the architecture.
  • Write requests are handled first by the shard leaders only; hence, for write scaling and load distribution, we have sharding in place. This way write requests are distributed between the shards.
  • Growing indexes can be scaled up/out and distributed using Solr's sharding architecture.
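The write distribution above can be pictured with a simplified hash-based router. SolrCloud actually maps a murmur hash of the document id onto per-shard hash ranges; the stable-hash-modulo scheme below is only an illustration of the same idea:

```python
import hashlib

# Simplified document-to-shard router. SolrCloud really assigns each
# shard a range of murmur hash values of the document id; taking a
# stable hash modulo the shard count illustrates the same principle:
# every document deterministically lands on exactly one shard leader.

def shard_for(doc_id, num_shards):
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

for d in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    print(d, "-> shard", shard_for(d, 3))
# Each shard leader indexes only its own slice, so write load spreads
# across the 3 shards instead of hitting a single node.
```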

  • All stats were integrated with Amazon CloudWatch as custom metrics; custom scripts were developed for this.
  • Service level monitoring was done using Nagios.
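In essence, the custom scripts scrape Solr's stats and push selected values to CloudWatch as custom metrics. A hedged sketch of the transformation step follows; the JSON shape and metric names here are assumptions for illustration, and the actual CloudWatch put-metric API call is omitted:

```python
import json

# Sketch of turning a Solr stats response into CloudWatch custom
# metric datapoints. The JSON below is an assumed, trimmed example;
# a real script would fetch stats from Solr's admin handlers and
# then publish each datapoint through the CloudWatch API or CLI.

sample_stats = json.loads("""
{"numDocs": 1250000, "avgTimePerRequest": 12.5, "requests": 48210}
""")

def to_datapoints(stats, namespace="Custom/Solr"):
    units = {"numDocs": "Count", "avgTimePerRequest": "Milliseconds",
             "requests": "Count"}
    return [{"Namespace": namespace, "MetricName": name,
             "Value": float(value), "Unit": units.get(name, "None")}
            for name, value in sorted(stats.items())]

for dp in to_datapoints(sample_stats):
    print(dp["MetricName"], dp["Value"], dp["Unit"])
```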

  • Provisioned IOPS with EBS-optimized instances were used. This was particularly useful and provided consistent performance when the disk was accessed for search data.
  • RAID 0 EBS Striping for better IO performance. (On the Implementation Road Map)
  • IO testing using Sysbench
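Sysbench was used for the actual IO testing. As a crude stand-in, a sequential-write throughput check can be sketched in a few lines; this is in no way a substitute for sysbench's fileio suite, just an illustration of what is being measured:

```python
import os
import tempfile
import time

# Crude sequential-write throughput check: writes a temp file in
# fixed-size blocks, fsyncs, and reports MB/s. Illustration only;
# sysbench's fileio tests cover random IO, read/write mixes, etc.

def write_throughput_mb_s(total_mb=16, block_kb=256):
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp()
    start = time.time()
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
    finally:
        os.remove(path)
    return total_mb / elapsed

print("sequential write: %.1f MB/s" % write_throughput_mb_s())
```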

Backups & Recovery:
  • Periodic EBS snapshots
  • Periodic Full Solr Dumps
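The full Solr dumps can be triggered through Solr's replication handler (command=backup), which snapshots the index to disk. A small sketch of building that request URL is below; the host and core name are hypothetical placeholders, and the actual HTTP call and snapshot archiving are left out:

```python
# Build the URL for Solr's replication-handler backup command.
# Host and core name are hypothetical placeholders; a periodic job
# would issue an HTTP GET against this URL on each shard leader and
# then archive the resulting snapshot directory (e.g. to S3).

def backup_url(host, core, number_to_keep=2, port=8983):
    return ("http://%s:%d/solr/%s/replication"
            "?command=backup&numberToKeep=%d"
            % (host, port, core, number_to_keep))

print(backup_url("10.0.1.21", "collection1"))
```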

  • Automated deployment of Apache Solr and the entire setup using Opscode Chef 10.x.

On Road map:
  • RAID 0 EBS Striping for better IO performance
  • TrendMicro Secure Cloud for Disk encryption. 

Though Apache SolrCloud has made lots of improvements on the HA, replication, distribution, and sharding fronts, we had some difficulties while planning for future expansion. Some of them are:

  • The number of ZK nodes has to be fixed at cluster creation time. This is not a major problem, and there are workarounds.
  • The number of shards has to be fixed at cluster creation time. This is a major problem for applications whose search data grows rapidly: at some point they will exceed the provisioned shard capacity, need to add a new shard, and re-balance the entire cluster. This is currently a complex exercise, and I hope the SolrCloud team will solve this problem soon. It is one area where Amazon CloudSearch has done excellent automation and eased things for customers: you can keep expanding nodes for scalability and HA seamlessly in Amazon CloudSearch.

Apache SolrCloud on Amazon EC2 FAQ:

  1. How are the Solr replicas assigned to shards?
    1. When a new Solr server is added, it is assigned to the shard with the fewest replicas, tie-breaking on the lowest shard number.
  2. Why should I specify all the ZooKeepers for the parameter –DzkHost?
    1. TCP connections are established between the Solr instance and the ZooKeeper servers. When a ZooKeeper server goes down, the corresponding TCP connection is lost. However, other existing TCP connections are still functional and hence this ensures fault tolerance of the SolrCloud cluster even when one or more ZooKeeper servers are down.
  3. What is a Solr transaction log?
    1. It is an append-only log of write operations maintained by each node. It records all write operations performed on an index between two commits. Anytime the indexing process is interrupted, any uncommitted updates can be replayed from the transaction log.
  4. When does the old-style replication kick in?
    1. When a Solr machine is added to the cluster as a replica, it needs to get itself synchronized with the concerned shard. If more than 100 updates are present, then an old-style master-slave replication kicks off. Otherwise, transaction log is replayed to synchronize the replica.
  5. How is load balancing performed on the Solr client side?
    1. Solr client uses LBHttpSolrServer. It is a simple round-robin implementation. Please note that this should NOT be used for indexing.
  6. What will happen if the entire ZooKeeper ensemble goes down or quorum is not maintained?
    1. ZooKeeper periodically sends the current cluster configuration information to all the SolrCloud instances. When a search request needs to be performed, the Solr instance reads the current cluster information from its local cache and executes the query. Hence, search requests need not have the ZooKeeper ensemble running. Please bear in mind that any new instances that are added to the cluster will not be visible to the other instances.
    2. However, a write index request is a bit more complicated. An index write operation results in a new Lucene segment getting added or existing Lucene segments getting merged, and this information has to be sent to ZooKeeper. Each Solr server must report to ZooKeeper which cores it has installed, and it is the responsibility of each Solr host to match the state described in the cores_N file: each Solr server must install the cores defined for it and, after a successful install, write its hosts file out to ZooKeeper. Hence, an index write operation always needs the ZooKeeper ensemble to be running.
  7. Can the ZooKeeper cluster be dynamically changed?
    1. ZooKeeper cluster is not easily changed dynamically but is part of their roadmap. A workaround is to do a Rolling Restart.
  8. Is there a way to find out which Solr instance is a leader and which one is a replica?
    1. Solr 4.0 Admin console shows these roles in a nice graphical manner.
  9. How much time does it take to add a new replica to a shard?
    1. For a shard leader index size of 500MB, an m1.medium EC2 cold-start replica instance takes about 30 seconds to synchronize its index with the leader and become part of the cluster.
  10. Is it possible to ensure that the SolrCloud cluster is HA even when an entire AZ goes out of action?
    1. When an entire AZ is down and the cluster is expected to be running successfully, the simplest and recommended approach is to have all leaders in one AZ and replicas in other AZs with the ZooKeeper ensemble spanning across AZs.
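The round-robin behaviour of LBHttpSolrServer mentioned in question 5 can be sketched as follows. This is a toy illustration, not the SolrJ implementation, which additionally tracks dead ("zombie") servers and periodically re-checks them; the endpoints are hypothetical placeholders:

```python
import itertools

# Toy illustration of LBHttpSolrServer's round-robin behaviour:
# queries simply cycle over the known Solr endpoints. The real SolrJ
# class also skips dead servers and retries them later; this sketch
# omits that. As the FAQ notes, round-robin balancing is for queries,
# not for indexing.

class RoundRobinClient:
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)

lb = RoundRobinClient(["http://solr-1:8983/solr",
                       "http://solr-2:8983/solr",
                       "http://solr-3:8983/solr"])
for _ in range(4):
    print(lb.next_endpoint())
# The 4th request wraps back around to solr-1, as a round robin should.
```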

Thanks to Vijay, Ramprasad, and Dwarak for their support during this exercise.


Unknown said...

What is your reasoning behind using Tomcat vs. Jetty in production? Jetty gets the most testing in Solr's build farm, but its OS init scripts are a little weak. Tomcat seems to have much better support from OS distros in the form of install packages.

So aside from installation, are there other issues with using Jetty in a production environment?


