Tuesday, March 5, 2013

Apache Solr Master-Slave Replication & Mitigation Strategies

Solr is the open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.
To learn more about Apache Solr's features in detail, and how it compares with other popular cloud search engines, refer to this article: http://harish11g.blogspot.in/2013/01/amazon-cloudsearch-vs-apache-solr_16.html
A single Solr server can handle only so many requests before performance degrades. In such not-so-uncommon scenarios, it is best to set up a Solr master-slave cluster so that the load can be balanced effectively. The master typically handles index updates, while the slaves poll the master for those updates and serve the ever-increasing search requests.
This article explains index replication, which works over HTTP, and how to set it up using Solr 3.6. Let's get started!
Note: Solr 4.x has since been released, with improved replication, sharding and high availability. To understand more about SolrCloud on 4.x, see the link under Mitigation Plan 3 at the end of this article.

Index Replication
A master-slave setup includes both index replication and (optionally) replication of configuration files. Index replication, as the phrase indicates, is the replication of the Lucene index from the master to the slaves. The slaves poll the master for updates, and the master sends a delta of the index so that everyone stays in sync.
Setting up master-slave replication
Open the file solrconfig.xml and add the following -

<requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
      <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a valid value for replicateAfter. -->
      <str name="enable">${enable.master:false}</str>
      <str name="replicateAfter">startup</str>
      <str name="replicateAfter">commit</str>
      <str name="commitReserveDuration">00:00:10</str>
   </lst>
   <lst name="slave">
      <str name="enable">${enable.slave:false}</str>
      <str name="masterUrl">http://master_server_ip:solr_port/solr/replication</str>
      <str name="pollInterval">00:00:20</str>
   </lst>
</requestHandler>

Note that ${enable.master:false} and ${enable.slave:false} default to false, indicating that this machine is currently set up as neither a master nor a slave. These settings HAVE to be overridden by specifying the values in the file solrcore.properties, which is located under the conf directory of each core's instance directory.
On the master server, open the file solrcore.properties and add the following -
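enable.master=true
enable.slave=false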
On the slave server, open the file solrcore.properties and add the following -
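enable.master=false
enable.slave=true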
Fire up these machines and you have a master-slave Solr cluster ready!
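To verify that replication is actually working, you can query the ReplicationHandler's HTTP API directly. Below is a minimal Python sketch; the host names and port are placeholders, and the indexversion command and JSON field names are as documented for Solr 3.x, so verify against your version -

import json
from urllib.request import urlopen

def index_version(host):
    # Ask the ReplicationHandler for this node's current index version.
    url = "http://%s/solr/replication?command=indexversion&wt=json" % host
    with urlopen(url) as resp:
        data = json.load(resp)
    return data["indexversion"], data["generation"]

print("master:", index_version("master_server_ip:8983"))
print("slave :", index_version("slave_server_ip:8983"))

Once a slave has replicated successfully, both nodes should report the same index version and generation.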
A master may be able to serve only so many slaves without adversely affecting performance. Some organizations have deployed slave servers across multiple data centers. If each slave downloads the index from a remote data center, the resulting transfers may consume too much network bandwidth. To avoid performance degradation in cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both a master and a slave (enable.master=true and enable.slave=true).
Note: Be sure to set replicateAfter 'commit' on the repeater even if replicateAfter is set to 'optimize' on the main master. This is because on a repeater (or any slave), only a commit is called after the index is downloaded; optimize is never called on slaves.
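For example, a repeater's solrcore.properties enables both roles -

enable.master=true
enable.slave=true

The repeater's masterUrl (in the slave section of its solrconfig.xml) points at the main master, while the slaves in the remote data center point their masterUrl at the repeater.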
As in our lives, nothing is certain in the lives of machines either! Any machine can go down at any time, and there is nothing we can do about it except plan for such inevitable cases and have a mitigation strategy in place.
Mitigation Strategies when the Master is Down
Since master-slave replication is pull-based, there are always some inconsistencies between the indices of the master and the slaves. When some loss of updates is acceptable -

Mitigation Plan 1: Every machine is either a master or a slave and not BOTH
1. Nominate one of the slaves as the new master.
2. Stop the Solr server on the new master.
3. Edit its solrcore.properties to promote it to master (enable.master=true, enable.slave=false).
4. Start the Solr server on the new master.
5. Detach the EIP from the failed master and associate it with the newly nominated master (a sketch follows this list).
6. That's it!
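Step 5 can be automated. A rough sketch using the boto EC2 API follows; the region, Elastic IP and instance ID are hypothetical placeholders -

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
elastic_ip = "203.0.113.10"   # hypothetical EIP your clients and slaves point at
new_master = "i-0a1b2c3d"     # hypothetical instance ID of the nominated slave

conn.disassociate_address(elastic_ip)  # detach from the failed master
conn.associate_address(instance_id=new_master, public_ip=elastic_ip)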

Mitigation Plan 2: Every machine is both master and slave (Concept of Repeater)
1. Nominate one of the instances as the new master.
2. Detach the EIP from the failed master and associate it with the newly nominated master.
3. That’s it!
In each of the mitigation plans, the first step is to nominate a slave. The obvious question arises: how do we decide which slave is the best fit?
We have to choose the slave whose index is closest to the master's. To carry out this operation, use the LukeRequestHandler (enabled by default) and query the version parameter, which shows the timestamp in milliseconds of the last index operation. Pick the slave as follows (a sketch of this logic appears after the list) -
1. Retrieve the version attribute of the master from S3. (Aside: since the master is down, there is no way to query its version now. Hence, you have to query and store the master's version in S3 periodically while the master is running!)
2. Query the version on all Solr slaves.
3. Among the slaves, pick the one with the highest version. That is the best nomination.
4. As a double check, verify that the nominated slave's version is closest or equal to that of the master. (Replicating a master index means copying the index as-is from the master to the slaves. That is why lastModified and version are the same on a slave once replication succeeds, and why a slave's version can never be greater than the master's.)
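A minimal sketch of this nomination logic in Python, assuming the master's version is periodically written to an S3 key; the bucket and key names, hosts and core path are hypothetical placeholders -

import json
from urllib.request import urlopen

import boto

def luke_version(host):
    # Index 'version' reported by the LukeRequestHandler on one node.
    url = "http://%s/solr/admin/luke?show=index&wt=json" % host
    with urlopen(url) as resp:
        return json.load(resp)["index"]["version"]

# Last known master version, stored in S3 while the master was still up.
bucket = boto.connect_s3().get_bucket("my-solr-metadata")   # hypothetical bucket
master_version = int(bucket.get_key("master_version").get_contents_as_string())

slaves = ["slave1:8983", "slave2:8983", "slave3:8983"]      # hypothetical hosts
best = max(slaves, key=luke_version)

# Double check: a slave's version can never exceed the master's.
assert luke_version(best) <= master_version
print("Nominate %s as the new master" % best)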
However, in production environments any loss of updates is unacceptable, and more robust mitigation plans need to be in place.
Mitigation Plan 1
1. Detach the EBS volume holding the index from the master and attach it to a slave in the same AZ, as sketched below. (EBS volumes can only be attached to instances within their own availability zone.)
2. Reassociate the EIP from the master to that slave.
3. That's it!
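Step 1 can likewise be scripted with boto; the volume ID, instance ID and device name below are hypothetical placeholders -

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
index_volume = "vol-0abc1234"   # hypothetical EBS volume holding the Solr index
slave = "i-0d4e5f6a"            # hypothetical slave in the SAME availability zone

conn.detach_volume(index_volume, force=True)          # detach from the dead master
conn.attach_volume(index_volume, slave, "/dev/sdf")   # attach to the slave

After attaching, mount the device on the slave and restart Solr pointing at that index.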
Mitigation Plan 2
1. Use GlusterFS as the network file system; the index is automatically replicated across AZs and regions.
2. Reassociate the EIP with the secondary master.
3. That’s it!

Mitigation Plan 3
1. Use the SolrCloud feature of Solr 4.0! To learn more about SolrCloud deployment strategies, check: http://harish11g.blogspot.com/2013/03/Apache-Solr-cloud-on-Amazon-EC2-AWS-VPC-implementation-deployment.html

The original article was authored by Vijay. He can be reached at in.linkedin.com/in/vijayolety/
