In this post we are going to explore SolrCloud (version 4.X) on AWS. In my previous articles on Solr (3.x) + AWS we saw a glimpse of how complex it is to partition and scale out Solr nodes. The latest version, SolrCloud, was designed from the ground up to address these architectural needs for cloud-based implementations. SolrCloud comes with a highly available, fault-tolerant cluster architecture with distributed indexing and search capabilities built in.
Components of Apache SolrCloud
SolrCloud consists of the following architectural components:
ZooKeeper: Solr uses ZooKeeper as the cluster configuration store and coordinator. ZooKeeper acts as a distributed file system containing information about all the Solr nodes, and configuration files such as solrconfig.xml and schema.xml are stored in this repository.
- Though ZooKeeper comes embedded with Solr in the default package, it is recommended to run ZooKeeper on separate EC2 instances in production setups.
- Since the ZooKeeper EC2 instances only maintain cluster information, an m1.small or m1.medium is enough for most setups. For heavily used setups, start with m1.large for the ZooKeeper tier.
- A single ZooKeeper is not ideal for a large Solr cluster because it becomes a single point of failure (SPOF). It is recommended to configure ZooKeeper as an ensemble of at least 3 EC2 nodes to start with. Every ZooKeeper server needs to know about every other ZooKeeper server in the ensemble, and a majority of the EC2 instances (called a quorum) must be up to provide service. For example, in a ZooKeeper ensemble of 3 EC2 instances, if any one of them fails, the cluster keeps serving with the remaining 2 ZK EC2 nodes (which still constitute a majority). If more than 1 ZK node may fail at a time, then a 5-node ZooKeeper ensemble is needed to tolerate the failure of up to 2 ZK EC2 instances.
- In the event the ensemble loses its majority, reads can still be performed from the Solr cluster as usual, but writes will be restricted. The ZK EC2 instances need to be brought back up so that the majority requirement is satisfied again.
- Amazon EC2 regions are further subdivided into Availability Zones, and some regions have only 2 Availability Zones. In such regions it becomes a little more complex to design the number of ZooKeeper EC2 instances for HA. Imagine a 3:2 split across the two Availability Zones of a region: if all the ZK EC2 instances become unresponsive in zone 1a, where the 3 ZK EC2 instances are deployed, the quorum loses its majority and the cluster becomes read-only.
- Since all the ZK EC2 instances communicate with each other and exchange data constantly, it is recommended to use an S3-backed AMI with ephemeral disks for the ZooKeeper tier.
- Modifying the ZooKeeper ensemble once it has started is not easy, so depending on the HA level you want, I would recommend starting the ZK ensemble with 3/5/7/11 etc. nodes. Dynamic ZK ensemble modification is currently on the Solr roadmap; the current workaround for this problem is to do a rolling restart of the ZooKeepers. (Rolling Restart article)
- If you want to minimize the recovery time objective in the event of ZK failures, I would recommend assigning Amazon Elastic IPs to the ZooKeeper nodes and enabling ZK peer discovery + ZK discovery by the Solr nodes through these Elastic IPs. In Amazon VPC, if the Solr cluster is going to reside inside a private subnet, the VPC private IPs are enough for the ZK EC2 instances. (A minimal connection sketch follows this list.)
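As a rough illustration of how clients and Solr nodes discover the cluster through the ensemble, here is a minimal SolrJ sketch. The hostnames zk1/zk2/zk3.example.com and the collection name "collection1" are assumptions for illustration; substitute your own Elastic IPs or VPC private IPs.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK ensemble addresses; the comma-separated list is the
        // standard zkHost format used by SolrCloud clients.
        String zkHost = "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181";

        CloudSolrServer server = new CloudSolrServer(zkHost);
        server.setDefaultCollection("collection1"); // assumed collection name
        server.connect(); // reads the live cluster state from the ZooKeeper ensemble

        System.out.println("Connected to SolrCloud via ZooKeeper ensemble: " + zkHost);
        server.shutdown();
    }
}
```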
Shards & Replicas: Solr indexes can be partitioned across multiple Solr EC2 nodes, called shards. The index data present in every shard can be replicated to one or more replica node(s). Configuring a multi-shard + replica architecture on Amazon EC2 is easy compared to previous versions of Apache Solr: specify the number of shards required with the "-DnumShards" parameter when starting the cluster, and as you add Solr EC2 nodes they are automatically slotted into the shard architecture. For example, starting the cluster with "-DnumShards=2" means you are building a 2-shard cluster. The first Solr EC2 instance automatically becomes "shard1" and the second Solr EC2 instance you launch is automatically attached as "shard2". What happens when you add a third and a fourth Solr EC2 instance to the cluster? Since it is a 2-shard cluster, the third Solr EC2 instance automatically becomes a replica of shard1 and the fourth becomes a replica of shard2. You now have a 4-node (2-shard, 2-replica) Solr cluster running on AWS. The first Solr EC2 node to join a shard group (for example, the node that became "shard1") is assigned the role of leader for that shard group; all subsequent Solr EC2 nodes that join the same "shard1" group become replica nodes. All writes happen on the leader and are then pushed to the replicas. If the leader node goes down, Solr fails over automatically and one of the replicas is promoted as the new leader. A small sketch for inspecting this shard layout follows the reference link below.
To know more about configuring SolrCloud on Amazon VPC, refer to this article:
http://harish11g.blogspot.in/2013/03/Configuring-installing-Apache-Solrcloud-solr4.x-on-Amazon-VPC-EC2-AWS.html
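To see how the 2-shard, 2-replica layout described above is recorded in ZooKeeper, the sketch below walks the cluster state with SolrJ's ZkStateReader and prints each shard's leader and replica count. It reuses the hypothetical zkHost and "collection1" from the earlier sketch; treat it as an illustrative example rather than a production tool.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ClusterStateExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical zkHost; replace with your ensemble's addresses.
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.connect();

        ClusterState state = server.getZkStateReader().getClusterState();
        for (Slice slice : state.getSlices("collection1")) {   // e.g. shard1, shard2
            Replica leader = slice.getLeader();
            System.out.println("Shard " + slice.getName()
                    + " leader=" + (leader != null ? leader.getName() : "none")
                    + " replicas=" + slice.getReplicas().size());
        }
        server.shutdown();
    }
}
```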
How are documents indexed in SolrCloud?
The document indexing process in SolrCloud works as follows:
Step 1: If an index request is sent to a shard leader node, the leader node first checks whether the document belongs to its own shard group. If it does, the document is written to the transaction log of that shard's EC2 node. If, on the contrary, the document belongs to another shard group, it is automatically forwarded to the leader of that shard group, which in turn writes it to its transaction log.
Step 2: The shard leader EC2 node transfers the delta of the transaction log to all its replica nodes. The replica nodes in turn update their respective transaction logs and send an acknowledgement to the leader node. Once the leader node receives acknowledgements that all the replica node(s) have received the update, the original index request is answered.
Step 3: If an index request is sent to a replica node, it is forwarded to the leader node of that shard.
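All of this routing and forwarding is transparent to the client. A minimal SolrJ sketch (assuming the hypothetical zkHost, "collection1", and field names from the earlier sketches) simply adds a document and lets SolrCloud deliver it to the right shard leader and its replicas:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                 // the id determines the target shard
        doc.addField("title", "SolrCloud on AWS");   // hypothetical field in schema.xml

        server.add(doc);    // forwarded to the correct shard leader, then to its replicas
        server.commit();    // hard commit; see the soft commit discussion below
        server.shutdown();
    }
}
```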
How is search performed in SolrCloud?
Distributed search is straightforward in SolrCloud: you need not worry about the number of shards, replicas, or ZK nodes, which one to query, or what happens when one of them is down. A query request may be sent to any of the Solr EC2 nodes (either a leader or a replica), and Solr internally load balances the requests across replicas to process the response. If the "shards.tolerant" parameter is set to "true", SolrCloud can continue to serve results without any interruption as long as at least one Solr node hosts every shard's data. This parameter returns the search results with whatever documents are available and avoids 503 error responses when a node is down.
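A small query sketch showing the "shards.tolerant" parameter mentioned above (again assuming the hypothetical zkHost, "collection1", and "title" field):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrQuery query = new SolrQuery("title:solrcloud");
        query.set("shards.tolerant", true);   // return whatever is available if a shard is down

        QueryResponse response = server.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " documents");
        server.shutdown();
    }
}
```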
Some interesting features in Apache SolrCloud
Soft Commit: A hard commit calls fsync on the index files to ensure that they have been flushed to stable storage and that no data will be lost in a power failure. It is an expensive operation because it involves disk I/O. The soft commit feature, introduced in Solr 4.X to support NRT search, is much faster because it only makes the index changes visible to searchers and does not call fsync. This is accomplished by writing the changes to the transaction log. Think of it as adding documents to an in-memory writable segment. As the size of the transaction log increases over time, it is imperative that these changes are periodically written to disk using a hard commit.
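In SolrJ the difference between the two commit types can be expressed with the three-argument commit call. A minimal sketch, with the zkHost and collection assumed as before:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-2");
        server.add(doc);

        // Soft commit: waitFlush=false, waitSearcher=false, softCommit=true.
        // The change becomes visible to searchers without an fsync of the index files.
        server.commit(false, false, true);

        // A periodic hard commit is still needed to flush changes to stable storage.
        server.commit();

        server.shutdown();
    }
}
```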
Near Real Time Search (NRT): The /get NRT handler ensures that the latest version of a document is retrieved. It is NOT designed to be used for full-text search; instead, its responsibility is the guaranteed return of the latest version of a particular document. When a /get request is made by providing an id or ids (a set of ids), the lookup works as follows:
- It checks if the document is present in the transaction log.
- If yes, the document is returned from the transaction log.
- If no, the latest opened searcher is used to retrieve the document.
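The /get lookup can be exercised from SolrJ by pointing a request at that handler. A rough sketch, assuming the same connection details and an existing document with id "doc-1":

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class RealTimeGetExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        // Route the request to the /get handler instead of the default /select.
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/get");
        query.set("id", "doc-1");   // latest version is served from the transaction log if present

        System.out.println(server.query(query));
        server.shutdown();
    }
}
```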
Atomic Updates: Atomic updates are a new feature in Solr 4.X that lets you update at the field level rather than at the document level (as in previous versions of Solr). This means you can update individual fields without having to send the entire document to Solr along with the updated field values. Internally, Solr re-adds the document to the index with the updated fields. Please note that the fields in your schema.xml must be stored=true to enable atomic updates.
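An atomic update is expressed in SolrJ by sending a field value as a map whose key is the update operation (for example "set"). A small sketch, assuming the hypothetical "doc-1" exists and has a stored "title" field:

```java
import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        server.setDefaultCollection("collection1");

        // Only the id and the field being changed are sent; the "set" operation
        // tells Solr to replace that field's value and re-index the document internally.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", Collections.singletonMap("set", "SolrCloud on AWS (updated)"));

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```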
Related Articles:
Introduction to Apache SolrCloud on AWS
Apache SolrCloud Implementation on Amazon VPC
Configuring Apache SolrCloud on Amazon VPC
Apache SolrCloud on AWS FAQ
Part 1: Comparison Analysis: Amazon CloudSearch vs Apache Solr