Saturday, February 18, 2012

Sharding Apache Solr in AWS

Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, database integration, and geospatial search. It is highly scalable, providing distributed search and index replication and it powers the search and navigation features of many of the world's largest internet sites like, AOL, eharmony, Sears, Zappos etc.
Solr is very popular among e-commerce sites as well.
Solr version 3.X is illustrated in this article. The new version of Solr 4.X is pretty advanced on Sharding, replication , Scalability and HA. Please refer following posts for more details on the same:

Why Solr Sharding ?

Simple, Documents are indexed into Solr thereby making them searchable. The usual implementations of Solr has 1-2 EC2 instances dedicated for it. But the problem becomes Multi-fold as the size of the Solr index increases and more hits happen and dependency increases, some of the issues observed are:
(a) Solr index becomes too large to fit on a Single EC2 Server  
(b) A single query takes too long to execute and respond from a single EC2 Server
(c) Concurrency and IO limits of Single EC2 Server
One solution to effectively address the above issues are Solr Sharding.
Sharding is the process of breaking a single logical index in a horizontal fashion across records and distribute them across multiple servers. It is a common database scaling strategy when you have too much data for a single database. In Solr terms, sharding is breaking up a single Solr core across multiple Solr servers. Solr has the ability to take a single query and break it up to run over multiple Solr shards, and then aggregate the results together into a single result set.
This isn’t a completely transparent operation though. The key constraint is when indexing the documents you need to decide which Solr shard gets which documents. Solr doesn’t have any logic for distributing indexed data over shards. Every document needs a unique key (ID), because you are breaking up the index based on rows, and these rows are distinguished from each other by their document ID. We can follow simple Hash Algorithms and distribute the documents between the shards. When querying for data, you supply a shards parameter that lists which Solr shards to aggregate results from. This will aggregate the results from Shards and return you the response.

Setting up Solr in the AWS
  1. Launch an Amazon EC2 Linux instance with Sun Java installed
  2. Download Solr latest stable release
  3. Unzip the Solr release and change the working directory to be the “example” directory
    1. user@solr:/opt$ tar -xzvf apache_solr_3.3.0.tgz
      user@solr:/opt$ cd apache_solr_3.3.0/example
  4. Copy the (multi-core supported) SolrHome directory, which contains the schema and config files, into this instance under /user/home
  5. Start the Solr server by specifying the solr-home directory.
    1. user@solr:/opt/apache_solr_3.3.0/example$ java –Dsolr.solr.home=/user/home/SolrHome –jar start.jar
Your solr server is up and running. Let’s consider that the Public URL for the solr server is: Repeat the above steps again to launch a similar instance with a Public URL say (Note: The Public URLs for your solr servers will be different when you set it up on AWS. Please use them accordingly.)
We now have 2 Solr shards running in the Amazon EC2 !!
Initializing the Solr shards
Let’s use Solrj, a java client to access Solr. It offers a java interface to add, update, and query the index.
public class SolrShards {
    private final int NO_OF_SHARDS = 2;
    private CommonsHttpSolrServer[] solrServerShard = null;
    public SolrShards () {
        String[] solrShardLocations = {"", ""};
        try {
            solrServerShard = new CommonsHttpSolrServer[NO_OF_SHARDS];
            for (int i = 0; i < solrShardLocations.length; i++) {
                solrServerShard[i] = new CommonsHttpSolrServer(solrShardLocations[i]);
                ((CommonsHttpSolrServer) solrServerShard[i]).setParser(new XMLResponseParser());
        catch (MalformedURLException e) {
Indexing documents into shards
 As mentioned above, Solr doesn’t have any logic for distributing indexed data over shards.uniqueId.hashCode() % numServers  determines which server a document should be indexed at.
solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].addBean(businessEntity);
solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].commit();
Distributed Search: Searching across shards
The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it. In order to search across two shards, you would issue a search request to Solr, and specify in a shards URL parameter a comma delimited list of all of the shards to distribute the search across. You can issue the search request to any Solr instance, and the server will in turn delegate the same request to each of the Solr servers identified in the shards parameter. The server will aggregate the results and return the standard response format.
public class SolrShards {
    private CommonsHttpSolrServer getSolrServer () {
        String solrLocation = "";
        CommonsHttpSolrServer solrServer = null;
        try {
            solrServer = new CommonsHttpSolrServer(solrLocation);
            ((CommonsHttpSolrServer)solrServer).setParser(new XMLResponseParser());
        catch (MalformedURLException e)  {}
        return solrServer;
    public void QueryBusinessEntities() {
        SolrServer solrServer = getSolrServer ();
        SolrQuery query = new SolrQuery();
        query.setParam("shards", ",");
        query.addSortField("profit", SolrQuery.ORDER.asc);
        QueryResponse rsp = solrServer.query( query );
That’s it! You have set up 2 Solr shards on 2 different machines and performed distributed search successfully.
Solr Standalone v/s Solr Shards Comparison

A simple test validate the performance of Solr Standalone vs Solr Shards was conducted. M1.Large was the Solr EC2 instance type attached with EBS volume. All default configs were used in this simple test, we believe more throughput can be achieved by tuning at various levels.The size of each Solr record used is approximately 1.3KB.

Test 1: Time Taken to Index the records in Standalone Solr vs Solr Shards
Table 1: Indexing Times

Test 2: Query times  between  Standalone Solr vs Sharded Solr 
Table 2: Response Times
Test 1: Observations on Indexing/writing
There was no change in the indexing time (on an average, 1 million records were indexed in 1 hour). But the Number of records indexed was double in the same time using Solr Shards. If you need high parallel write performance in your application use case , Solr shards is an effective methodology to achieve the same. 

Test 2: Observations on Querying
As you can see from Table 2, there is no observable change in response time too though the response from sharded servers contain twice the amount of data than that of stand alone server. The queries were run in parallel in the Solr EC2 shards and results were aggregated and returned back. Sharding is the recommended approach when dealing with big data.

Points to remember while Solr Sharding in Amazon EC2
  • We used Amazon EC2 m1.large instance type for the above tests. They come with 64 bit, 2 CPU cores and 7 GB RAM. This is the minimum instance type recommended by us to run a Solr Shard in AWS for production environment. 
  • Small, Micro, High CPU-Medium Instances are 32 bit and have less memory. For Solr to perform well give it more memory and Large instance type is suitable for most use cases to start with.
  • We have also observed 2 X High memory Instance type are better than 4 X m1.large Amazon EC2 instance type in AWS for Solr Sharding. High Memory EC2 instances come with lesser CPU steal cycles, Better I/O than m1.large. More EC2 servers ~ More Maintenance Effort of IT ops
  • Have 4 X EBS disks attached in RAID0, use EBS striping and other techniques discussed in AWS forums for extracting better I/O performance in Amazon. If you have deep pockets and can altogether avoid this with more RAM based EC2 instances, High IO SSD EC2 instances etc 
  • EC2 Instances with High IO(SSD, m2.4xlarge etc) offer better bandwidth between App and SOlr EC2
  • RAID 0(4 X 1TB)+EBS optimized EC2 Instances+Provisioned IOPS is good combination for configuring Solr on Amazon EC2 (depending on affordability)
  • As much as possible keep the Web/App and SOlr in closer AZ's
  • Documents must have a unique key and the unique key must be stored (stored=”true” in schema.xml)
  • The unique key field must be unique across all shards
  • You typically only need sharding when you have millions of records of data to be searched which cannot be served from Single EC2 server effectively
  • If running a single query is fast enough, and if you are looking for capacity increase to handle more users, then use the index replication approach instead!
  • EBS backed AMI's are preferred compared to S3 Backed AMI's for launching Solr servers in Amazon EC2. EBS storage is essential for persisting Index info.
  • In case S3 backed AMI are used, keep the indexes on EBS volumes
  • if the shards are replicated for HA then;
    • Use Amazon ELB internal Load Balancing for Load balancing Solr inside VPC 
    • Use HAProxy on Public EC2 and Amazon VPC for load Balancing Solr.
    • Both the above setups are recommended only for advanced use cases and needs to evaluated and configured by a System Integrator.
  • Have a separate AWS Security group created for restricting the machines accessing the Solr Shards
  • Do not configure Solr in Amazon Auto Scaling mode.It is not recommended at all. Some people have asked me this strange question in seminars for scaling Solr Servers
  • Amazon CloudWatch alone is not effective for monitoring Solr EC2 shards. need to use combination of NewRelic, CloudWatch , Nagios etc to monitor Solr Shards   
  • Configure EBS snapshots based backup with custom scripts in AWS for regular backups of Solr Indexes. XFS is good for freezing and taking backups in RAID.
  • Use AWS Reserved Instance Pricing only after 1-2 months of stable production environment. If you feel you have over capacity please use AWS Reserved Instance Marketplace.
For more information, please refer the following links:


Steve Abramson said...

"Amazon CloudWatch is not effective for monitoring Solr EC2 shards. "

"Do not configure Solr in Amazon Auto Scaling mode."

I use CloudWatch and AutoScaling policies for another solution. I have had good success configuring/monitoring custom metrics linked to autoscaling policies via AWS Java API. Wondering why is not a good idea for Solr in AWS ?

Great article. Thanks!

Frank Kelly said...

The table 1 and table 2 images are no longer visible. Great article BTW.

Need Consulting help ?


Email *

Message *

All posts, comments, views expressed in this blog are my own and does not represent the positions or views of my past, present or future employers. The intention of this blog is to share my experience and views. Content is subject to change without any notice. While I would do my best to quote the original author or copyright owners wherever I reference them, if you find any of the content / images violating copyright, please let me know and I will act upon it immediately. Lastly, I encourage you to share the content of this blog in general with other online communities for non-commercial and educational purposes.