Cloud, Big Data and Mobile: Sharding Apache Solr in AWS

Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, database integration, and geospatial search. It is highly scalable, providing distributed search and index replication and it powers the search and navigation features of many of the world's largest internet sites like buy.com, AOL, eharmony, Sears, Zappos etc.
Solr is very popular among e-commerce sites as well.
Solr version 3.X is illustrated in this article. The new version of Solr 4.X is pretty advanced on Sharding, replication , Scalability and HA. Please refer following posts for more details on the same:

Introduction to Apache SolrCloud on AWS

Apache SolrCloud Implementation on Amazon VPC

Why Solr Sharding ?

Simple, Documents are indexed into Solr thereby making them searchable. The usual implementations of Solr has 1-2 EC2 instances dedicated for it. But the problem becomes Multi-fold as the size of the Solr index increases and more hits happen and dependency increases, some of the issues observed are:
(a) Solr index becomes too large to fit on a Single EC2 Server
(b) A single query takes too long to execute and respond from a single EC2 Server
(c) Concurrency and IO limits of Single EC2 Server

One solution to effectively address the above issues are Solr Sharding.

Sharding is the process of breaking a single logical index in a horizontal fashion across records and distribute them across multiple servers. It is a common database scaling strategy when you have too much data for a single database. In Solr terms, sharding is breaking up a single Solr core across multiple Solr servers. Solr has the ability to take a single query and break it up to run over multiple Solr shards, and then aggregate the results together into a single result set.

This isn’t a completely transparent operation though. The key constraint is when indexing the documents you need to decide which Solr shard gets which documents. Solr doesn’t have any logic for distributing indexed data over shards. Every document needs a unique key (ID), because you are breaking up the index based on rows, and these rows are distinguished from each other by their document ID. We can follow simple Hash Algorithms and distribute the documents between the shards. When querying for data, you supply a shards parameter that lists which Solr shards to aggregate results from. This will aggregate the results from Shards and return you the response.

Setting up Solr in the AWS

Launch an Amazon EC2 Linux instance with Sun Java installed
Download Solr latest stable release
Unzip the Solr release and change the working directory to be the “example” directory

user@solr:/opt$ tar -xzvf apache_solr_3.3.0.tgz
user@solr:/opt$ cd apache_solr_3.3.0/example

Copy the (multi-core supported) SolrHome directory, which contains the schema and config files, into this instance under /user/home
Start the Solr server by specifying the solr-home directory.

user@solr:/opt/apache_solr_3.3.0/example$ java –Dsolr.solr.home=/user/home/SolrHome –jar start.jar

Your solr server is up and running. Let’s consider that the Public URL for the solr server is: http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities. Repeat the above steps again to launch a similar instance with a Public URL say http://ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities. (Note: The Public URLs for your solr servers will be different when you set it up on AWS. Please use them accordingly.)

We now have 2 Solr shards running in the Amazon EC2 !!

Initializing the Solr shards

Let’s use Solrj, a java client to access Solr. It offers a java interface to add, update, and query the index.

public class SolrShards {
    private final int NO_OF_SHARDS = 2;
    private CommonsHttpSolrServer[] solrServerShard = null;
 
    public SolrShards () {
        String[] solrShardLocations = {"http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities", "http://ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities"};
        try {
            solrServerShard = new CommonsHttpSolrServer[NO_OF_SHARDS];
            for (int i = 0; i &lt; solrShardLocations.length; i++) {
                solrServerShard[i] = new CommonsHttpSolrServer(solrShardLocations[i]);
                ((CommonsHttpSolrServer) solrServerShard[i]).setParser(new XMLResponseParser());
            }
        }
        catch (MalformedURLException e) {
        }
    }
}

Indexing documents into shards

As mentioned above, Solr doesn’t have any logic for distributing indexed data over shards.uniqueId.hashCode() % numServers determines which server a document should be indexed at.

solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].addBean(businessEntity);
solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].commit();

Distributed Search: Searching across shards

The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it. In order to search across two shards, you would issue a search request to Solr, and specify in a shards URL parameter a comma delimited list of all of the shards to distribute the search across. You can issue the search request to any Solr instance, and the server will in turn delegate the same request to each of the Solr servers identified in the shards parameter. The server will aggregate the results and return the standard response format.

public class SolrShards {
    private CommonsHttpSolrServer getSolrServer () {
        String solrLocation = "http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities";
        CommonsHttpSolrServer solrServer = null;
        try {
            solrServer = new CommonsHttpSolrServer(solrLocation);
            ((CommonsHttpSolrServer)solrServer).setParser(new XMLResponseParser());
        }
        catch (MalformedURLException e)  {}
        return solrServer;
    }
 
    public void QueryBusinessEntities() {
        SolrServer solrServer = getSolrServer ();
        SolrQuery query = new SolrQuery();
        query.setParam("shards", "ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities, ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities");
        query.setQuery("address:India");
        query.addSortField("profit", SolrQuery.ORDER.asc);
        QueryResponse rsp = solrServer.query( query );
    }
}

That’s it! You have set up 2 Solr shards on 2 different machines and performed distributed search successfully.

Solr Standalone v/s Solr Shards Comparison

A simple test validate the performance of Solr Standalone vs Solr Shards was conducted. M1.Large was the Solr EC2 instance type attached with EBS volume. All default configs were used in this simple test, we believe more throughput can be achieved by tuning at various levels.The size of each Solr record used is approximately 1.3KB.

Test 1: Time Taken to Index the records in Standalone Solr vs Solr Shards

Table 1: Indexing Times

Test 2: Query times between Standalone Solr vs Sharded Solr

Table 2: Response Times

Test 1: Observations on Indexing/writing
There was no change in the indexing time (on an average, 1 million records were indexed in 1 hour). But the Number of records indexed was double in the same time using Solr Shards. If you need high parallel write performance in your application use case , Solr shards is an effective methodology to achieve the same.

Test 2: Observations on Querying
As you can see from Table 2, there is no observable change in response time too though the response from sharded servers contain twice the amount of data than that of stand alone server. The queries were run in parallel in the Solr EC2 shards and results were aggregated and returned back. Sharding is the recommended approach when dealing with big data.

Points to remember while Solr Sharding in Amazon EC2

We used Amazon EC2 m1.large instance type for the above tests. They come with 64 bit, 2 CPU cores and 7 GB RAM. This is the minimum instance type recommended by us to run a Solr Shard in AWS for production environment.
Small, Micro, High CPU-Medium Instances are 32 bit and have less memory. For Solr to perform well give it more memory and Large instance type is suitable for most use cases to start with.
We have also observed 2 X High memory Instance type are better than 4 X m1.large Amazon EC2 instance type in AWS for Solr Sharding. High Memory EC2 instances come with lesser CPU steal cycles, Better I/O than m1.large. More EC2 servers ~ More Maintenance Effort of IT ops
Have 4 X EBS disks attached in RAID0, use EBS striping and other techniques discussed in AWS forums for extracting better I/O performance in Amazon. If you have deep pockets and can altogether avoid this with more RAM based EC2 instances, High IO SSD EC2 instances etc
EC2 Instances with High IO(SSD, m2.4xlarge etc) offer better bandwidth between App and SOlr EC2
RAID 0(4 X 1TB)+EBS optimized EC2 Instances+Provisioned IOPS is good combination for configuring Solr on Amazon EC2 (depending on affordability)
As much as possible keep the Web/App and SOlr in closer AZ's

Documents must have a unique key and the unique key must be stored (stored=”true” in schema.xml)
The unique key field must be unique across all shards
You typically only need sharding when you have millions of records of data to be searched which cannot be served from Single EC2 server effectively
If running a single query is fast enough, and if you are looking for capacity increase to handle more users, then use the index replication approach instead!

EBS backed AMI's are preferred compared to S3 Backed AMI's for launching Solr servers in Amazon EC2. EBS storage is essential for persisting Index info.
In case S3 backed AMI are used, keep the indexes on EBS volumes
if the shards are replicated for HA then;

Use Amazon ELB internal Load Balancing for Load balancing Solr inside VPC
Use HAProxy on Public EC2 and Amazon VPC for load Balancing Solr.
Both the above setups are recommended only for advanced use cases and needs to evaluated and configured by a System Integrator.

Have a separate AWS Security group created for restricting the machines accessing the Solr Shards
Do not configure Solr in Amazon Auto Scaling mode.It is not recommended at all. Some people have asked me this strange question in seminars for scaling Solr Servers

Amazon CloudWatch alone is not effective for monitoring Solr EC2 shards. need to use combination of NewRelic, CloudWatch , Nagios etc to monitor Solr Shards
Configure EBS snapshots based backup with custom scripts in AWS for regular backups of Solr Indexes. XFS is good for freezing and taking backups in RAID.
Use AWS Reserved Instance Pricing only after 1-2 months of stable production environment. If you feel you have over capacity please use AWS Reserved Instance Marketplace.

For more information, please refer the following links:

Article was co authored by Vijay Olety.

Related Articles

Introduction to Apache SolrCloud on AWS
Apache SolrCloud Implementation on Amazon VPC
Part 1: Comparison Analysis: Amazon CloudSearch vs Apache Solr

Part 2: Comparison Analysis: Amazon CloudSearch vs Apache Solr

Part 3: Comparison Analysis: Amazon CloudSearch vs Apache Solr

Part 4: Economics behind choosing Amazon CloudSearch vs Apache Solr
Part 5: Comparison Analysis: Amazon CloudSearch vs Apache Solr

2 comments:

Unknown said...: "Amazon CloudWatch is not effective for monitoring Solr EC2 shards. "

"Do not configure Solr in Amazon Auto Scaling mode."

I use CloudWatch and AutoScaling policies for another solution. I have had good success configuring/monitoring custom metrics linked to autoscaling policies via AWS Java API. Wondering why is not a good idea for Solr in AWS ?

Great article. Thanks!; July 16, 2012 at 7:47 AM
Frank Kelly said...: The table 1 and table 2 images are no longer visible. Great article BTW.; November 9, 2015 at 12:55 PM

Cloud, Big Data and Mobile

Pages

Saturday, February 18, 2012

Sharding Apache Solr in AWS

2 comments:

Need Consulting help ?

Followers

My Presentations / Webinars / Conferences

Popular Posts - All Time

My Articles

SlideShares