Cloud, Big Data and Mobile: Part 3: Comparison Analysis: Amazon CloudSearch vs Apache Solr

Comparison analysis of Apache Solr vs Amazon CloudSearch, continued from Part 2.

Scaling the Search Tier:

Scaling up or out based on CPU load is an important architectural design consideration for high-volume websites. Scale-up is the process of migrating from a smaller instance to a larger one as load increases, whereas scale-out is the process of spawning multiple instances to share the load. Both are equally important, and the choice should be based on cost and the needs of the use case.

Scaling out Apache Solr instances is a manual and complex process. When search traffic grows beyond what a particular server can handle and starts affecting its performance, you have to manually spawn new Solr EC2 instances, transfer or partition the index, auto-warm the caches, and re-route or distribute the search queries to the new Solr EC2 instances. It requires a Solr expert on your team to identify when this is needed and to carry it out periodically.

An expert Solr admin will usually keep a close watch on the performance of the Solr servers. Solr provides an admin interface that reports on the documentCache, filterCache and queryResultCache, with statistics such as cache lookups, cache hits, cache hit ratio and cache size. The admin needs to be careful when partitioning the index (as it usually leads to a re-index of the entire data set), and search queries have to be modified to support multiple indices spread across distributed Solr servers.
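As a rough illustration of how an admin might read those cache statistics, the sketch below derives a hit ratio from lookups and hits and flags caches that fall under a chosen floor. The function names, the input shape and the 0.8 floor are assumptions made for this example; they are not Solr defaults or part of any Solr API.

```python
from typing import Dict, List, Tuple


def cache_hit_ratio(lookups: int, hits: int) -> float:
    """Return hits / lookups, or 0.0 when the cache has not been used yet."""
    if lookups == 0:
        return 0.0
    return hits / lookups


def caches_needing_attention(
    stats: Dict[str, Tuple[int, int]], min_ratio: float = 0.8
) -> List[str]:
    """List caches whose hit ratio is below min_ratio.

    stats maps a cache name to (lookups, hits). The 0.8 floor is an
    arbitrary illustration, not a recommended value.
    """
    flagged = []
    for name, (lookups, hits) in stats.items():
        # Skip caches with no traffic: a ratio is meaningless for them.
        if lookups > 0 and cache_hit_ratio(lookups, hits) < min_ratio:
            flagged.append(name)
    return flagged
```

A persistently low ratio on a busy cache is one of the signals that the server may be running out of headroom.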

Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the amount of data or query volume changes. When a search instance exceeds 80% CPU utilization, Amazon CloudSearch scales out your search domain by adding a new search instance to handle the increased traffic. As traffic and concurrency grow, Amazon CloudSearch keeps adding new search instances to that domain. This automation removes the complexity and manual labour of the scale-out process. Conversely, when a search instance drops below 30% CPU utilization, Amazon CloudSearch scales the search domain back down by removing the additional search instances in order to minimize costs. This is one of the most important points in favour of Amazon CloudSearch.
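The threshold rule described above can be sketched as a small decision function. This is only an illustration of the 80% and 30% figures quoted in this article; the function name, the one-instance-at-a-time step and the minimum of one instance are assumptions, and the sketch does not claim to reproduce how Amazon CloudSearch is implemented internally.

```python
SCALE_OUT_CPU_PERCENT = 80.0  # above this, add a search instance
SCALE_IN_CPU_PERCENT = 30.0   # below this, remove an extra search instance


def desired_instance_count(
    current: int, cpu_percent: float, min_instances: int = 1
) -> int:
    """Return the instance count suggested by the CPU thresholds.

    Hypothetical helper: it changes the count by at most one per call and
    never goes below min_instances.
    """
    if cpu_percent > SCALE_OUT_CPU_PERCENT:
        return current + 1
    if cpu_percent < SCALE_IN_CPU_PERCENT and current > min_instances:
        return current - 1
    return current
```

With Apache Solr, a rule like this is something your own monitoring and operations tooling would have to evaluate and act on.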

Advantage: Amazon CloudSearch

Feature weight: High

Index Distribution and Partitions:

When your search data grows and the index and data volume can no longer fit entirely on a single Amazon EC2 instance, the common approach is to either scale up (upgrade the hardware) or scale out (add new EC2 instances and partition the index among them). Rapidly growing online applications that depend heavily on search data need to adopt both strategies periodically.

Partitioning data and index volumes in Apache Solr is a critical and manual process. It involves careful planning of when partitions are added to the existing search server farm. If this activity is not planned and executed by a Solr expert, it can even result in downtime and data corruption or loss. Apache Solr uses a technique called sharding to partition and distribute indexes. When you query the data, you supply the list of Solr shards as a query parameter and the results from them are aggregated. Every document also needs a unique key (ID), because the index is split by rows and those rows are distinguished from one another by their document ID. For example, “uniqueId.hashCode() % numServers” determines which server or EC2 instance a document should be indexed on.

The ability to search across shards is built into the Apache Solr query request handlers, so no special configuration is needed to activate it. You can issue the search request to any Solr instance in the shard list; that server delegates the same request to each of the Solr servers identified in the shards parameter, aggregates the results and returns the query response. A sample distributed search URL in Apache Solr looks like this:

http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr
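The two ideas above, routing a document to a shard by its unique ID and querying across shards with the shards parameter, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and CRC32 is swapped in for Java's hashCode() because Python's built-in hash() for strings is not stable across processes. Any stable hash works, as long as indexing always uses the same one.

```python
import zlib
from typing import List
from urllib.parse import urlencode


def shard_index_for(unique_id: str, num_servers: int) -> int:
    """Pick the shard for a document from a stable hash of its unique ID.

    CRC32 stands in for the article's uniqueId.hashCode(); it is always
    non-negative, so the modulo result is a valid shard index.
    """
    if num_servers <= 0:
        raise ValueError("num_servers must be positive")
    return zlib.crc32(unique_id.encode("utf-8")) % num_servers


def distributed_search_url(base_url: str, shards: List[str], query: str) -> str:
    """Build a select URL that asks one Solr node to query every shard."""
    params = {"shards": ",".join(shards), "indent": "true", "q": query}
    # Keep ':', '/' and ',' readable, as in the sample URL above.
    return base_url + "/select?" + urlencode(params, safe=":/,")


if __name__ == "__main__":
    shards = ["localhost:8983/solr", "localhost:7574/solr"]
    print(shard_index_for("doc-42", len(shards)))
    print(distributed_search_url("http://localhost:8983/solr", shards, "ipod solr"))
```

Note that changing the number of servers changes the result of the modulo for most documents, which is why adding a shard usually means re-indexing the whole data set.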

Amazon CloudSearch is a fully managed search service provided by AWS. It first scales up the search instances automatically as the data and index volume grows. Once the index outgrows the largest search instance type, Amazon CloudSearch automatically scales out to new search instances and partitions the index across them. Conversely, when your data volume shrinks, it scales the search instances back down automatically. This is an important automation provided by AWS: it saves significant hardware and talent cost and removes the complexity of managing this activity.

Advantage: Amazon CloudSearch

Feature weight: High

Index Replication of Search Tier:

Replication of the search index is a strategy used to handle high volumes of search traffic in online applications. Whenever high availability is needed, or a single search instance cannot handle the load, we can create a new search instance and replicate the indexes from the master node for HA and to distribute the load.

Apache Solr supports index replication, but it is a manual process that includes spawning new instances and configuring them to replicate between the servers. A replication handler has to be configured on both the master and the slave EC2 instances. On the master, you have to specify the “replicateAfter” values, and on the slave you have to set the “masterUrl” attribute to the fully qualified URL of the master's replication handler. If the URL of the master ever changes, all the slaves have to be stopped, reconfigured and restarted. Replication can be used as a strategy for distributing load as well as for high availability in the search layer.
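To make the master and slave settings concrete, the sketch below generates the kind of replication handler fragments that go into each node's solrconfig.xml. It is a hedged illustration: the helper name, the host name master.example.internal and the pollInterval value are assumptions for this example, and the exact options should be checked against the Solr version you run.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def replication_handler_xml(role: str, settings: List[Tuple[str, str]]) -> str:
    """Build a <requestHandler> fragment for a master or slave node.

    settings is a list of (name, value) pairs so that a name such as
    replicateAfter can be repeated.
    """
    if role not in ("master", "slave"):
        raise ValueError("role must be 'master' or 'slave'")
    handler = ET.Element(
        "requestHandler",
        {"name": "/replication", "class": "solr.ReplicationHandler"},
    )
    section = ET.SubElement(handler, "lst", {"name": role})
    for name, value in settings:
        ET.SubElement(section, "str", {"name": name}).text = value
    return ET.tostring(handler, encoding="unicode")


# Master: publish a new index version after each commit and on startup.
master_xml = replication_handler_xml(
    "master", [("replicateAfter", "commit"), ("replicateAfter", "startup")]
)

# Slave: poll the master's replication handler (hypothetical host name).
slave_xml = replication_handler_xml(
    "slave",
    [
        ("masterUrl", "http://master.example.internal:8983/solr/replication"),
        ("pollInterval", "00:00:60"),
    ],
)

if __name__ == "__main__":
    print(master_xml)
    print(slave_xml)
```

Because every slave carries the master's URL in its own configuration, a change of master address has to be rolled out to each slave individually.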

Amazon CloudSearch handles replication in a sophisticated way. It automatically scales your search domain to meet your traffic demands by replicating the partitions as needed. Automated replication combined with scaling keeps your site available with acceptable latency and performance at all times. Automating this entire process brings the following benefits:

Since search capacity closely tracks actual demand (load) in Amazon CloudSearch, the infrastructure cost leakage caused by over-provisioning capacity is avoided.

Since search instance capacity is replicated automatically and the load is distributed, the search layer is robust and highly available at all times. This avoids business loss caused by downtime and poor performance. (Note: Currently a single search domain cannot span multiple Availability Zones.)

Since the entire process of scaling and distribution is automated, it avoids manual labour and errors and saves talent cost for companies.

Advantage: Amazon CloudSearch

Feature weight: High

Algorithms:

Apache Solr ships with many algorithms, including cache implementations such as LRUCache and FastLRUCache. Because Solr is open source, it can be extended with your own algorithms. Since Amazon CloudSearch is proprietary technology, its algorithms cannot be extended. Bear in mind, though, that the default Amazon CloudSearch algorithms are sophisticated and will suffice for most application use cases.
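For readers unfamiliar with the eviction policy behind names like LRUCache, here is a minimal sketch of least-recently-used caching. It is not Solr's implementation (Solr's caches are Java classes with warming and statistics); the class name and methods are made up for this illustration.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class SimpleLRUCache:
    """A tiny least-recently-used cache, for illustration only."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Optional[Any] = None) -> Any:
        if key not in self._items:
            return default
        # A read counts as a use, so move the entry to the "recent" end.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        # Evict the least recently used entry once over capacity.
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)
```

With an open-source engine, a policy like this is the kind of component you can replace with your own; with a managed service you rely on what the provider ships.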

Advantage: Apache Solr

Feature weight: Low

High Availability:

Amazon CloudSearch is a highly available service. Replicating the index across multiple search instances for performance and availability is already taken care of by Amazon CloudSearch. The only constraint is that a single search domain currently cannot span multiple Availability Zones (AZs). This constraint may be addressed by the AWS team in the future. The cost of managing the HA setup is absorbed entirely by AWS.

With Apache Solr, on the other hand, we need to configure master-slave replication of the indexes manually. We can manually place master and slave Solr EC2 instances in multiple AZs for high availability inside an Amazon EC2 region. (Note: Regional data transfer charges apply to replication traffic when configured across multiple AZs.) In this case we need to absorb the cost of configuring and managing the HA setup ourselves.

Advantage: Amazon CloudSearch

Feature weight: High

Part 4: Economics behind choosing Amazon CloudSearch vs Apache Solr (in progress)
