Comparison Analysis: Apache Solr vs Amazon CloudSearch (continued from Part 2)
Scaling the Search Tier:
Scaling up or out based on CPU load is an important architectural design consideration for high-volume websites. Scale-up is the process of migrating from a smaller instance to a larger instance as load increases, whereas scale-out is the process of spawning multiple instances to share the load. Both are equally important, and the choice between them should be based on cost and the needs of the use case.
Scaling out Apache Solr instances is a manual and complex process. When search traffic increases beyond the threshold of a particular server and starts affecting its performance, you have to manually spawn new Solr EC2 instances, transfer or partition the index, auto-warm the caches and re-route or distribute the search queries to the new Solr EC2 instances. It requires a Solr expert on your team to identify the need for this activity and execute it periodically.
An expert Solr admin will usually keep a close watch on the performance of the Solr servers. Solr provides an admin interface that reports on the documentCache, filterCache and queryResultCache, with statistics such as cache lookups, hits, hit ratio and size. The admin needs to be careful when partitioning the index, as it usually requires re-indexing the entire data set, and search queries have to be modified to account for multiple indices spread across distributed Solr servers.
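As a rough illustration of how an admin might read these cache statistics, the sketch below computes a hit ratio from lookups and hits and flags caches that fall under a threshold. The `CacheStats` class, the sample numbers and the 0.75 threshold are assumptions made up for this example; they are not values reported by Solr.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical container for counters read off the Solr admin page."""
    name: str
    lookups: int
    hits: int
    size: int

    @property
    def hit_ratio(self) -> float:
        # Hit ratio = hits / lookups; a cache with no lookups reports 0.0.
        return self.hits / self.lookups if self.lookups else 0.0


def underperforming(caches, min_ratio=0.75):
    """Names of caches that saw traffic but hit less often than min_ratio."""
    return [c.name for c in caches if c.lookups > 0 and c.hit_ratio < min_ratio]


# Made-up sample numbers, for illustration only.
caches = [
    CacheStats("filterCache", lookups=10_000, hits=9_200, size=512),
    CacheStats("queryResultCache", lookups=8_000, hits=3_600, size=512),
    CacheStats("documentCache", lookups=0, hits=0, size=0),
]

for c in caches:
    print(f"{c.name}: hit ratio {c.hit_ratio:.2f}")
print("needs attention:", underperforming(caches))
```

A persistently low hit ratio on a busy cache is usually the first hint that cache sizes or auto-warming settings deserve a second look before adding servers.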
Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the amount of data or query volume changes. When a search instance exceeds 80% CPU utilization, Amazon CloudSearch scales out your search domain by adding a new search instance to handle the increased traffic. As traffic and concurrency grow, Amazon CloudSearch keeps adding new search instances for that domain. This automation removes the complexity and manual labour of the scale-out process. Conversely, when a search instance falls below 30% CPU utilization, Amazon CloudSearch scales your search domain back in by removing the additional search instances in order to minimize costs. This is one of the most important points in favour of Amazon CloudSearch.
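The threshold rule described above can be restated as a few lines of code. This is only a sketch of the behaviour as described in this post (add an instance above 80% CPU, remove one below 30%); it is not how AWS implements scaling internally, and the function names and CPU samples are made up for illustration.

```python
def scaling_action(cpu_percent: float, instance_count: int,
                   scale_out_above: float = 80.0,
                   scale_in_below: float = 30.0) -> int:
    """Return the change in instance count: +1, -1 or 0."""
    if cpu_percent > scale_out_above:
        return 1  # overloaded: add a search instance
    if cpu_percent < scale_in_below and instance_count > 1:
        return -1  # underused: remove an extra instance, but keep at least one
    return 0


def simulate(cpu_samples, start_instances=1):
    """Apply the rule to a series of CPU readings and record the instance count."""
    count = start_instances
    history = []
    for cpu in cpu_samples:
        count += scaling_action(cpu, count)
        history.append(count)
    return history


# Hypothetical CPU readings over time (percent).
print(simulate([45, 85, 90, 60, 25, 20, 10]))
```

With Solr on EC2, the equivalent of each `+1` step is the manual spawn, index transfer, cache warm-up and re-routing work described earlier.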
Advantage: Amazon CloudSearch
Feature weight: High
Index Distribution and Partitions:
When your search data grows and the index and data volume can no longer fit entirely on a single Amazon EC2 instance, the common approach is either to scale up (upgrade the hardware) or to scale out (add new EC2 instances and partition the index among them). Rapidly growing online applications that depend heavily on search data need to adopt both strategies periodically.
Partitioning data and index volumes in Apache Solr is a critical and manual process. It involves careful planning of when partitions are added to the existing search server farm. If this activity is not planned and executed by a Solr expert, it can result in downtime and data corruption or loss. Apache Solr uses a technique called sharding to partition and distribute indexes. When you query the data, you supply the list of Solr shards as a query parameter and Solr aggregates the results from them. Every document also needs a unique key (ID), because the index is split by rows and these rows are distinguished from each other by their document ID. For example, “uniqueId.hashCode() % numServers” determines which server or EC2 instance a document should be indexed on.
The ability to search across shards is built into the Apache Solr query request handlers; you do not need any special configuration to activate it. You can issue the search request to any Solr instance in the shard list, and that server will in turn delegate the request to each of the Solr servers identified in the shards parameter, aggregate the results and return the query response. A sample distributed search URL in Apache Solr looks like the one below:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr
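To make the routing rule and the shards parameter concrete, here is a minimal sketch. It assumes document IDs are plain strings, mimics Java's String.hashCode() in Python, and reuses the two local shard addresses from the URL above. The helper names (`java_string_hash`, `shard_for`, `distributed_search_url`) and the sample document IDs are hypothetical and not part of Solr.

```python
from urllib.parse import urlencode


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() for strings of BMP characters."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like a Java int
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed


def shard_for(unique_id: str, num_servers: int) -> int:
    """Pick the shard index for a document ID (hash modulo shard count)."""
    # Python's % never returns a negative result for a positive divisor.
    # In Java, hashCode() can be negative, so Math.floorMod would be needed.
    return java_string_hash(unique_id) % num_servers


def distributed_search_url(base: str, shards: list, query: str) -> str:
    """Build a select URL that fans the query out to every listed shard."""
    params = {"shards": ",".join(shards), "indent": "true", "q": query}
    # Keep ':', '/' and ',' unescaped so the shards list stays readable.
    return f"{base}/select?{urlencode(params, safe=':/,')}"


shards = ["localhost:8983/solr", "localhost:7574/solr"]
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "->", shards[shard_for(doc_id, len(shards))])
print(distributed_search_url("http://localhost:8983/solr", shards, "ipod solr"))
```

Note that a modulo scheme ties every document to the current number of servers, which is why adding a shard usually means re-indexing the whole data set.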
Amazon CloudSearch is a fully managed search service provided by AWS. When the data or index volume grows, it first scales up the search instances automatically. Once the index volume outgrows the search instance type itself, Amazon CloudSearch automatically scales out to multiple search instances and partitions the index across them. Conversely, when your data volume shrinks, it scales the search instances back down automatically. This is an important automation provided by AWS: it saves considerable hardware and talent cost and removes the complexity of managing this activity.
Advantage: Amazon CloudSearch
Feature weight: High
Index Replication of Search Tier:
Replicating the search index is a strategy used to handle high volumes of search traffic in online applications. Whenever high availability is needed, or a single search instance cannot handle the load, we can create new search instances and replicate the index from the master node for HA and load distribution.
Apache Solr supports index replication, but it is a manual process that involves spawning new instances and configuring them to enable replication between the servers. A replication handler has to be configured on both the master and slave EC2 instances. On the master, you specify the “replicateAfter” values, and on the slave you set the “masterUrl” attribute to the fully qualified URL of the master's replication handler. If the URL of the master changes at any time, all the slaves have to be stopped, updated with the new URL and restarted. Replication can be used as a strategy for distributing load as well as for high availability in the search layer.
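For reference, a master/slave replication setup in solrconfig.xml typically looks something like the fragments below. Treat this as a sketch: the host name solr-master.example.com, the replicateAfter choices and the 60-second poll interval are placeholder assumptions, and the exact replication URL depends on how your cores are laid out.

```xml
<!-- On the master: publish a new index version after these events -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
  </lst>
</requestHandler>

<!-- On each slave: poll the master's replication handler (placeholder host) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Because masterUrl is hard-coded on every slave, pointing it at a stable DNS name or an Elastic IP rather than an instance address reduces how often the slaves need to be reconfigured.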
Amazon CloudSearch handles replication automatically. It scales your search domain to meet traffic demands by replicating the partitions as needed. Automated replication combined with scaling keeps your site running with acceptable latency and performance levels at all times. Automating this entire process also brings the following benefits:
Since search capacity closely tracks actual demand (load) in Amazon CloudSearch, the cost leakage of over-provisioned infrastructure is avoided.
Since search instance capacity is automatically replicated and load is distributed, the search layer is robust and highly available at all times. This avoids business loss caused by downtime and poor performance. (Note: Currently a single search domain cannot span multiple AZs.)
Since the entire process of scaling and distribution is automated, it avoids manual labour and errors and saves talent cost for companies.
Advantage: Amazon CloudSearch
Feature weight: High
Algorithms:
Apache Solr ships with many built-in algorithms, including cache implementations such as LRUCache and FastLRUCache. Because Solr is open source, it can be extended with your own algorithms. Since Amazon CloudSearch is a proprietary technology, its algorithms cannot be extended. Bear in mind, however, that the default Amazon CloudSearch algorithms are sophisticated and will suffice for most application use cases.
Advantage: Apache Solr
Feature weight: Low
High Availability:
Amazon CloudSearch is a highly available service. Replicating the index across multiple search instances for performance and availability is already taken care of by Amazon CloudSearch. The only constraint is that a single search domain currently cannot span multiple AZs. AWS might remove this constraint in the future. The cost of managing the HA setup is absorbed entirely by AWS.
On the other hand, with Apache Solr we need to manually configure master-slave replication of indexes. We can place the master and slave Solr EC2 instances in multiple AZs inside an Amazon EC2 region for high availability. (Note: Regional data transfer charges apply to replication traffic when configured in multi-AZ mode.) We need to absorb the cost of configuring and managing the HA setup in this case.
Advantage: Amazon CloudSearch
Feature Weight: High
Part 4: Economics behind choosing Amazon CloudSearch vs Apache Solr (in progress)
Related Articles:
Introduction to Apache SolrCloud on AWS
Apache SolrCloud Implementation on Amazon VPC
Configuring Apache SolrCloud on Amazon VPC
Apache SolrCloud on AWS FAQ
Part 1: Comparison Analysis: Amazon CloudSearch vs Apache Solr