Wednesday, January 16, 2013

Full Article : Amazon CloudSearch vs Apache Solr


The search module plays an integral role in today’s websites and online applications. It is the entry point for many online business transactions, and the user experience it delivers is critical: a search feature has the power to make or break an online business. In this article, we compare Apache Solr, one of the most popular open source search engines, with the recently introduced Amazon CloudSearch web service in detail. (Note: I have used Apache Solr v3.6 for this comparison.)
               
Breed :
Apache Solr is open source software. It is written entirely in Java and uses Lucene under the hood. Amazon CloudSearch is a proprietary service based on Amazon’s time-tested A9 technology. Highly specialized search solution companies like LucidWorks, Search Technologies etc. may prefer creating plugins and modules using open source code. Though open source gives us flexibility and control, not many companies change the source and customize it often. In general, we have seen more businesses prefer to use search modules as appliances and services. Customers are happy if you provide them a robust service with a well-defined feature set and relieve them of operational intricacies (as Amazon CloudSearch does).
Advantage:  Neutral
Feature Weight: Medium

Configuration & Getting started Effort:
Deep understanding and considerable time and effort are needed to properly configure Apache Solr and get it up and running on Amazon EC2 infrastructure. This includes common tasks such as downloading Apache Solr, knowing Java, configuring environment variables, deploying to a server, configuring it properly, understanding the admin commands, applying patches, tuning performance and upgrading to newer versions. This is just the start: when your application grows rapidly, you need to factor in high availability, scalability and partitioning for the search tier as well. Application and IT teams need to be aware of the replication and sharding techniques of Apache Solr and configure them depending upon the need. On the other hand, Amazon CloudSearch is a fully managed search service which offloads the administrative burden of operating your search tier. You can get started with Amazon CloudSearch in a few clicks using the AWS Management Console. Customers need not worry about hardware provisioning, data partitioning, running out of disk space, planning compute capacity or software patches.
Advantage: Amazon CloudSearch
Feature Weight: High

Multilingual Support :
Apache Solr has multilingual support. Custom analyzers and tokenizers have to be written and plugged in for this functionality. One of the recommended approaches for multilingual search in Apache Solr is a multi-core architecture with each core addressing one language.
Currently Amazon CloudSearch supports only the English language for tokenizing words in the index. Though multilingual support is not critical, it is a good-to-have feature for applications which offer localized services to a worldwide audience.
Advantage: Apache Solr
Feature Weight: Low

Faceted Search:
Faceting is one of the important features used in ecommerce website search modules. Faceting allows you to categorize your results into sub-groups, which can be used as the basis for another search. In recent times, faceting has gained popularity by allowing users to narrow down search results in an easy-to-use and efficient way.
Faceting can be best explained with the help of a picture (see the figure below from Amazon.com). As you can observe on the left side of the figure, a search for “java programming” returns a lot of hits. You can clearly see that the search produced 3 facets (or sub-groups) which you can use to narrow down your search. For example, if you click on “PDF” in the “Format” facet (see “Facet 2” in the figure), the search query now essentially means “java programming AND only PDF format”, thereby narrowing down the search space and eventually leading you to better and more convenient results. You can also observe that each member of a facet is accompanied by a number called the Facet Count. In the “Format” facet, you can see “PDF (14)”, which means that there are 14 “java programming” results in PDF format. The important aspect of this feature is that as you go deeper using facets, the resultant search space is vastly reduced and hence the search will be considerably faster.
Both Apache Solr and Amazon CloudSearch allow the user to perform faceting with minimal effort.



Advantage: Neutral
Feature Weight: High
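
To make the faceting discussion above concrete, here is a minimal sketch of how an application might build a faceted query against Solr's standard select handler. The field names "format" and "author" and the localhost URL are assumptions for illustration only; `facet`, `facet.field` and `fq` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr select URL that asks for facet counts.

    `filters` maps a field name to the facet value the user clicked.
    """
    params = [("q", text), ("facet", "true")]
    for field in facet_fields:
        # One facet.field parameter per field we want counts for.
        params.append(("facet.field", field))
    for field, value in (filters or {}).items():
        # fq restricts the result set to the chosen facet value.
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical fields: the user searched "java programming" and clicked "PDF".
url = build_facet_query(
    "http://localhost:8983/solr",
    "java programming",
    ["format", "author"],
    {"format": "PDF"},
)
print(url)
```

Each click on a facet value simply adds another `fq` filter, which is why the search space shrinks as the user drills down.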

Field Weighting / Boosting:
Field weighting is the process of assigning different prominence to the same word when it appears in different places in a document. For example, when the phrase “Harry Potter” is present in the title of a document, the document is ranked higher than when the same phrase is present only in the References section.
Both Apache Solr and Amazon CloudSearch allow field boosting with minimal effort.
Advantage: Neutral
Feature Weight: High
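
The idea of field weighting can be illustrated with a toy scoring function. This is only a sketch of the concept, not how Solr or CloudSearch actually compute relevance; the weights and documents are made up for the example.

```python
# Hypothetical boosts: a match in the title counts ten times as much
# as a match in the references section.
FIELD_WEIGHTS = {"title": 10.0, "references": 1.0}


def score(doc, phrase, weights=FIELD_WEIGHTS):
    """Sum the weight of every field whose text contains the phrase."""
    phrase = phrase.lower()
    return sum(
        weight
        for field, weight in weights.items()
        if phrase in doc.get(field, "").lower()
    )


docs = [
    {"id": "a", "title": "A survey of children's fiction",
     "references": "Harry Potter and the Philosopher's Stone"},
    {"id": "b", "title": "Harry Potter and the Chamber of Secrets",
     "references": ""},
]

ranked = sorted(docs, key=lambda d: score(d, "Harry Potter"), reverse=True)
ranked_ids = [d["id"] for d in ranked]
print(ranked_ids)
```

Document "b" ranks first because the phrase appears in its title, while "a" only mentions it in the references.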


Auto Suggest:
Many search boxes present suggestions of popular queries relevant to the input as the user types a search query, and refine the suggestion list as additional characters are typed. This feature is called Auto Suggest. It can be implemented at the search engine level or at the search application level.
Apache Solr has native support for the auto suggest feature. It can be implemented in several ways, using NGramFilterFactory, EdgeNGramFilterFactory or TermsComponent. This feature of Apache Solr is usually used in conjunction with jQuery to create a powerful auto suggestion experience in applications.
Amazon CloudSearch currently has no direct support for auto suggest. We have to implement it in our search application tier.
Advantage: Apache Solr
Feature Weight: Medium
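
For readers who have to implement auto suggest in the application tier, the sketch below shows two building blocks under simple assumptions: generating edge n-grams (the leading substrings an edge n-gram filter would index) and a prefix lookup over a sorted list of past queries. The query list is made up, and suggestions come back in alphabetical order rather than by popularity.

```python
from bisect import bisect_left


def edge_ngrams(term, min_size=1, max_size=15):
    """Return the leading substrings of `term`, shortest first."""
    upper = min(len(term), max_size)
    return [term[:n] for n in range(min_size, upper + 1)]


def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` entries of a sorted list that start with `prefix`."""
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break  # sorted input: no later entry can match
        matches.append(query)
        if len(matches) == limit:
            break
    return matches


# Hypothetical list of previously seen queries.
popular = sorted(["java programming", "java tutorial", "javascript", "python"])

print(edge_ngrams("solr"))
print(suggest("jav", popular))
print(suggest("java t", popular))
```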

Geospatial Search:
Consider an example where a user performs a search for “Starbucks”: the search engine must show the nearest outlet based on the user’s current location. Such location-aware searches produce significantly better results and help the user find the right information more effectively and efficiently. This use case shows the importance of geospatial search. In today’s mobile world, it is an important feature in many location-aware business applications.
Apache Solr supports geospatial search through the field type class solr.LatLonType. Actions such as sorting the results by distance and boosting documents by distance can be performed.
Amazon CloudSearch has a very limited geospatial search feature set. As of now, Amazon CloudSearch can return documents within a specific area. Missing features include sorting by geographical distance and faceting by distance.
Advantage: Apache Solr
Feature Weight:  Medium
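
Where the search service can only return documents within an area, one possible workaround (an assumption on my part, not something either product prescribes) is to sort the returned documents by distance in the application tier. A minimal sketch using the haversine great-circle formula, with made-up outlet coordinates:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def sort_by_distance(user_lat, user_lon, outlets):
    """Return the outlets ordered from nearest to farthest."""
    return sorted(
        outlets,
        key=lambda o: haversine_km(user_lat, user_lon, o["lat"], o["lon"]),
    )


# Hypothetical outlets returned by an area search.
outlets = [
    {"name": "Outlet A", "lat": 13.0827, "lon": 80.2707},
    {"name": "Outlet B", "lat": 12.9716, "lon": 77.5946},
]

nearest = sort_by_distance(12.97, 77.59, outlets)[0]
print(nearest["name"])
```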

“Find Similar” feature:
With this feature, the search engine suggests similar records based on a particular record. It is similar to the “Find Similar” resumes feature used by popular job search engines. Ecommerce sites also benefit from this feature, as research suggests that users typically compare products before making a transaction and are likely to buy the product they judge to be better. Apache Solr implements this feature using handlers/components like MoreLikeThisHandler or MoreLikeThisComponent.
Amazon CloudSearch currently does not support this feature.
Advantage: Apache Solr
Feature Weight: High 


“Did you mean…” feature:
Sometimes when you search for a misspelled word, you will be presented with the correct spelling. Search engines like Google automatically correct the spelling and even present you with the search results for the corrected query. This feature of presenting the user with spelling-corrected suggestions is called the “Did you mean” feature.
Apache Solr supports this feature with the SpellCheck search component. The recommended approach is to build the word corpus from the index itself, principally because your data will contain proper nouns and other words not present in a general-purpose dictionary.
Amazon CloudSearch currently has no support for the “Did you mean…” feature.
Advantage: Apache Solr
Feature weight: High
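
The idea of suggesting corrections from a corpus built out of your own index can be sketched in a few lines. This example uses the similarity matching in Python's standard difflib module rather than Solr's spellcheck algorithm, and the vocabulary is a made-up stand-in for the terms in an index.

```python
from difflib import get_close_matches


def did_you_mean(word, index_vocabulary, max_suggestions=3):
    """Suggest indexed terms that closely resemble a possibly misspelled word."""
    word = word.lower()
    if word in index_vocabulary:
        return []  # already an indexed term, nothing to suggest
    return get_close_matches(
        word, sorted(index_vocabulary), n=max_suggestions, cutoff=0.8
    )


# Hypothetical vocabulary extracted from the search index.
vocabulary = {"programming", "java", "solr", "cloudsearch"}

suggestions = did_you_mean("progamming", vocabulary)
print(suggestions)
```

Because the vocabulary comes from the index, proper nouns such as "cloudsearch" can be suggested even though no general-purpose dictionary contains them.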

Rich Documents Support:

Rich document types like HTML, PDF, Word etc. can be uploaded into the search engine to make them searchable. The uploaded documents are parsed into a native format and indexed by the search engine. Users and applications can then search the indexed documents using common search terms and patterns. Systems like document management systems, CMS etc. usually use this feature of a search engine/service to help their customers search through uploaded documents. In a typical enterprise scenario you can expect a variety of document formats to flow into the search systems from different applications.
Apache Solr supports rich document parsing and indexing using Apache Tika.
Amazon CloudSearch expects data to be in the Search Data Format (SDF), expressed as JSON or XML. CloudSearch supports uploading rich documents via the Console or via the cs-generate-sdf command line tool: cs-generate-sdf extracts the data on the client, and the extracted text is sent to CloudSearch.
Advantage: Neutral
Feature Weight: High
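
As a rough illustration of what "extract on the client, send text to CloudSearch" means, the sketch below wraps already-extracted text in a JSON batch. The key names (type, id, version, lang, fields) reflect my understanding of the SDF format at the time of writing and should be checked against the CloudSearch documentation; the document ID and field names are hypothetical.

```python
import json


def to_sdf_add(doc_id, version, fields, lang="en"):
    """Build one 'add' operation for text that was already extracted."""
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }


# Hypothetical document whose text was extracted from a PDF on the client.
batch = [
    to_sdf_add(
        "doc1",
        1,
        {"title": "Quarterly report", "content": "Text extracted from the PDF"},
    ),
]

payload = json.dumps(batch, indent=2)
print(payload)
```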

Feature Customization:
Sometimes search software may not support a specific feature natively because there might not be sufficient demand for it to be added to the core. In such cases, some search software provides the capability to customize and extend the existing feature set through plugins and modules. Amazon CloudSearch, being a proprietary service, does not allow any customization, either through plugin integration or by extending functionality. Features will be rolled out only by the AWS team. In my experience, the AWS team is usually very proactive, accessible and receptive. You can speak to an AWS architect or product manager and explain your specific need. If your need turns out not to be as specific as you think and is being requested by a considerable number of customers around the world, they will include it in their road map.
Apache Solr, being open source, allows customization of analyzers, tokenizers, indexers and query analysis through plugins and by extending its code base.
Advantage: Apache Solr
Feature weight: Medium


Stemming, Stop Words and Synonyms:
Stemming: A stemming dictionary maps related words to a common stem. A stem is typically the root or base word from which variants are derived. For example, run is the stem of running and ran.
Stop words: Stop words are words that should typically be ignored both during indexing and at search time because they are either insignificant or so common that including them would result in a massive number of matches. For example, a, an, and, the, to etc. are commonly used words which can be ignored during indexing.
Synonyms: You can configure synonyms for terms that appear in the data you are searching. That way, if a user searches for the synonym rather than the indexed term, the results will include documents that contain the indexed term. For example, you might want to configure synonyms so that a search for "Rocky Four" or "Rocky 4" will match the movie titled "Rocky IV". To do that, you would configure 4 and four as synonyms of the indexed term IV.
Both Apache Solr and Amazon CloudSearch support these features.
Advantage: Neutral
Feature Weight: High
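
The three text-processing options above can be pictured as a small analysis pipeline applied to both documents and queries. This is a simplified sketch using the examples from this section; the dictionaries are illustrative, and real engines apply far more elaborate rules.

```python
STOP_WORDS = {"a", "an", "and", "the", "to"}
STEMS = {"running": "run", "ran": "run"}   # stemming dictionary
SYNONYMS = {"4": "iv", "four": "iv"}       # synonym -> indexed term


def analyze(text):
    """Lowercase, drop stop words, then map synonyms and stems."""
    tokens = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        token = SYNONYMS.get(token, token)
        token = STEMS.get(token, token)
        tokens.append(token)
    return tokens


print(analyze("Rocky Four"))
print(analyze("Rocky IV"))
print(analyze("The man ran to the store"))
```

Since "Rocky Four", "Rocky 4" and "Rocky IV" all analyze to the same tokens, a query for any of them matches the movie title.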

Support for protocols:
Both Amazon CloudSearch and Apache Solr support the HTTP and HTTPS protocols. Amazon CloudSearch also includes web service interfaces to configure firewall settings that control network access to your domain.
Advantage: Neutral
Feature Weight : High


Scaling the Search Tier:
Scaling up/out based on CPU load is an important architectural design consideration for high-volume websites. Scale-up is the process of migrating from a smaller instance to a larger instance as load increases, whereas scale-out is the process of spawning multiple instances to handle the load. Both are equally important, and the choice should be based on cost and the needs of the use case.
Scaling out Apache Solr instances is a manual and complex process. When the search traffic increases beyond the threshold of a particular server and starts affecting its performance, you have to manually spawn new Solr EC2 instances, transfer/partition the index, auto-warm the caches and re-route/distribute the search queries to the new Solr EC2 instances. It requires a Solr expert in your team to identify and execute this activity periodically.
An expert Solr admin will usually keep a close watch on the performance of the Solr servers. Solr provides an admin interface which has information about the documentCache, filterCache and queryResultCache, and statistics such as cache lookups, cache hit ratio and cache size. The admin needs to be careful when partitioning the index (as it usually leads to a re-index of the entire data set), and search queries have to be modified to support the presence of multiple indices across distributed Solr servers.
Amazon CloudSearch is a fully managed search service; it scales seamlessly as the amount of data or query volume changes. When a search instance exceeds 80% CPU utilization, Amazon CloudSearch scales out your search domain by adding a new search instance to handle the increased traffic. As the traffic and concurrency grow, Amazon CloudSearch keeps adding new search instances for that domain. This automation removes the complexity and manual labour required in the scale-out process. Conversely, when a search instance falls below 30% CPU utilization, Amazon CloudSearch scales your search domain back down by removing the additional search instances in order to minimize costs. This is one of the most important points in favour of Amazon CloudSearch.
Advantage: Amazon CloudSearch
Feature weight: High
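
The threshold behaviour described above can be written down as a simple rule. This is only a sketch of the rule as stated in this section (80% and 30% CPU), not AWS's actual implementation; the function name and the one-instance-at-a-time step are assumptions.

```python
SCALE_OUT_CPU = 80.0  # add an instance above this CPU utilization
SCALE_IN_CPU = 30.0   # remove an instance below this CPU utilization


def desired_instances(current, cpu_percent, minimum=1):
    """Return the instance count after applying the threshold rule once."""
    if cpu_percent > SCALE_OUT_CPU:
        return current + 1
    if cpu_percent < SCALE_IN_CPU and current > minimum:
        return current - 1
    return current


print(desired_instances(1, 85.0))
print(desired_instances(2, 20.0))
print(desired_instances(2, 50.0))
```

With Apache Solr on EC2, someone on your team has to watch these numbers and act on them; with Amazon CloudSearch the equivalent decision is made for you.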

Index Distribution and Partitions:
When your search data grows and it is not possible to accommodate the index and data volume in a single Amazon EC2 instance, the common approach is to either scale up (upgrade the hardware) or scale out (add new EC2 instances and partition the index among them). Rapidly growing online applications that are heavily dependent upon search data need to adopt both strategies periodically.
Partitioning data and index volumes in Apache Solr is a critical and manual process. It involves careful planning of when partitions have to be added to the existing search server farm. If this activity is not planned and executed by a Solr expert, it might even result in downtime and data corruption/loss. Apache Solr uses a technique called sharding for partitioning and distributing indexes. When you query for data, you supply the list of Solr shards as a query parameter and the results from them are aggregated. Every document also needs a unique key (ID), because you are breaking up the index by rows, and these rows are distinguished from each other by their document ID. For example, “uniqueId.hashCode() % numServers” determines which server/EC2 instance a document should be indexed on. The ability to search across shards is built into the Apache Solr query request handlers; you do not need any special configuration to activate it. You can issue the search request to any Solr instance, and that server will in turn delegate the request to each of the Solr servers identified in the shards parameter, aggregate the results and return the query response. A sample distributed search URL in Apache Solr looks like the one below:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr
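
The two pieces described above, choosing a shard for a document and building the distributed search URL, can be sketched as follows. Note that this sketch uses a CRC32 checksum in place of Java's hashCode() (Python's built-in hash() of strings changes between runs), so the shard numbers will not match what a Java client would compute; the document ID is made up.

```python
import zlib
from urllib.parse import urlencode


def shard_for(unique_id, num_servers):
    """Pick the shard index for a document from a stable hash of its ID."""
    return zlib.crc32(unique_id.encode("utf-8")) % num_servers


def distributed_search_url(entry_point, shards, query):
    """Build a select URL that asks one Solr server to query all shards."""
    params = {"shards": ",".join(shards), "indent": "true", "q": query}
    # Keep ':', '/' and ',' readable inside the shards parameter.
    return f"http://{entry_point}/select?{urlencode(params, safe=':/,')}"


shards = ["localhost:8983/solr", "localhost:7574/solr"]

print(shard_for("doc-42", len(shards)))
url = distributed_search_url(shards[0], shards, "ipod solr")
print(url)
```

The URL printed by this sketch is the same as the sample URL shown above.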
Amazon CloudSearch is a fully managed search service provided by AWS. When the data/index volume grows, it first scales up the search instances automatically. Once the index volume outgrows the search instance type itself, Amazon CloudSearch automatically scales out to new search instances and partitions the index across them. Internally, Amazon CloudSearch uses multiple search instances in the scale-out approach. Conversely, when your data volume shrinks, it scales down the search instances automatically. This is an important automation provided by AWS that saves a lot of hardware and talent cost and eases the complexity of managing this activity.
Advantage: Amazon CloudSearch
Feature weight: High

Index Replication of Search Tier:
Replication of the search index is a strategy used to handle high volumes of search traffic in online applications. Whenever high availability is needed, or a single search instance is not able to handle the load, we can create new search instances and replicate the index from the master node for HA and load distribution.
Apache Solr supports index replication, but it is a manual process that includes spawning new instances and configuring them to enable replication between the servers. A replication handler has to be configured on both the master and slave EC2 instances. On the master, you have to specify the “replicateAfter” values, and on the slave you have to set the attribute “masterUrl” to the fully qualified URL of the master replication handler. If the URL of the master changes at any time, all the slaves have to be stopped, reconfigured and restarted. Replication can be used as a strategy for distributing load as well as for high availability in the search layer.
Amazon CloudSearch is very sophisticated in replication. It automatically scales your search domain to meet your traffic demands by replicating the partitions as needed. Automated replication combined with scaling ensures that your site stays active with acceptable latency and performance levels all the time. Automating this entire process also brings the following benefits:
  • Since the search capacity closely aligns with actual demand (load) in Amazon CloudSearch, the infrastructure cost leakage of over-provisioned capacity is avoided.
  • Since the search instance capacity is automatically replicated and load is distributed, the search layer is robust and highly available at all times. This avoids business loss because of downtime and poor performance. (Note: Currently a single search domain cannot span multiple AZs.)
  • Since the entire process of scaling and distribution is automated, it avoids manual labour and error, and saves talent cost for companies.
Advantage: Amazon CloudSearch
Feature weight: High

Algorithms:
Apache Solr has many algorithms, including cache implementations such as LRUCache and FastLRUCache. Being open source, Solr can be extended by adding your own algorithms. Since Amazon CloudSearch is proprietary technology, its algorithms cannot be extended. But please bear in mind that the default Amazon CloudSearch algorithms are sophisticated and will suffice for most application use cases.
Advantage: Apache Solr
Feature weight: Low
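
For readers unfamiliar with the caching algorithm named above, here is a minimal sketch of the least-recently-used (LRU) idea. It illustrates the general eviction policy only and is not Solr's LRUCache or FastLRUCache code; the keys and values are made up.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit makes this entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            # Drop the entry that has gone unused the longest.
            self._entries.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("q=ipod", "results for ipod")
cache.put("q=solr", "results for solr")
cache.get("q=ipod")                       # touch: "q=solr" is now the oldest
cache.put("q=java", "results for java")   # evicts "q=solr"
print(cache.get("q=solr"))
```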

High Availability:
Amazon CloudSearch is a highly available service. Replicating the index across multiple search instances for performance and availability is already taken care of by Amazon CloudSearch. The only constraint is that a single search domain currently cannot span multiple Availability Zones (AZs). This constraint might be addressed in future by the AWS team. The cost of managing the HA setup is completely absorbed by AWS.
On the other hand, we need to manually configure master-slave replication of indexes in Apache Solr. We can manually configure master and slave Solr EC2 instances in multiple AZs for high availability inside an Amazon EC2 region. (Note: Regional data transfer charges apply to replication traffic when configured in multi-AZ mode.) We need to absorb the cost of configuring and managing the HA setup in this case.

Advantage: Amazon CloudSearch
Feature Weight: High


Cost:
Refer to this article for a detailed cost comparison: Part 4: Economics behind choosing Amazon CloudSearch vs Apache Solr

Advantage: Amazon CloudSearch
Weight: High




I have summarized all the features compared above into a table for easy reference. The table is listed below:
* means positive, X means negative
Weight: High/Medium/Low indicates the importance of a feature (from my perspective)


No. | Feature                  | Weight | Amazon CloudSearch | Apache Solr on EC2
1.  | Getting Started          | High   | *                  | X
2.  | Scalability              | High   | *                  | X
3.  | Partitioning             | High   | *                  | X
4.  | Index Replication        | High   | *                  | X
5.  | High Availability        | High   | *                  | X
6.  | Cost                     | High   | *                  | X
7.  | Faceted Search           | High   | *                  | *
8.  | Field Weighting/Boosting | High   | *                  | *
9.  | Rich Documents Support   | High   | *                  | *
10. | Stemming                 | High   | *                  | *
11. | Stop words               | High   | *                  | *
12. | Synonyms                 | High   | *                  | *
13. | Protocols Support        | High   | *                  | *
14. | “Find Similar” Feature   | High   | X                  | *
15. | “Did you mean” Feature   | High   | X                  | *
16. | Breed                    | Medium | *                  | *
17. | Feature Customization    | Medium | X                  | *
18. | Auto Suggest             | Medium | X                  | *
19. | Geo Spatial Search       | Medium | X                  | *
20. | Algorithms               | Low    | X                  | *
21. | Multilingual Support     | Low    | X                  | *


Observations:
  • Amazon CloudSearch scores well overall on most of the “High” priority features in comparison with Apache Solr, especially on infrastructure-related features like scaling and partitioning. These infrastructure features are essential for any online application which depends heavily on the search tier. Activities like scaling, partitioning and replication usually involve complex manual effort, planning and execution in the search tier. Amazon CloudSearch eliminates this complexity by automating these essentials.
  • The manual effort involved in the above-mentioned search infrastructure activities translates directly into the cost of training, managing and maintaining this tier with the help of experts, and these experts are usually costly. Amazon CloudSearch, with its automation, brings down this manual effort (and thereby the cost) significantly in comparison to expanding Apache Solr setups on EC2. This is an important aspect to consider when selecting a search tier for your online applications. If your online application is constantly growing in terms of index and compute, then Amazon CloudSearch is the way to go compared to Apache Solr.
  • Amazon CloudSearch is a mature, robust and stable search service built on the A9 search platform. For most online use cases like ecommerce, job search, document search, content search etc., it is more than sufficient.
  • IT teams of startups and mid-sized companies, which are usually short of technical staff (especially those who cannot afford dedicated expertise for the search tier), should first evaluate whether Amazon CloudSearch fits their needs. On the whole it will be a better package for them.
  • Enterprises and software vendors who are refining their products for AWS should certainly weigh the merits of Amazon CloudSearch against Apache Solr/MongoDB in their technical stack. In addition, if their deployments have unpredictable or elastic load, Amazon CloudSearch will be a top contender for cost savings.
  • Features like “Find Similar” and “Did you mean” are generally used in the search modules of job and ecommerce applications. They are available in Apache Solr and would be good to have in Amazon CloudSearch. Though they are currently not available, I assume AWS might work on them if lots of customers request them. (+1 vote from me for these features)
  • If you are looking to build a specialized search module with customizations, geospatial and multilingual intelligence, currently the best choice is to use Apache Solr on Amazon EC2. Location-aware and localized applications can easily use the geospatial and multilingual features of Apache Solr on EC2 (missing in Amazon CloudSearch). Over the last few years I have also noticed a pattern on AWS where customers use MongoDB for searching documents and geospatial indexes. Though these requested features are a little specific, Amazon CloudSearch should introduce them for wider use case adoption. (+1 vote from me for these features)
  • For open source developers who are looking to extend/customize the functionality of the search tier, Amazon CloudSearch is not recommended and Apache Solr is the best fit.
This article was co-authored with Vijay. LinkedIn handle: in.linkedin.com/in/vijayolety




4 comments:

Anonymous said...

Why is the cost negative for Solr on EC2?

Unknown said...

Great article. Thanks for taking the time. It looks like each new CloudSearch (functionality) feature will take away another reason to delve into Solr.

Harish Ganesan said...

Please refer to this article for cost comparison http://harish11g.blogspot.in/2013/02/amazon-cloudsearch-solr-cost-comparison.html

Shalin Shekhar Mangar said...

Hi Harish, are you planning to update this article to take SolrCloud into account? A lot of sharding/replication related stuff is outdated.


