Saturday, December 29, 2012

Part 1: Comparison Analysis: Amazon CloudSearch vs Apache Solr

The search module plays an integral role in today’s websites and online applications. It is an entry point for many online business transactions, and the user experience of these services is critical. A search feature has the power to make or break an online business. In this multi-part series, we are going to compare Apache Solr, one of the most popular open source search engines, with the recently introduced Amazon CloudSearch web service in detail.
Breed:
Apache Solr is open source software. It is written entirely in Java and uses Lucene under the hood. Amazon CloudSearch is a proprietary service based on Amazon’s time-tested A9 technology. Highly specialized search solution companies such as LucidWorks and Search Technologies may prefer creating plugins and modules using open source code. Though open source gives us flexibility and control, not many companies change the source and customize it often. In general, we have seen that more businesses prefer using search modules as appliances and services. Customers are happy if you provide them with a robust service that has a well-defined feature set and relieves them of operational intricacies (as Amazon CloudSearch does).
Advantage:  Neutral
Feature Weight: Medium

Configuration & Getting started Effort:
A deep understanding and considerable time and effort are needed to properly configure Apache Solr and get it up and running on Amazon EC2 infrastructure. This includes common tasks such as downloading Apache Solr, acquiring working knowledge of Java, configuring environment variables, deploying Solr on a server, configuring it properly, understanding the admin commands, applying patches, tuning performance and upgrading to newer versions. This is just the start: when your application grows rapidly, you need to factor in high availability, scalability and partitioning for the search tier as well. Application and IT teams need to be aware of the replication and sharding technologies of Apache Solr and configure them as needed. On the other hand, Amazon CloudSearch is a fully managed search service which offloads the administrative burden of operating your search tier. You can get started with Amazon CloudSearch in a few clicks using the AWS Management Console. Customers need not worry about hardware provisioning, data partitioning, running out of disk space, planning compute capacity or software patches.
Advantage: Amazon CloudSearch
Feature Weight: High

Multilingual Support:
Apache Solr has multilingual support. Language-specific analyzers and tokenizers have to be configured, or written and plugged in where the bundled ones are not sufficient, for this functionality. One of the recommended approaches for multilingual search in Apache Solr is to have a multi-core architecture with each core addressing one language.
Currently, Amazon CloudSearch supports only the English language for tokenizing words in the index. Though multilingual support is not critical, it is a good-to-have feature for applications which offer localized services to a worldwide audience.
Advantage: Apache Solr
Feature Weight: Low
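
The one-core-per-language approach mentioned above can be sketched as a small routing function in the search application. This is only an illustration: the host, the core names and the fallback language are assumptions made up for this example, not values from any real deployment.

```python
from urllib.parse import urlencode

# Hypothetical host and core names: one Solr core per language.
SOLR_BASE_URL = "http://localhost:8983/solr"
CORE_BY_LANGUAGE = {"en": "products_en", "de": "products_de", "fr": "products_fr"}
DEFAULT_LANGUAGE = "en"  # assumed fallback when a language has no core


def select_url(query, language):
    """Build the select URL for the core that indexes the given language."""
    core = CORE_BY_LANGUAGE.get(language, CORE_BY_LANGUAGE[DEFAULT_LANGUAGE])
    params = urlencode({"q": query, "wt": "json"})
    return f"{SOLR_BASE_URL}/{core}/select?{params}"


german_url = select_url("schuhe", "de")
fallback_url = select_url("zapatos", "es")  # no Spanish core, falls back to English
```

Each core can then carry its own analyzer chain for its language, while the application only decides which core to query.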

Faceted Search:
Faceting is one of the important features used in ecommerce website search modules. Faceting allows you to categorize your results into sub-groups, which can be used as the basis for another search. In recent times, faceting has gained popularity by allowing users to narrow down search results in an easy-to-use and efficient way.
Faceting can be best explained with the help of a picture (see the figure below). As you can observe on the left side of the figure, a search for “java programming” results in a lot of hits. You can clearly see that the search resulted in 3 facets (or sub-groups) which you can use to narrow down your search. For example, if you click on “PDF” in the “Format” facet (see “Facet 2” in the figure), the search query now essentially means “java programming AND only PDF format”, thereby narrowing down the search space and eventually leading you to better and more convenient results. You can also observe that each member of a facet is accompanied by a number called the facet count. In the “Format” facet, you can see “PDF (14)”, which means that there are 14 “java programming” results in PDF format. The important aspect of this feature is that as you go deeper using facets, the resultant search space is vastly reduced, and hence the search will be considerably faster.
Both Apache Solr and Amazon CloudSearch allow the user to perform faceting with minimal effort.

Advantage: Neutral
Feature Weight: High
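
To make facet counts and drill-down concrete, here is a minimal in-memory sketch. It does not call Solr or CloudSearch; the documents, field names and helper functions are hypothetical and exist only to show what the engine computes for us.

```python
from collections import Counter

# Hypothetical search hits for "java programming"; the fields are illustrative only.
HITS = [
    {"title": "Java Programming Basics", "format": "PDF", "language": "English"},
    {"title": "Effective Java Notes", "format": "PDF", "language": "English"},
    {"title": "Java Programming Video Course", "format": "Video", "language": "English"},
    {"title": "Java Programmierung", "format": "HTML", "language": "German"},
]


def facet_counts(hits, field):
    """Count how many hits fall under each value of one facet field."""
    return Counter(hit[field] for hit in hits)


def drill_down(hits, field, value):
    """Keep only the hits whose facet field matches the selected value."""
    return [hit for hit in hits if hit[field] == value]


format_facet = facet_counts(HITS, "format")          # counts shown next to each format
pdf_hits = drill_down(HITS, "format", "PDF")          # user clicked "PDF"
language_facet_after_pdf = facet_counts(pdf_hits, "language")
```

In Apache Solr, the same idea is typically expressed with the `facet=true` and `facet.field` request parameters, and a click on a facet value is sent back as a filter query (`fq`).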

Field Weighting / Boosting:
Field weighting is the process of assigning different levels of prominence to the same word when it is present in different places in a document. For example, when the phrase “Harry Potter” is present in the title of a document, the document is ranked higher than when the same phrase is present only in the References section.
Both Apache Solr and Amazon CloudSearch allow field boosting with minimal effort.
Advantage: Neutral
Feature Weight: High
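
The effect of field weighting can be illustrated with a toy scorer. This is a simplified sketch and not how either engine computes relevance; the weights, field names and documents are assumptions chosen for the example.

```python
# Hypothetical field weights: a match in the title counts far more than
# a match in the references section.
FIELD_WEIGHTS = {"title": 5.0, "body": 1.0, "references": 0.2}


def weighted_score(document, phrase, weights=FIELD_WEIGHTS):
    """Sum the weights of every field that contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return sum(
        weight
        for field, weight in weights.items()
        if needle in document.get(field, "").lower()
    )


docs = [
    {"id": "b", "title": "A Survey of Fantasy Novels", "references": "Rowling, Harry Potter series"},
    {"id": "a", "title": "Harry Potter and the Goblet of Fire", "references": ""},
]

# Highest weighted score first: the title match outranks the references match.
ranked = sorted(docs, key=lambda d: weighted_score(d, "Harry Potter"), reverse=True)
```

In Apache Solr, comparable weights are commonly supplied to the dismax query parser through the `qf` parameter, for instance `qf=title^5 body`.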

Auto Suggest:
In many search boxes, when a user types a search query, suggestions of popular queries relevant to the input are presented. The suggestion list is also refined as the user types additional characters. This feature is called Auto Suggest. It can be implemented at the search engine level or at the search application level.
Apache Solr has native support for the autosuggest feature. It can be implemented in several ways, using NGramFilterFactory, EdgeNGramFilterFactory or TermsComponent. This feature of Apache Solr is usually used in conjunction with jQuery to create a powerful auto suggestion experience in applications.
Amazon CloudSearch currently has no direct support for the autosuggest feature. We have to implement it in our search application tier.
Advantage: Apache Solr
Feature Weight: Medium
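
When autosuggest has to live in the application tier, one possible approach is a prefix index built from edge n-grams of popular queries. The sketch below is a minimal in-memory version under that assumption; the class, the function names and the query log are hypothetical, and a production version would need persistence and size limits.

```python
from collections import defaultdict


def edge_ngrams(text, min_len=1, max_len=20):
    """Return the leading prefixes of text, e.g. 'sol' -> ['s', 'so', 'sol']."""
    text = text.lower()
    return [text[:n] for n in range(min_len, min(len(text), max_len) + 1)]


class PrefixSuggester:
    """Maps every edge n-gram of a known query to that query and its popularity."""

    def __init__(self):
        self._index = defaultdict(dict)

    def add(self, query, popularity):
        for gram in edge_ngrams(query):
            self._index[gram][query] = popularity

    def suggest(self, typed, limit=5):
        candidates = self._index.get(typed.lower(), {})
        # Most popular first; ties are broken alphabetically for stable output.
        ordered = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
        return [query for query, _ in ordered[:limit]]


suggester = PrefixSuggester()
# Hypothetical query log: (query text, number of times it was searched).
for query, count in [("solr tutorial", 120), ("solr sharding", 45), ("solar panels", 300)]:
    suggester.add(query, count)
```

Calling `suggester.suggest("sol")` ranks all three queries by popularity, while `suggester.suggest("solr")` narrows the list to the two Solr queries, which mirrors how the list is refined as the user keeps typing.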

Geospatial Search:
Consider an example where a user performs a search for “Starbucks”: the search engine module must show the nearest outlet based on the user’s current location. Such location-aware searches produce significantly better results and help the user find the right information more effectively and efficiently. This use case shows the importance of geospatial search. In today’s mobile world, it is an important feature in many location-aware business applications.
Apache Solr supports geospatial search through the implementation class solr.LatLonType. Actions such as sorting the results by distance and boosting documents by distance can be performed.
Amazon CloudSearch has a very limited geospatial search feature set. As of now, it can return documents within a specific area. Missing features include sorting by geographical distance and faceting by distance.
Advantage: Apache Solr
Feature Weight:  Medium
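
Where the search service can only return documents within an area, sorting by distance can be done in the application tier. The sketch below uses the haversine formula for that purpose; the outlet names and coordinates are hypothetical, and the Earth radius is the usual mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(outlets, user_lat, user_lon):
    """Return the outlets ordered from nearest to farthest from the user."""
    return sorted(
        outlets,
        key=lambda o: haversine_km(user_lat, user_lon, o["lat"], o["lon"]),
    )


# Hypothetical outlets returned by an area search (approximate city coordinates).
OUTLETS = [
    {"name": "Outlet Mumbai", "lat": 19.08, "lon": 72.88},
    {"name": "Outlet Chennai", "lat": 13.06, "lon": 80.25},
    {"name": "Outlet Bangalore", "lat": 12.97, "lon": 77.59},
]

nearest_first = sort_by_distance(OUTLETS, user_lat=13.08, user_lon=80.27)
```

In Apache Solr, the equivalent ordering is usually requested from the engine itself, for example with `sort=geodist() asc` together with the `sfield` and `pt` parameters.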

“Find Similar” feature:
The search engine suggests similar records based on a particular record. This is similar to the “Find Similar” resumes feature used by popular job search engines. Ecommerce sites also benefit from this feature, as research suggests that users typically compare products before making a transaction and are likely to buy the product they judge to be better. Apache Solr implements this feature using handlers/components such as MoreLikeThisHandler or MoreLikeThisComponent.
Amazon CloudSearch currently does not support this feature.
Advantage: Apache Solr
Feature Weight: High 
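
If the search service offers no “Find Similar” feature, a rough substitute can be built in the application. The sketch below ranks records by cosine similarity of their term counts. It is a deliberately simplified stand-in and not the algorithm used by Solr's MoreLikeThis components; the resumes and function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count its word occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 when either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def find_similar(target_id, records, limit=3):
    """Rank the other records by similarity to the target, dropping zero-score ones."""
    target = term_vector(records[target_id])
    scored = [
        (cosine_similarity(target, term_vector(text)), other_id)
        for other_id, text in records.items()
        if other_id != target_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for score, other_id in scored[:limit] if score > 0]


# Hypothetical resume summaries keyed by record id.
RESUMES = {
    "r1": "senior java developer with solr and lucene experience",
    "r2": "java developer experienced in lucene search",
    "r3": "graphic designer skilled in photoshop",
    "r4": "java engineer",
}

similar_to_r1 = find_similar("r1", RESUMES)
```

A real implementation would also weight rare terms more heavily than common ones, which this sketch leaves out for brevity.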

1 comment:

Unknown said...

For "configuration and getting started," cloud hosted Solr services, such as my own can really help swing in Solr's favor on that front.

