Author: Rafal Kuc
Reviewed by: Harish Ganesan and Vijay Olety
Book URL: http://www.packtpub.com/apache-solr-4-cookbook/book
I recently read this book and I am really impressed! It provides a good understanding of Apache Solr for both developers and consultants.
The book starts off well with an introduction to Apache Solr, the web / app servers required, the role of Zookeeper, why clustering your data is vital, the various directory implementations, performance-oriented caching mechanisms, a sample crawler module which, coupled with Solr, gives a complete end-to-end solution, the role of Apache Tika as an extraction toolkit and the ease of customizing Solr. From then on, the book delves into the details.
The first step is indexing. It plays a vital part in the entire search solution. Data can be in the form of .txt, .pdf or any other format. It is imperative that all such formats are easily indexable. One of the widely used tools for extracting metadata and detecting language is Apache Tika. Data can also be present in a database, for which the Data Import Handler (DIH) is handy. It comes in two variants: full and delta. Every detail is nicely explained with examples, which can shorten development time. DIH also lets us modify the data while importing, which I felt is a pretty neat feature! One of the nicest features included in Solr 4 is the ability to update a single field in a document. I am not sure why this was not included in the earlier versions, but it’s a classic case of better late than never.
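To give a feel for the single-field update, here is a minimal sketch of the JSON body such a request carries. It is my own illustration, not an example taken from the book: the document id "book-1", the field name "price" and the helper name are hypothetical, and I am assuming the unique key field is called "id". As far as I know, Solr also needs the update log enabled and the fields stored for this to work.

```python
import json

def build_atomic_update(doc_id, field, value):
    """Build a JSON body that changes one field of one document.

    The {"set": value} modifier asks Solr to replace only the named
    field instead of re-indexing the whole document.
    """
    # Assumption: the schema's unique key field is named "id".
    return json.dumps([{"id": doc_id, field: {"set": value}}])

# Hypothetical document id and field name, for illustration only.
payload = build_atomic_update("book-1", "price", 29.99)
print(payload)
```

The resulting string would be sent to the collection's update handler with a JSON content type; the exact URL depends on your deployment.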
The next step in the pipeline is data analysis, which is achieved through the use of analyzers and tokenizers. Use cases include eliminating HTML and XML tags, copying the contents of one field to another and stemming words, amongst others. The detail that has gone into explaining every concept, the examples and the associated step-by-step explanations are really helpful.
Now that the data is indexed and the data preparation is completed, it’s time to query Apache Solr! Searches can be performed on individual words or on a phrase. You can boost or elevate certain documents over others based on your requirements. From simple concepts such as sorting and faceting of results to complex ones such as ignoring typos using n-grams and detecting duplicates, everything is very simple to understand and perform. Faceting, in particular, is gaining momentum as it helps in implementing the auto-suggest feature and narrowing down the search criteria. A newly introduced feature called pivot faceting was much needed, and it vastly simplifies certain use cases related to faceting. Solr provides immense capabilities when it comes to querying, and this book explains each of them in great detail using real-world examples.
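As a rough sketch of what a pivot facet request looks like, the snippet below builds a select URL with the facet.pivot parameter. Again, this is my own illustration rather than the book's: the host, the collection name "collection1", the helper name and the field names "category" and "in_stock" are all assumptions.

```python
from urllib.parse import urlencode

def build_pivot_facet_query(base_url, pivot_fields, query="*:*"):
    """Build a select URL that asks for a pivot (nested) facet."""
    params = {
        "q": query,                               # match-all query by default
        "rows": 0,                                # only facet counts, no documents
        "facet": "true",                          # switch faceting on
        "facet.pivot": ",".join(pivot_fields),    # e.g. category, then in_stock
        "wt": "json",                             # ask for a JSON response
    }
    return base_url + "?" + urlencode(params)

# Hypothetical host, collection and field names, for illustration only.
url = build_pivot_facet_query(
    "http://localhost:8983/solr/collection1/select",
    ["category", "in_stock"],
)
print(url)
```

With two fields in the pivot, the response nests the counts of the second field under each value of the first, which is what previously took several separate facet queries.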
We indexed and queried the data. But as our application scales, we have to get our hands dirty and start fine-tuning performance in order to give a good user experience to our customers. This is where caches, with their various flavors and granularities, start to make sense. Caching always plays a major role in any deployment, and it is necessary to monitor Solr at all times to gauge its performance. This book has done a great job of clearly explaining the various types of caches, the commit operation and its impact on searchers, and how to overcome these issues. This topic is really important for any real-world Solr deployment, and this book has not let me down!
Apache Solr 4.0 introduced the most-awaited SolrCloud feature that allows us to use distributed indexing and searching. Setting up a SolrCloud cluster along with a Zookeeper ensemble to enable replication, fault tolerance and high availability, along with disaster recovery, is a piece of cake now. I really appreciate the time and effort spent on documenting and explaining how to set up two collections inside a single cluster. It was a nightmare to find information on this particular topic when we were implementing SolrCloud for one of our customers, but I rest assured that others referring to this book will save precious time. Adding / deleting nodes from a cluster is no longer a tedious task, as the entire process is automated through the presence of Zookeeper nodes. The in-depth knowledge of the author in these topics is clearly visible and is of great help to all readers. A touch on Zookeeper rolling restarts, though off-topic, might give readers a complete bird's-eye view of the entire cluster. Certain features such as soft commit and NRT search are explained in detail later (under Real-life Situations), but I felt that at least a mention earlier on would have provided much-needed continuity in that section. For geeky readers like me, a detailed description of load balancing across shards and replicas and their customizations, if any, would have added extra spice to this well-cooked food!
As with any other tool, a Solr deployment too will run into some kind of problem. This section details the common problems that are encountered and effective ways to overcome them. Shrinking the size of the index and allocating enough memory in advance are among the solutions that are explained in detail and clearly documented in this book.
Lastly, as every developer would want, the real-world scenarios are described, and the various Solr concepts that were explained in earlier sections are put together as part of a complete end-to-end solution.
Anyone trying Solr 4.0 must read this book in its entirety before recommending a Solr production architecture. As mentioned above, there are a few suggestions which, if incorporated in this book, would benefit readers. All in all, this book will be really helpful for developers and consultants alike!