Author: Rafal Kuc
Reviewed by : Harish Ganesan and Vijay Olety
Book URL : http://www.packtpub.com/apache-solr-4-cookbook/book
Comments :
The book starts off well with an introduction to Apache Solr, the web / app
servers required, the role of Zookeeper, why clustering your data is vital, the
various directory implementations, performance-oriented caching mechanisms, a
sample crawler module which, coupled with Solr, gives a complete end-to-end
solution, the role of Apache Tika as an extraction toolkit and the ease of
customizing Solr. From then on, the book delves into the details.
The first step is indexing. It plays a vital part in the entire search solution.
Data can be in the form of .txt, .pdf or any other format. It is imperative that
all such formats are easily indexable. One of the widely used tools for metadata
extraction and language detection is Apache Tika. Data can also be present in a
database, for which the Data Import Handler (DIH) is handy. It comes in two
variants: full and delta. Every detail is nicely explained with examples, which
can shorten development time. DIH also lets us modify the data while importing,
which I felt is a pretty neat feature! One of the nicest features included in
Solr 4 is the ability to update a single field in a document. I am not sure why
this was not included in the earlier versions, but it’s a classic case of better
late than never.
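To give a rough idea of what a single-field update looks like: Solr 4 accepts a JSON document in which the field to change carries a modifier such as `set`. The sketch below only builds that request body. It is not taken from the book; the helper name `build_atomic_update`, the field name `price` and the assumption that the unique key field is called `id` are all hypothetical, and sending the body to a running Solr update handler is left out.

```python
import json

def build_atomic_update(doc_id, field, value, modifier="set"):
    """Build a JSON body that changes one field of one document.

    Assumes the schema's unique key field is named "id" (hypothetical).
    """
    if modifier not in ("set", "add", "inc"):
        raise ValueError("unsupported modifier: %s" % modifier)
    # The update is a list of documents; the changed field wraps its
    # new value in a {modifier: value} dictionary.
    return json.dumps([{"id": doc_id, field: {modifier: value}}])

body = build_atomic_update("book-1", "price", 29.99)
print(body)
```

Only the named field is touched; the other fields of the document are meant to stay as they were, which is what makes this cheaper to write than re-sending the whole document.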
The next step in the pipeline is data analysis, which is achieved through the
use of analyzers and tokenizers. Use cases include eliminating HTML and XML
tags, copying the contents of one field to another and stemming words, amongst
others. The level of detail that has gone into explaining every concept, along
with the examples and the associated step-by-step explanations, is really
helpful.
Now that the data is indexed and the data preparation is completed, it’s time to
query Apache Solr! Searches can be performed on individual words or on a phrase.
You can boost or elevate certain documents over others based on your
requirements. Everything from simple concepts such as sorting and faceting of
results to complex ones such as ignoring typos using n-grams and detecting
duplicates is easy to understand and perform. Faceting, in particular, is
gaining momentum as it helps in implementing the auto-suggest feature and
narrowing down the search criteria. The newly introduced pivot faceting feature
was much needed, and it vastly simplifies certain faceting use cases. Solr
provides immense capabilities when it comes to querying, and this book explains
each of them in great detail using real-world examples.
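The idea behind ignoring typos with n-grams can be sketched in a few lines: break each word into overlapping character fragments and compare the fragments instead of the whole words. This is only an illustration of the principle in plain Python, not Solr's implementation and not an example from the book; the function names are hypothetical.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# A misspelling still shares most fragments with the intended word,
# while an unrelated word shares few or none.
print(ngram_similarity("faceting", "facetng"))
print(ngram_similarity("faceting", "tomcat"))
```

A misspelled query term therefore still overlaps heavily with the correct term, which is why matching on n-grams tolerates small typos.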
We have indexed and queried the data. But as our application scales, we have to
get our hands dirty and start fine-tuning performance in order to give a good
user experience to our customers. This is where caches, in their various flavors
and granularities, start to make sense. Caching plays a major role in any
deployment, and it is necessary to monitor Solr at all times to gauge its
performance. This book has done a great job of clearly explaining the various
types of caches, the commit operation and its impact on searchers, and how to
overcome the resulting problems. This topic is really important for any
real-world Solr deployment and this book has not let me down!
Apache Solr 4.0 introduced the long-awaited SolrCloud feature, which allows us
to use distributed indexing and searching. Setting up a SolrCloud cluster along
with a Zookeeper ensemble to enable replication, fault tolerance and high
availability along with disaster recovery is a piece of cake now. I really
appreciate the time and effort spent on documenting and explaining how to set up
two collections inside a single cluster. It was a nightmare to find information
on this particular topic when we were implementing SolrCloud for one of our
customers, but I am assured that others referring to this book will save
precious time. Adding or deleting nodes in a cluster is no longer a tedious
task, as the entire process is automated with the help of Zookeeper nodes. The
author's in-depth knowledge of these topics is clearly visible and is of great
help to all readers. A touch on Zookeeper rolling restarts, though off-topic,
might give readers a complete bird's-eye view of the entire cluster. Certain
features such as soft commit and near-real-time (NRT) search are explained in
detail later (under Real-life Situations), but I felt that at least a mention
earlier on would have provided much-needed continuity in that section. For geeky
readers like me, a detailed description of load balancing across shards and
replicas and their customizations, if any, would have added extra spice to this
well-cooked food!
As with any other tool, a Solr deployment will run into problems at some point.
This section details the common problems that are encountered and effective ways
to overcome them. Shrinking the size of the index and allocating enough memory
in advance are among the solutions that are explained in detail and clearly
documented in this book.
Lastly, as every developer would have wanted, real-world scenarios are described
and the various Solr concepts explained in earlier sections are put together as
part of a complete end-to-end solution.
Anyone trying Solr 4.0 must read this book in its entirety before recommending a
Solr production architecture. As mentioned above, there are a few suggestions
which, if incorporated in this book, would benefit readers. All in all, this
book will be really helpful for developers and consultants alike!