Saturday, February 18, 2012

What is Apache Hive?

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies.
Hive was created so that analysts with strong SQL skills (but meager Java programming skills) could run queries against huge volumes of data to extract patterns and meaningful information. It provides an SQL-like language called HiveQL while maintaining full support for MapReduce.
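To give a feel for HiveQL, here is a small sketch using a hypothetical page_views table (the table name and columns are made up for illustration); the query itself is plain SQL-style aggregation that Hive compiles into MapReduce jobs behind the scenes:

```sql
-- Hypothetical table for illustration purposes.
CREATE TABLE page_views (
  view_time TIMESTAMP,
  user_id   STRING,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar SQL-style aggregation; no Java MapReduce code required.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```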
A HiveQL statement is submitted via the Command Line Interface (CLI), the web UI or an external client using the Apache Thrift, ODBC or JDBC interfaces. The driver first passes the query to the Compiler, where it goes through the typical parse, type check and semantic analysis phases, using the metadata stored in the Metastore. An optimized plan in the form of a directed acyclic graph of MapReduce tasks and HDFS tasks is generated. The Execution Engine then executes these tasks in the order of their dependencies, using Hadoop. In short, a HiveQL query is compiled into MapReduce tasks which run on the robust Hadoop framework.
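The simplest of these entry points is the CLI. For example (the query and script name here are placeholders, not from a real deployment):

```shell
# Submit a single HiveQL statement from the command line; the driver,
# compiler and execution engine described above handle the rest.
hive -e "SELECT COUNT(*) FROM page_views;"

# Or submit a whole script of HiveQL statements from a file.
hive -f report.hql
```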
Hive has access to files stored either directly in HDFS or in other data storage systems such as HBase. HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. But the recommended approach is to pump the data from HBase (or any other storage) into Hive and then analyze it there. The benefit is twofold: (a) once the data is in Hive, queries against it will typically run faster, since Hive is optimized for warehousing, and (b) there is no risk of the cluster nodes placing excessive load on the external storage system.
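The HBase access mentioned above works through Hive's HBase storage handler. A minimal sketch, assuming an existing HBase table called "users" with an "info" column family (all names here are hypothetical):

```sql
-- Map an existing HBase table into Hive via the HBase storage handler.
CREATE EXTERNAL TABLE hbase_users (user_id STRING, name STRING, city STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "users");

-- The HBase-backed table can then be joined with a native Hive table:
SELECT u.city, COUNT(*) AS views
FROM page_views v JOIN hbase_users u ON (v.user_id = u.user_id)
GROUP BY u.city;
```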
Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Transferring data using hand-rolled scripts is inefficient and time-consuming. How do we efficiently move data from an external store into Hive? Meet Apache Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision the data from an external system onto HDFS, and populate tables in Hive and HBase. The dataset being transferred is sliced up into partitions, and a map-only job is launched with individual mappers responsible for transferring one slice of the dataset.
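A typical Sqoop import looks like the sketch below, which assumes a reachable MySQL database and a hypothetical "orders" table (connection details and names are illustrative only); --hive-import tells Sqoop to create and populate the corresponding Hive table, and --num-mappers controls how many slices the map-only job uses:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --hive-import
```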



All posts, comments and views expressed in this blog are my own and do not represent the positions or views of my past, present or future employers. The intention of this blog is to share my experience and views. Content is subject to change without notice. While I do my best to quote the original authors or copyright owners wherever I reference them, if you find any of the content or images violating copyright, please let me know and I will act upon it immediately. Lastly, I encourage you to share the content of this blog with other online communities for non-commercial and educational purposes.