Friday, February 24, 2012

About Apache Chukwa

Chukwa is a data collection and analysis framework that works with Hadoop to process and analyze the huge volumes of logs generated. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is a highly flexible tool that makes log analysis, processing, and monitoring easier, especially when handling logs from distributed systems like Hadoop.

Components of Chukwa

Chukwa comprises the following components:

  • Agents that run on each machine to collect the logs generated by various applications.
  • Collectors that receive data from the agents and write it to stable storage – HDFS in the case of Hadoop.
  • MapReduce jobs for parsing and archiving the data.

How Does Chukwa Work?

Agents:
Chukwa agents run on every machine from which logs need to be transferred to Hadoop. The agents collect logs generated at the application layer using adaptors. One agent can host many adaptors, each performing a separate log-collection task.
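The one-agent-many-adaptors relationship can be sketched as follows. This is an illustrative toy model, not the real Chukwa API: the class names echo Chukwa's terminology (e.g. FileTailingAdaptor), but the in-memory "log sources" and method signatures are assumptions made for the example.

```python
class FileTailingAdaptor:
    """Toy adaptor that 'tails' a log source and yields new lines as chunks."""
    def __init__(self, datatype, path):
        self.datatype = datatype
        self.path = path
        self.offset = 0  # lines already shipped, so polling resumes where it left off

    def poll(self, log_lines):
        # The real adaptor reads a file from a byte offset; here we consume
        # an in-memory list of lines for illustration.
        chunks = [(self.datatype, line) for line in log_lines[self.offset:]]
        self.offset = len(log_lines)
        return chunks

class Agent:
    """One agent per machine; it can host many adaptors."""
    def __init__(self):
        self.adaptors = []

    def add_adaptor(self, adaptor):
        self.adaptors.append(adaptor)

    def collect(self, sources):
        # Gather chunks from every adaptor, ready to forward to a collector.
        chunks = []
        for a in self.adaptors:
            chunks.extend(a.poll(sources[a.path]))
        return chunks

agent = Agent()
agent.add_adaptor(FileTailingAdaptor("SysLog", "/var/log/syslog"))
agent.add_adaptor(FileTailingAdaptor("AppLog", "/var/log/app.log"))
out = agent.collect({"/var/log/syslog": ["line1"],
                     "/var/log/app.log": ["lineA", "lineB"]})
```

Each chunk carries its datatype alongside the payload, which is what lets downstream jobs group data meaningfully.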

Collectors:
Collectors do the actual job of collecting logs. As described in the figure above, every application that generates logs has its own adaptor. The adaptor sends the logs to the agent, which in turn forwards them to the collector. The collector saves the logs gathered from the various agents in a data sink file in HDFS. Because the appending is done on the Chukwa side, huge volumes of logs can be processed successfully. Once a data sink file reaches a threshold, it is renamed and moved, and subsequent logs are written sequentially to the next data sink file.
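The collector's append-and-rotate behavior can be sketched as below. Assumptions for the sake of the example: in-memory lists stand in for HDFS files, and the rotation threshold is a simple chunk count rather than Chukwa's actual time/size-based policy.

```python
class Collector:
    """Toy collector: appends chunks to the current sink, rotating at a threshold."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.sink_index = 0
        self.sinks = {0: []}   # sink file number -> chunks appended so far

    def append(self, chunk):
        sink = self.sinks[self.sink_index]
        sink.append(chunk)
        if len(sink) >= self.threshold:
            # Close the current data sink; subsequent chunks go to the next file.
            self.sink_index += 1
            self.sinks[self.sink_index] = []

c = Collector(threshold=2)
for chunk in ["a", "b", "c"]:
    c.append(chunk)
# "a" and "b" filled sink 0, which was then closed; "c" went into sink 1
```

Centralizing the appends this way is what lets many agents feed a small number of sink files without each of them holding an HDFS file open.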

Map Reduce Jobs in Chukwa: 
As data is collected, Chukwa dumps it into sink files in HDFS. By default, these are located in hdfs:///chukwa/logs. If a file name ends in ‘.chukwa’, the collector is still writing to it. Every few minutes, the collector closes the file and renames it to ‘*.done’, marking it as available for processing. Each sink file is a Hadoop sequence file, containing a succession of key-value pairs, with periodic sync markers to facilitate MapReduce access. Chukwa has its own set of MapReduce jobs that do the processing of logs. There are two basic types of jobs:
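The sink-file lifecycle can be demonstrated with ordinary file operations. A local temp directory stands in for hdfs:///chukwa/logs, and the file names are made up for the example; only the ‘.chukwa’ to ‘.done’ convention is taken from the text above.

```python
import os
import tempfile

sink_dir = tempfile.mkdtemp()

# A collector is still writing this sink file ...
open(os.path.join(sink_dir, "20120224_1200.chukwa"), "w").close()

# ... and has finished this one: it was closed and renamed to '.done'.
open(os.path.join(sink_dir, "20120224_1155.chukwa"), "w").close()
os.rename(os.path.join(sink_dir, "20120224_1155.chukwa"),
          os.path.join(sink_dir, "20120224_1155.done"))

# Downstream MapReduce jobs should pick up only the '.done' files.
ready = sorted(f for f in os.listdir(sink_dir) if f.endswith(".done"))
in_progress = sorted(f for f in os.listdir(sink_dir) if f.endswith(".chukwa"))
```

The rename acts as an atomic hand-off: a processing job never sees a half-written file, because the suffix only changes after the collector has closed it.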

Archive Jobs:
The simple archiver is designed to consolidate a large number of data sink files into a small number of archive files, with the contents grouped in a useful way. Archive files, like raw sink files, are in Hadoop sequence file format. Unlike the data sink, however, duplicates have been removed. The simple archiver moves every .done file out of the sink, and then runs a MapReduce job to group the data. Output chunks are placed into files with names of the form hdfs:///chukwa/archive/clustername/Datatype_date.arc. Date corresponds to when the data was collected; Datatype is the datatype of each chunk. If archived data would collide with an existing file name, a new file is created with a disambiguating suffix.
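The grouping and naming scheme described above can be sketched as a plain function. The real archiver is a MapReduce job over sequence files; here chunks are assumed to be (cluster, datatype, date, payload) tuples, and deduplication is a simple membership check, both simplifications for the example.

```python
from collections import defaultdict

def archive_paths(chunks):
    """Group chunks by (cluster, datatype, date) and derive archive file names."""
    groups = defaultdict(list)
    for cluster, datatype, date, payload in chunks:
        # Duplicate records are dropped, as in the simple archiver.
        if payload not in groups[(cluster, datatype, date)]:
            groups[(cluster, datatype, date)].append(payload)
    return {
        "hdfs:///chukwa/archive/%s/%s_%s.arc" % (cluster, datatype, date): payloads
        for (cluster, datatype, date), payloads in groups.items()
    }

out = archive_paths([
    ("clusterA", "SysLog", "20120224", "rec1"),
    ("clusterA", "SysLog", "20120224", "rec1"),   # duplicate, removed
    ("clusterA", "AppLog", "20120224", "rec2"),
])
```

Grouping by datatype and date up front is what makes later analyses cheap: a job that only cares about one datatype reads one file, not the whole sink.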

Demux Jobs:
A key use for Chukwa is processing arriving data, in parallel, using MapReduce. The most common way to do this is with the Chukwa demux framework. As data flows through Chukwa, the demux job is often the first job that runs. By default, Chukwa uses the default TsProcessor. This parser tries to extract the timestamp of the real log statement from the log entry using the ISO8601 date format. If that fails, it uses the time at which the chunk was written to disk (the collector timestamp).
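The parse-or-fall-back behavior can be sketched as follows. The exact format string is an assumption (a log4j-style ISO8601 prefix); the real TsProcessor's accepted formats may differ.

```python
from datetime import datetime

def extract_timestamp(log_line, collector_ts):
    """Try to parse an ISO8601-style timestamp from the start of a log entry;
    fall back to the collector timestamp (when the chunk was written)."""
    try:
        # e.g. "2012-02-24 12:00:00,123 INFO ..." -> parse the first 19 chars
        return datetime.strptime(log_line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return collector_ts

fallback = datetime(2012, 2, 24, 12, 5)
t1 = extract_timestamp("2012-02-24 12:00:00,123 INFO starting up", fallback)
t2 = extract_timestamp("no timestamp here", fallback)
```

The fallback matters for ordering: entries without a parseable timestamp still land somewhere sensible on the timeline, at the cost of reflecting delivery time rather than event time.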


Chukwa Drawbacks

  • Dependency on Hadoop and MapReduce: Chukwa's core mechanisms require Hadoop (HDFS) and MapReduce jobs. The demux functionality in particular runs a MapReduce job internally to compute the key-value pairs.
  • Implementation of Chukwa: Chukwa uses an agent-collector setup that runs with a single collector unless a multi-collector setup is specified. This is a drawback compared to log-analysis systems like Flume, which spawns its own master nodes based on the setup.
  • Support for Compressed Files: Chukwa has no gzip support to compress the data files before or after storing the data in HDFS.
  • Production Release: In two years there have been no known production installations of Chukwa and no commercial support vendors.
  • Documentation Support: The Chukwa wiki is very much out of date, and many queries on the mailing lists go unanswered.

Chukwa Advantages

  • Unlike other systems, Chukwa has a rich metadata model, meaning that semantically-meaningful subsets of data are processed together. This metadata is collected automatically and stored in parallel with data. This eases the development of parallel, scalable MapReduce analyses.
  • Chukwa can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog.
  • Chukwa works with the Hadoop Distributed File System and MapReduce to process its data, so it can easily scale to thousands of nodes in both collection and analysis, and it provides a familiar framework for processing the collected data.
  • The components of Chukwa are pluggable, allowing easy customization and enhancement.
  • It can maintain low latencies while imposing very modest overheads on the Hadoop cluster.
  • In recovering from failures, Chukwa takes advantage of local copies of log files, on the machines where they are generated. This effectively pushes the responsibility for maintaining data out of the monitoring system, and into the local filesystem on each machine.

