Friday, February 24, 2012

About Apache Chukwa

Chukwa is a data collection and analysis framework that works with Hadoop to process and analyze the huge volumes of logs generated. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is a highly flexible tool that makes log analysis, processing, and monitoring easier, especially when handling logs from distributed systems like Hadoop.

Components of Chukwa

Chukwa comprises the following components:

  • Agents that run on each machine to collect the logs generated by various applications.
  • Collectors that receive data from the agents and write it to stable storage – HDFS in the case of Hadoop.
  • MapReduce jobs for parsing and archiving the data.

How Does Chukwa Work?

Agents:
Chukwa agents run on every machine from which logs need to be transferred to Hadoop. The agents collect logs generated at the application layer using adaptors. One agent can host many adaptors, each performing a separate log-collection task.
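The one-agent-many-adaptors relationship can be sketched as follows. This is an illustrative toy model, not the real Chukwa API: the class names echo Chukwa's terminology (e.g. FileTailingAdaptor), but the in-memory "log sources" and method signatures are assumptions made for the example.

```python
class FileTailingAdaptor:
    """Toy adaptor that 'tails' a log source and yields new lines as chunks."""
    def __init__(self, datatype, path):
        self.datatype = datatype
        self.path = path
        self.offset = 0  # lines already shipped, so polling resumes where it left off

    def poll(self, log_lines):
        # The real adaptor reads a file from a byte offset; here we consume
        # an in-memory list of lines for illustration.
        chunks = [(self.datatype, line) for line in log_lines[self.offset:]]
        self.offset = len(log_lines)
        return chunks

class Agent:
    """One agent per machine; it can host many adaptors."""
    def __init__(self):
        self.adaptors = []

    def add_adaptor(self, adaptor):
        self.adaptors.append(adaptor)

    def collect(self, sources):
        # Gather chunks from every adaptor, ready to forward to a collector.
        chunks = []
        for a in self.adaptors:
            chunks.extend(a.poll(sources[a.path]))
        return chunks

agent = Agent()
agent.add_adaptor(FileTailingAdaptor("SysLog", "/var/log/syslog"))
agent.add_adaptor(FileTailingAdaptor("AppLog", "/var/log/app.log"))
out = agent.collect({"/var/log/syslog": ["line1"],
                     "/var/log/app.log": ["lineA", "lineB"]})
```

Each chunk carries its datatype alongside the payload, which is what lets downstream jobs group data meaningfully.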

Collectors:
Collectors do the actual job of collecting logs. As described in the figure above, every application that generates logs has its own adaptor. The adaptor sends the logs to the agent, which in turn forwards them to the collector. The collector saves the logs gathered from the various agents in a data sink file in HDFS. Because the appending is done on the Chukwa side, huge volumes of logs can be processed successfully. Once a data sink file reaches a threshold, it is renamed and moved, and subsequent logs are written sequentially to the next data sink file.
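The collector's append-and-rotate behavior can be sketched as below. Assumptions for the sake of the example: in-memory lists stand in for HDFS files, and the rotation threshold is a simple chunk count rather than Chukwa's actual time/size-based policy.

```python
class Collector:
    """Toy collector: appends chunks to the current sink, rotating at a threshold."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.sink_index = 0
        self.sinks = {0: []}   # sink file number -> chunks appended so far

    def append(self, chunk):
        sink = self.sinks[self.sink_index]
        sink.append(chunk)
        if len(sink) >= self.threshold:
            # Close the current data sink; subsequent chunks go to the next file.
            self.sink_index += 1
            self.sinks[self.sink_index] = []

c = Collector(threshold=2)
for chunk in ["a", "b", "c"]:
    c.append(chunk)
# "a" and "b" filled sink 0, which was then closed; "c" went into sink 1
```

Centralizing the appends this way is what lets many agents feed a small number of sink files without each of them holding an HDFS file open.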

Map Reduce Jobs in Chukwa: 
As data is collected, Chukwa dumps it into sink files in HDFS. By default, these are located in hdfs:///chukwa/logs. If a file name ends in ‘.chukwa’, the collector is still writing to it. Every few minutes, the collector closes the file and renames it to ‘*.done’, marking it as available for processing. Each sink file is a Hadoop sequence file, containing a succession of key-value pairs, with periodic sync markers to facilitate MapReduce access. Chukwa has its own set of MapReduce jobs that do the processing of logs. There are two basic types of jobs:
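The sink-file lifecycle can be demonstrated with ordinary file operations. A local temp directory stands in for hdfs:///chukwa/logs, and the file names are made up for the example; only the ‘.chukwa’ to ‘.done’ convention is taken from the text above.

```python
import os
import tempfile

sink_dir = tempfile.mkdtemp()

# A collector is still writing this sink file ...
open(os.path.join(sink_dir, "20120224_1200.chukwa"), "w").close()

# ... and has finished this one: it was closed and renamed to '.done'.
open(os.path.join(sink_dir, "20120224_1155.chukwa"), "w").close()
os.rename(os.path.join(sink_dir, "20120224_1155.chukwa"),
          os.path.join(sink_dir, "20120224_1155.done"))

# Downstream MapReduce jobs should pick up only the '.done' files.
ready = sorted(f for f in os.listdir(sink_dir) if f.endswith(".done"))
in_progress = sorted(f for f in os.listdir(sink_dir) if f.endswith(".chukwa"))
```

The rename acts as an atomic hand-off: a processing job never sees a half-written file, because the suffix only changes after the collector has closed it.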

Archive Jobs:
The simple archiver is designed to consolidate a large number of data sink files into a small number of archive files, with the contents grouped in a useful way. Archive files, like raw sink files, are in Hadoop sequence file format. Unlike the data sink, however, duplicates have been removed. The simple archiver moves every .done file out of the sink, and then runs a MapReduce job to group the data. Output chunks are placed into files with names of the form hdfs:///chukwa/archive/clustername/Datatype_date.arc. Date corresponds to when the data was collected; Datatype is the datatype of each chunk. If archived data would collide with an existing file name, a new file is created with a disambiguating suffix.
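The grouping and naming scheme described above can be sketched as a plain function. The real archiver is a MapReduce job over sequence files; here chunks are assumed to be (cluster, datatype, date, payload) tuples, and deduplication is a simple membership check, both simplifications for the example.

```python
from collections import defaultdict

def archive_paths(chunks):
    """Group chunks by (cluster, datatype, date) and derive archive file names."""
    groups = defaultdict(list)
    for cluster, datatype, date, payload in chunks:
        # Duplicate records are dropped, as in the simple archiver.
        if payload not in groups[(cluster, datatype, date)]:
            groups[(cluster, datatype, date)].append(payload)
    return {
        "hdfs:///chukwa/archive/%s/%s_%s.arc" % (cluster, datatype, date): payloads
        for (cluster, datatype, date), payloads in groups.items()
    }

out = archive_paths([
    ("clusterA", "SysLog", "20120224", "rec1"),
    ("clusterA", "SysLog", "20120224", "rec1"),   # duplicate, removed
    ("clusterA", "AppLog", "20120224", "rec2"),
])
```

Grouping by datatype and date up front is what makes later analyses cheap: a job that only cares about one datatype reads one file, not the whole sink.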

Demux Jobs:
A key use for Chukwa is processing arriving data, in parallel, using MapReduce. The most common way to do this is with the Chukwa demux framework. As data flows through Chukwa, the demux job is often the first job that runs. By default, Chukwa uses the default TsProcessor. This parser tries to extract the timestamp of the real log statement from the log entry using the ISO8601 date format. If that fails, it uses the time at which the chunk was written to disk (the collector timestamp).
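The parse-or-fall-back behavior can be sketched as follows. The exact format string is an assumption (a log4j-style ISO8601 prefix); the real TsProcessor's accepted formats may differ.

```python
from datetime import datetime

def extract_timestamp(log_line, collector_ts):
    """Try to parse an ISO8601-style timestamp from the start of a log entry;
    fall back to the collector timestamp (when the chunk was written)."""
    try:
        # e.g. "2012-02-24 12:00:00,123 INFO ..." -> parse the first 19 chars
        return datetime.strptime(log_line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return collector_ts

fallback = datetime(2012, 2, 24, 12, 5)
t1 = extract_timestamp("2012-02-24 12:00:00,123 INFO starting up", fallback)
t2 = extract_timestamp("no timestamp here", fallback)
```

The fallback matters for ordering: entries without a parseable timestamp still land somewhere sensible on the timeline, at the cost of reflecting delivery time rather than event time.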


Chukwa Drawbacks

  • Dependency on Hadoop and MapReduce: Chukwa's core mechanisms require Hadoop (HDFS) and MapReduce jobs. The demux functionality in particular runs a MapReduce job internally to compute the key-value pairs.
  • Implementation of Chukwa: Chukwa uses an agent-collector setup that runs with a single collector unless a multi-collector setup is specified. This is a drawback compared to log-analysis systems like Flume, which spawns its own master nodes based on the setup.
  • Support for Compressed Files: Chukwa has no gzip support to compress the data files before or after storing the data in HDFS.
  • Production Release: In two years there have been no known production installations of Chukwa and no commercial support vendors.
  • Documentation Support: The Chukwa wiki is very much out of date, and many queries on the mailing lists go unanswered.

Chukwa Advantages

  • Unlike other systems, Chukwa has a rich metadata model, meaning that semantically-meaningful subsets of data are processed together. This metadata is collected automatically and stored in parallel with data. This eases the development of parallel, scalable MapReduce analyses.
  • Chukwa can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog.
  • Chukwa works with the Hadoop Distributed File System and MapReduce to process its data, so it can easily scale to thousands of nodes in both collection and analysis, and it provides a familiar framework for processing the collected data.
  • The components of Chukwa are pluggable, allowing easy customization and enhancement.
  • It can maintain low latencies while imposing very modest overheads on the Hadoop cluster.
  • In recovering from failures, Chukwa takes advantage of local copies of log files, on the machines where they are generated. This effectively pushes the responsibility for maintaining data out of the monitoring system, and into the local filesystem on each machine.

