Tuesday, February 21, 2012

Configuring Multi Node Hadoop Cluster - Cloudera Distribution

This article explains the detailed steps to configure the Cloudera distribution of Hadoop (CDH3) in cluster mode using multiple Hadoop slaves and a single Hadoop master.
OS & Tools used in this setup:
  • OS: Ubuntu – 11.04
  • JVM: Sun JDK – 1.6.0_26
  • Hadoop: CDH3 (Cloudera’s Distribution – Apache Hadoop)
Note: Identify the machines on which to set up CDH3 in cluster mode. We have used 4 servers (2 Ubuntu & 2 Debian servers – 1 machine as hadoop master, 3 machines as hadoop slaves) in this example setup.
Our Setup:
1 hadoop master => ubuntu-server
3 hadoop slaves => ubuntu1-xen, debian1-xen, debian2-xen

1. Prerequisites



Note: Follow the steps explained below in all the identified machines (both master and all the slaves – in our case, ubuntu-server, ubuntu1-xen, debian1-xen, debian2-xen machines).
Step-1: Follow the instructions in this link.
Step-2: If the identified machines are on the same network and can be reached using DNS (qualified names), skip this step. Otherwise, edit the /etc/hosts file on all the identified machines and update it with the host information of all the identified machines. The changes that we made for our setup are shown below…
user1@ubuntu-server:~$ sudo vim /etc/hosts
user1@ubuntu1-xen:~$ sudo vim /etc/hosts
user1@debian1-xen:~$ sudo vim /etc/hosts
user1@debian2-xen:~$ sudo vim /etc/hosts
We used the following “hosts” information in our setup:
192.168.---.--- ubuntu-server
192.168.---.--- ubuntu1-xen
192.168.---.--- debian1-xen
192.168.---.--- debian2-xen
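To quickly verify that these entries resolve, a simple ping between the machines should succeed (the host names below are from our setup; substitute your own):
user1@ubuntu-server:~$ ping -c 2 ubuntu1-xen
user1@ubuntu-server:~$ ping -c 2 debian1-xen
user1@ubuntu-server:~$ ping -c 2 debian2-xen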

2. Setup CDH3



Note: Follow the steps explained below in all the identified machines (both master and all the slaves – in our case, ubuntu-server, ubuntu1-xen, debian1-xen, debian2-xen machines).
Follow the instructions in this link.

3. Configure CDH3 in Fully Distributed (or Cluster) Mode




Note1: Follow the steps explained below in all the identified machines (both master and all the slaves – in our case, ubuntu-server, ubuntu1-xen, debian1-xen, debian2-xen machines).
Note2: The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop-0.20/conf.
Step-1: Copy the default Hadoop configuration and create a new configuration for the cluster setup as shown below.
user1@ubuntu-server:~$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster
Step-2: Install the newly created configuration folder using the alternatives framework.
user1@ubuntu-server:~$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
Step-3: Check whether the configuration is set to the newly created cluster conf.
user1@ubuntu-server:~$ sudo update-alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - auto mode
  link currently points to /etc/hadoop-0.20/conf.cluster
/etc/hadoop-0.20/conf.cluster - priority 50
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
Current 'best' version is '/etc/hadoop-0.20/conf.cluster'.
If it points to /etc/hadoop-0.20/conf.cluster, then the cluster configuration is active. If it points to something else and you want to set the conf manually, use the following command to switch to the cluster configuration.
user1@ubuntu-server:~$ sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster
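As an additional sanity check (assuming the standard alternatives layout), the /etc/hadoop-0.20/conf link should now resolve to the cluster directory:
user1@ubuntu-server:~$ readlink -f /etc/hadoop-0.20/conf
/etc/hadoop-0.20/conf.cluster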
Step-4: Edit the config file – /etc/hadoop-0.20/conf.cluster/core-site.xml as shown below.
user1@ubuntu-server:~$ sudo vim /etc/hadoop-0.20/conf.cluster/core-site.xml
Update the file with the below content:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://ubuntu-server:10818</value>
    </property>
</configuration>
Property: fs.default.name
Description: The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
Default: file:///
Our Value: hdfs://ubuntu-server:10818/
Step-5: Edit the config file – /etc/hadoop-0.20/conf.cluster/hdfs-site.xml as shown below.
user1@ubuntu-server:~$ sudo vim /etc/hadoop-0.20/conf.cluster/hdfs-site.xml
Update the file with the below content:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/var/opt/cdh3/cluster/dfs/nn</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/var/opt/cdh3/cluster/dfs/dn</value>
    </property>
</configuration>
Property: dfs.replication
Description: Default block replication.
Default: 3
Our Value: 3
Property: dfs.name.dir
Description: Determines where on the local filesystem the DFS name node should store the name table (fsimage).
Default: ${hadoop.tmp.dir}/dfs/name
Our Value: /var/opt/cdh3/cluster/dfs/nn
Property: dfs.data.dir
Description: Determines where on the local filesystem a DFS data node should store its blocks.
Default: ${hadoop.tmp.dir}/dfs/data
Our Value: /var/opt/cdh3/cluster/dfs/dn
Step-6: Edit the config file – /etc/hadoop-0.20/conf.cluster/mapred-site.xml as shown below.
user1@ubuntu-server:~$ sudo vim /etc/hadoop-0.20/conf.cluster/mapred-site.xml
Update the file with the below content:
<configuration>
    <property>        
        <name>mapred.job.tracker</name>
        <value>ubuntu-server:10814</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/var/opt/cdh3/cluster/mapred/local</value>
    </property>
</configuration>
Property: mapred.job.tracker
Description: The host and port that the MapReduce job tracker runs at. If “local” (standalone mode), jobs are run in-process as a single map and reduce task.
Default: local
Our Value: ubuntu-server:10814
Property: mapred.local.dir
Description: The local directory where MapReduce stores intermediate data files.
Default: ${hadoop.tmp.dir}/mapred/local
Our Value: /var/opt/cdh3/cluster/mapred/local
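Optionally, if the libxml2 command-line tools are installed (package libxml2-utils on Ubuntu/Debian), the edited files can be checked for well-formed XML; xmllint prints nothing when a file is valid:
user1@ubuntu-server:~$ xmllint --noout /etc/hadoop-0.20/conf.cluster/core-site.xml
user1@ubuntu-server:~$ xmllint --noout /etc/hadoop-0.20/conf.cluster/hdfs-site.xml
user1@ubuntu-server:~$ xmllint --noout /etc/hadoop-0.20/conf.cluster/mapred-site.xml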
Step-7: To set up the /var/opt/cdh3/*/* folders used in the Step-5 & Step-6 configuration, run the commands below…
user1@ubuntu-server:~$ cd /var/opt
user1@ubuntu-server:/var/opt$ sudo mkdir -p cdh3/cluster/dfs/nn
user1@ubuntu-server:/var/opt$ sudo mkdir -p cdh3/cluster/dfs/dn
user1@ubuntu-server:/var/opt$ sudo mkdir -p cdh3/cluster/mapred/local
user1@ubuntu-server:/var/opt$ sudo chown -R root:hadoop cdh3
user1@ubuntu-server:/var/opt$ sudo chown -R hdfs:hadoop cdh3/cluster/dfs
user1@ubuntu-server:/var/opt$ sudo chown -R mapred:hadoop cdh3/cluster/mapred
user1@ubuntu-server:/var/opt$ sudo chmod -R 700 cdh3/cluster/dfs
user1@ubuntu-server:/var/opt$ sudo chmod -R 755 cdh3/cluster/mapred
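To confirm the result, a recursive listing should show the dfs directories owned by hdfs:hadoop and the mapred directories owned by mapred:hadoop:
user1@ubuntu-server:/var/opt$ ls -lR cdh3/cluster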

4. Setup CDH3 Master



Note: Follow the steps explained below in the master machine (in our case – ubuntu-server).
Step-1: Install the namenode, secondarynamenode and jobtracker daemons on the master (in our case – the ubuntu-server machine).
user1@ubuntu-server:~$ sudo apt-get install hadoop-0.20-namenode
user1@ubuntu-server:~$ sudo apt-get install hadoop-0.20-secondarynamenode
user1@ubuntu-server:~$ sudo apt-get install hadoop-0.20-jobtracker
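If preferred, the three packages can also be installed with a single apt-get command:
user1@ubuntu-server:~$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker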

5. Setup CDH3 Slaves



Note: Follow the steps explained below in all the identified slave machines (in our case – ubuntu1-xen, debian1-xen, debian2-xen machines).
Step-1: Install the datanode and tasktracker daemons as shown below…
user1@ubuntu1-xen:~$ sudo apt-get install hadoop-0.20-datanode
user1@ubuntu1-xen:~$ sudo apt-get install hadoop-0.20-tasktracker
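As with the master, the two packages can be installed in a single command if preferred:
user1@ubuntu1-xen:~$ sudo apt-get install hadoop-0.20-datanode hadoop-0.20-tasktracker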

6. Run CDH3 Cluster



Note: Follow the instructions in each step on the machine(s) indicated.
Step-1: Go to the master machine (in our case, the ubuntu-server machine) and format the namenode as the hdfs user.
user1@ubuntu-server:~$ sudo -u hdfs hadoop namenode -format
11/12/06 20:39:17 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu-server/192.168.25.42
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2-cdh3u2
STARTUP_MSG:   build = file:///tmp/nightly_2011-10-13_20-02-02_3/hadoop-0.20-0.20.2+923.142-1~maverick -r 95a824e4005b2a94fe1c11f1ef9db4c672ba43cb; compiled by 'root' on Thu Oct 13 21:52:18 PDT 2011
************************************************************/
Re-format filesystem in /var/opt/cdh3/hadoop/cluster/dfs/nn ? (Y or N) Y
11/12/06 20:39:25 INFO util.GSet: VM type       = 32-bit
11/12/06 20:39:25 INFO util.GSet: 2% max memory = 17.77875 MB
11/12/06 20:39:25 INFO util.GSet: capacity      = 2^22 = 4194304 entries
11/12/06 20:39:25 INFO util.GSet: recommended=4194304, actual=4194304
11/12/06 20:39:25 INFO namenode.FSNamesystem: fsOwner=hdfs
11/12/06 20:39:25 INFO namenode.FSNamesystem: supergroup=supergroup
11/12/06 20:39:25 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/12/06 20:39:25 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
11/12/06 20:39:25 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/12/06 20:39:25 INFO common.Storage: Image file of size 110 saved in 0 seconds.
11/12/06 20:39:26 INFO common.Storage: Storage directory /var/opt/cdh3/hadoop/cluster/dfs/nn has been successfully formatted.
11/12/06 20:39:26 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu-server/192.168.25.42
************************************************************/
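If the format succeeded, the name directory configured in dfs.name.dir should now contain a current/ folder with the initial image and edits files; a quick check (using our configured path):
user1@ubuntu-server:~$ sudo ls /var/opt/cdh3/cluster/dfs/nn/current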
Step-2: Go to the master (ubuntu-server) and start the namenode, secondarynamenode and jobtracker daemons.
user1@ubuntu-server:~$ sudo service hadoop-0.20-namenode start
user1@ubuntu-server:~$ sudo service hadoop-0.20-secondarynamenode start
user1@ubuntu-server:~$ sudo service hadoop-0.20-jobtracker start
To check if all the started daemons are running, use the jps command as shown below…
user1@ubuntu-server:~$ sudo jps
This should list NameNode, JobTracker, SecondaryNameNode.
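The exact process ids will differ, but the output should look roughly like the following (illustrative only):
user1@ubuntu-server:~$ sudo jps
12345 NameNode
12346 SecondaryNameNode
12347 JobTracker
12348 Jps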
Step-3: Go to all the slave machines (ubuntu1-xen, debian1-xen, debian2-xen) and start the datanode and tasktracker daemons.
user1@ubuntu1-xen:~$ sudo service hadoop-0.20-datanode start
user1@ubuntu1-xen:~$ sudo service hadoop-0.20-tasktracker start
To check if all the started daemons are running, use the jps command as shown below…
user1@ubuntu1-xen:~$ sudo jps
This should list DataNode, TaskTracker.
Step-4: Go to http://[master]:50070 to access HDFS (NameNode) information and go to http://[master]:50030 to access MapReduce (JobTracker) information.
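The web consoles can also be probed from the command line (assuming curl is installed and the daemons are up); an HTTP 200 response indicates the UI is reachable:
user1@ubuntu-server:~$ curl -I http://ubuntu-server:50070
user1@ubuntu-server:~$ curl -I http://ubuntu-server:50030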

