This article explains the steps to configure the Cloudera distribution of Hadoop (CDH3) in cluster mode, using a single Hadoop master and multiple Hadoop slaves.
OS & Tools used in this setup:
- OS: Ubuntu – 11.04
- JVM: Sun JDK – 1.6.0_26
- Hadoop: CDH3 (Cloudera's Distribution including Apache Hadoop)
Note: Identify the machines on which to set up CDH3 in cluster mode. We used 4 servers in this example setup (2 Ubuntu and 2 Debian servers – 1 machine as the hadoop master and 3 machines as hadoop slaves).
Our Setup:
1 hadoop master => ubuntu-server
3 hadoop slaves => ubuntu1-xen, debian1-xen, debian2-xen
1. Prerequisites
Note: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Step-1: Follow the instructions in this link.
Step-2: If the identified machines are on the same network and can resolve one another by DNS (qualified names), skip this step. Otherwise, edit the /etc/hosts file on all the identified machines and update it with the host information of every machine in the cluster. The changes we made for our setup are shown below.
We used the following /etc/hosts entries in our setup:
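The IP addresses below are placeholders for our environment; replace them with the actual addresses of your own machines. The hostnames match the ones listed under "Our Setup".

# /etc/hosts – same entries on every machine in the cluster (example IPs, replace with your own)
192.168.1.10    ubuntu-server
192.168.1.11    ubuntu1-xen
192.168.1.12    debian1-xen
192.168.1.13    debian2-xen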
2. Setup CDH3
Note: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Follow the instructions in this link.
3. Configure CDH3 in Fully Distributed (or Cluster) Mode
Note1: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Note2: The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components look for their configuration in /etc/hadoop-0.20/conf.
Step-1: Copy the default Hadoop configuration to create a new configuration for the cluster setup, as shown below.
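A typical way to do this with the CDH3 packages is to copy the empty template configuration (conf.empty) that the packages ship with:

$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster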
Step-2: Install the newly created configuration directory using the alternatives framework, as shown below.
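On Ubuntu/Debian this is done with update-alternatives; the alternative name hadoop-0.20-conf and the priority 50 follow the usual CDH3 convention:

$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50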
Step-3: Check whether the active configuration points to the newly created cluster conf, as shown below.
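For example:

$ sudo update-alternatives --display hadoop-0.20-conf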
If it points to /etc/hadoop-0.20/conf.cluster, the cluster configuration is active. If it points to something else and you want to set it manually, use the following command to switch to the cluster configuration.
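For example (using the same hadoop-0.20-conf alternative name as in Step-2):

$ sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster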
Step-4: Edit the config file – /etc/hadoop-0.20/conf.cluster/core-site.xml as shown below.
Update the file with the following content:
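A sketch of the file, built from the property value listed below (the configuration element is the standard Hadoop configuration wrapper):

<?xml version="1.0"?>
<configuration>
  <!-- default file system: the HDFS namenode on the master -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu-server:10818/</value>
  </property>
</configuration>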
Property: fs.default.name
Description: The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
Default: file:///
Our Value: hdfs://ubuntu-server:10818/
Step-5: Edit the config file – /etc/hadoop-0.20/conf.cluster/hdfs-site.xml as shown below.
Update the file with the following content:
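A sketch of the file, built from the property values listed below:

<?xml version="1.0"?>
<configuration>
  <!-- block replication factor -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- where the namenode stores the name table (fsimage) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/opt/cdh3/cluster/dfs/nn</value>
  </property>
  <!-- where the datanodes store their blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/var/opt/cdh3/cluster/dfs/dn</value>
  </property>
</configuration>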
Property: dfs.replication
Description: Default block replication.
Default: 3
Our Value: 3
Property: dfs.name.dir
Description: Determines where on the local filesystem the DFS name node should store the name table (fsimage).
Default: ${hadoop.tmp.dir}/dfs/name
Our Value: /var/opt/cdh3/cluster/dfs/nn
Property: dfs.data.dir
Description: Determines where on the local filesystem a DFS data node should store its blocks.
Default: ${hadoop.tmp.dir}/dfs/data
Our Value: /var/opt/cdh3/cluster/dfs/dn
Step-6: Edit the config file – /etc/hadoop-0.20/conf.cluster/mapred-site.xml as shown below.
Update the file with the following content:
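A sketch of the file, built from the property values listed below:

<?xml version="1.0"?>
<configuration>
  <!-- host and port of the jobtracker on the master -->
  <property>
    <name>mapred.job.tracker</name>
    <value>ubuntu-server:10814</value>
  </property>
  <!-- local directory for intermediate MapReduce data -->
  <property>
    <name>mapred.local.dir</name>
    <value>/var/opt/cdh3/cluster/mapred/local</value>
  </property>
</configuration>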
Property: mapred.job.tracker
Description: The host and port that the MapReduce job tracker runs at. If set to "local" (standalone mode), jobs are run in-process as a single map and reduce task.
Default: local
Our Value: ubuntu-server:10814
Property: mapred.local.dir
Description: The local directory where MapReduce stores intermediate data files.
Default: ${hadoop.tmp.dir}/mapred/local
Our Value: /var/opt/cdh3/cluster/mapred/local
Step-7: To set up the /var/opt/cdh3/*/* directories used in the Step-5 and Step-6 configuration, run the commands below.
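A minimal sketch, assuming the hdfs and mapred users and the hadoop group created by the CDH3 packages; adjust ownership and permissions to your environment:

$ sudo mkdir -p /var/opt/cdh3/cluster/dfs/nn
$ sudo mkdir -p /var/opt/cdh3/cluster/dfs/dn
$ sudo mkdir -p /var/opt/cdh3/cluster/mapred/local
$ sudo chown -R hdfs:hadoop /var/opt/cdh3/cluster/dfs        # namenode/datanode dirs owned by hdfs
$ sudo chown -R mapred:hadoop /var/opt/cdh3/cluster/mapred   # mapred local dir owned by mapred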
4. Setup CDH3 Master
Note: Follow the steps explained below in the master machine (in our case – ubuntu-server).
Step-1: Install the namenode, secondarynamenode, and jobtracker daemons on the master (in our case, the ubuntu-server machine), as shown below.
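Assuming the CDH3 apt repository is already configured (section 2), the daemon packages can be installed with:

$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker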
5. Setup CDH3 Slaves
Note: Follow the steps explained below in all the identified slave machines (in our case – ubuntu1-xen, debian1-xen, debian2-xen machines).
Step-1: Install the datanode and tasktracker daemons as shown below.
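Assuming the CDH3 apt repository is already configured (section 2):

$ sudo apt-get install hadoop-0.20-datanode hadoop-0.20-tasktracker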
6. Run CDH3 Cluster
Note: Follow the instructions in each step on the machine(s) indicated.
Step-1: Go to the master machine (in our case, the ubuntu-server machine) and format the namenode as the hdfs user, as shown below.
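For example (this erases any existing HDFS metadata, so run it only on a fresh namenode):

$ sudo -u hdfs hadoop namenode -format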
Step-2: Go to the master (ubuntu-server) and start the namenode, secondarynamenode, and jobtracker daemons, as shown below.
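The CDH3 packages install an init script for each daemon:

$ sudo /etc/init.d/hadoop-0.20-namenode start
$ sudo /etc/init.d/hadoop-0.20-secondarynamenode start
$ sudo /etc/init.d/hadoop-0.20-jobtracker start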
To check if all the started daemons are running, use the jps command as shown below…
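For example (the process IDs below are only illustrative):

$ sudo jps
4721 NameNode
4839 SecondaryNameNode
4956 JobTracker
5012 Jps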
This should list NameNode, JobTracker, SecondaryNameNode.
Step-3: Go to all the slave machines (ubuntu1-xen, debian1-xen, debian2-xen) and start the datanode and tasktracker daemons, as shown below.
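Again using the init scripts installed by the CDH3 packages:

$ sudo /etc/init.d/hadoop-0.20-datanode start
$ sudo /etc/init.d/hadoop-0.20-tasktracker start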
To check if all the started daemons are running, use the jps command as shown below…
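For example (the process IDs below are only illustrative):

$ sudo jps
3104 DataNode
3217 TaskTracker
3290 Jps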
This should list DataNode, TaskTracker.
Step-4: Go to http://[master]:50070 to access HDFS (NameNode) information and http://[master]:50030 to access MapReduce (JobTracker) information.