This article explains the steps to configure the Cloudera distribution of Hadoop (CDH3) in cluster mode, using a single Hadoop master and multiple Hadoop slaves.
OS & Tools used in this setup:
- OS: Ubuntu – 11.04
- JVM: Sun JDK – 1.6.0_26
- Hadoop: CDH3 (Cloudera's Distribution including Apache Hadoop)
Note: Identify the machines on which to set up CDH3 in cluster mode. We used 4 servers in this example setup (2 Ubuntu and 2 Debian servers – 1 machine as the hadoop master and 3 machines as hadoop slaves).
Our Setup:
1 hadoop master => ubuntu-server
3 hadoop slaves => ubuntu1-xen, debian1-xen, debian2-xen
1. Prerequisites
Note: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Step-1: Follow the instructions in this link.
Step-2: If the identified machines are on the same network and can resolve one another by DNS (qualified names), skip this step. Otherwise, edit the /etc/hosts file on all the identified machines and update it with the host information of every machine in the cluster. The changes we made for our setup are shown below.
We used the following /etc/hosts entries in our setup:
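The IP addresses below are placeholders for our environment; replace them with the actual addresses of your own machines. The hostnames match the ones listed under "Our Setup".

# /etc/hosts – same entries on every machine in the cluster (example IPs, replace with your own)
192.168.1.10    ubuntu-server
192.168.1.11    ubuntu1-xen
192.168.1.12    debian1-xen
192.168.1.13    debian2-xen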
2. Setup CDH3
Note: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Follow the instructions in this link.
3. Configure CDH3 in Fully Distributed (or Cluster) Mode
Note1: Follow the steps explained below on all the identified machines (both the master and all the slaves – in our case, the ubuntu-server, ubuntu1-xen, debian1-xen, and debian2-xen machines).
Note2: The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components look for their configuration in /etc/hadoop-0.20/conf.
Step-1: Copy the default Hadoop configuration to create a new configuration for the cluster setup, as shown below.
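A typical way to do this with the CDH3 packages is to copy the empty template configuration (conf.empty) that the packages ship with:

$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster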
Step-2: Install the newly created configuration directory using the alternatives framework, as shown below.
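On Ubuntu/Debian this is done with update-alternatives; the alternative name hadoop-0.20-conf and the priority 50 follow the usual CDH3 convention:

$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50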
Step-3: Check whether the active configuration points to the newly created cluster conf, as shown below.
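For example:

$ sudo update-alternatives --display hadoop-0.20-conf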
If it points to /etc/hadoop-0.20/conf.cluster, the cluster configuration is active. If it points to something else and you want to set it manually, use the following command to switch to the cluster configuration.
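For example (using the same hadoop-0.20-conf alternative name as in Step-2):

$ sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster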
Step-4: Edit the config file – /etc/hadoop-0.20/conf.cluster/core-site.xml as shown below.
Update the file with the following content:
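A sketch of the file, built from the property value listed below (the configuration element is the standard Hadoop configuration wrapper):

<?xml version="1.0"?>
<configuration>
  <!-- default file system: the HDFS namenode on the master -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu-server:10818/</value>
  </property>
</configuration>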
Property: fs.default.name
Description: The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
Default: file:///
Our Value: hdfs://ubuntu-server:10818/
Step-5: Edit the config file – /etc/hadoop-0.20/conf.cluster/hdfs-site.xml as shown below.
Update the file with the following content:
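A sketch of the file, built from the property values listed below:

<?xml version="1.0"?>
<configuration>
  <!-- block replication factor -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- where the namenode stores the name table (fsimage) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/opt/cdh3/cluster/dfs/nn</value>
  </property>
  <!-- where the datanodes store their blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/var/opt/cdh3/cluster/dfs/dn</value>
  </property>
</configuration>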
Property: dfs.replication
Description: Default block replication.
Default: 3
Our Value: 3
Property: dfs.name.dir
Description: Determines where on the local filesystem the DFS name node should store the name table (fsimage).
Default: ${hadoop.tmp.dir}/dfs/name
Our Value: /var/opt/cdh3/cluster/dfs/nn
Property: dfs.data.dir
Description: Determines where on the local filesystem a DFS data node should store its blocks.
Default: ${hadoop.tmp.dir}/dfs/data
Our Value: /var/opt/cdh3/cluster/dfs/dn
Step-6: Edit the config file – /etc/hadoop-0.20/conf.cluster/mapred-site.xml as shown below.
Update the file with the following content:
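A sketch of the file, built from the property values listed below:

<?xml version="1.0"?>
<configuration>
  <!-- host and port of the jobtracker on the master -->
  <property>
    <name>mapred.job.tracker</name>
    <value>ubuntu-server:10814</value>
  </property>
  <!-- local directory for intermediate MapReduce data -->
  <property>
    <name>mapred.local.dir</name>
    <value>/var/opt/cdh3/cluster/mapred/local</value>
  </property>
</configuration>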
Property: mapred.job.tracker
Description: The host and port that the MapReduce job tracker runs at. If set to "local" (standalone mode), jobs are run in-process as a single map and reduce task.
Default: local
Our Value: ubuntu-server:10814
Property: mapred.local.dir
Description: The local directory where MapReduce stores intermediate data files.
Default: ${hadoop.tmp.dir}/mapred/local
Our Value: /var/opt/cdh3/cluster/mapred/local
Step-7: To set up the /var/opt/cdh3/*/* directories used in the Step-5 and Step-6 configuration, run the commands below.
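A minimal sketch, assuming the hdfs and mapred users and the hadoop group created by the CDH3 packages; adjust ownership and permissions to your environment:

$ sudo mkdir -p /var/opt/cdh3/cluster/dfs/nn
$ sudo mkdir -p /var/opt/cdh3/cluster/dfs/dn
$ sudo mkdir -p /var/opt/cdh3/cluster/mapred/local
$ sudo chown -R hdfs:hadoop /var/opt/cdh3/cluster/dfs        # namenode/datanode dirs owned by hdfs
$ sudo chown -R mapred:hadoop /var/opt/cdh3/cluster/mapred   # mapred local dir owned by mapred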
4. Setup CDH3 Master
Note: Follow the steps explained below in the master machine (in our case – ubuntu-server).
Step-1: Install the namenode, secondarynamenode, and jobtracker daemons on the master (in our case, the ubuntu-server machine), as shown below.
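Assuming the CDH3 apt repository is already configured (section 2), the daemon packages can be installed with:

$ sudo apt-get install hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker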
5. Setup CDH3 Slaves
Note: Follow the steps explained below in all the identified slave machines (in our case – ubuntu1-xen, debian1-xen, debian2-xen machines).
Step-1: Install the datanode and tasktracker daemons as shown below.
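Assuming the CDH3 apt repository is already configured (section 2):

$ sudo apt-get install hadoop-0.20-datanode hadoop-0.20-tasktracker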
6. Run CDH3 Cluster
Note: Follow the instructions in each step on the machine(s) indicated.
Step-1: Go to the master machine (in our case, the ubuntu-server machine) and format the namenode as the hdfs user, as shown below.
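For example (this erases any existing HDFS metadata, so run it only on a fresh namenode):

$ sudo -u hdfs hadoop namenode -format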
Step-2: Go to the master (ubuntu-server) and start the namenode, secondarynamenode, and jobtracker daemons, as shown below.
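The CDH3 packages install an init script for each daemon:

$ sudo /etc/init.d/hadoop-0.20-namenode start
$ sudo /etc/init.d/hadoop-0.20-secondarynamenode start
$ sudo /etc/init.d/hadoop-0.20-jobtracker start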
To check if all the started daemons are running, use the jps command as shown below…
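For example (the process IDs below are only illustrative):

$ sudo jps
4721 NameNode
4839 SecondaryNameNode
4956 JobTracker
5012 Jps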
This should list NameNode, JobTracker, SecondaryNameNode.
Step-3: Go to all the slave machines (ubuntu1-xen, debian1-xen, debian2-xen) and start the datanode and tasktracker daemons, as shown below.
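Again using the init scripts installed by the CDH3 packages:

$ sudo /etc/init.d/hadoop-0.20-datanode start
$ sudo /etc/init.d/hadoop-0.20-tasktracker start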
To check if all the started daemons are running, use the jps command as shown below…
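For example (the process IDs below are only illustrative):

$ sudo jps
3104 DataNode
3217 TaskTracker
3290 Jps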
This should list DataNode, TaskTracker.
Step-4: Go to http://[master]:50070 to access HDFS (NameNode) information and http://[master]:50030 to access MapReduce (JobTracker) information.