Setting up an Apache Hadoop Multi-Node Cluster
We are sharing our experience of installing Apache Hadoop on Linux-based machines in a multi-node setup. We will also share our troubleshooting experience here and update this post in the future.
User creation and other configuration steps
- We start by adding a dedicated Hadoop system user on each node of the cluster.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
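To double-check that the account exists and belongs to the hadoop group, an optional sanity check with a standard Linux command:

$ id hduser

The output should list hadoop among the user's groups.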
- Next we configure SSH (Secure Shell) on all cluster nodes to enable secure data communication.
user@node1:~$ su - hduser
hduser@node1:~$ ssh-keygen -t rsa -P ""
The output will be something like the following:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
.....
- Next we need to enable SSH access to the local machine with this newly created key:
hduser@node1:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Repeat the above steps on all the cluster nodes and test by executing the following:
hduser@node1:~$ ssh localhost
This step is also needed to save local machine’s host key fingerprint to the hduser user’s known_hosts file.
Next we need to edit the /etc/hosts file, in which we put the IP address and hostname of each system in the cluster.
In our scenario we have one master (with IP 192.168.0.100) and one slave (with IP 192.168.0.101).
$ sudo vi /etc/hosts
and we put the values into the hosts file as key-value pairs.
192.168.0.100 master
192.168.0.101 slave
- Providing SSH access
The hduser user on the master node must be able to connect
- to its own user account on the master via ssh master (in this context, not necessarily ssh localhost), and
- to the hduser account of the slave(s) via a password-less SSH login.
So we distribute the SSH public key of hduser@master to all the slaves (in our case we have only one slave; if you have more, run the following command once for each, changing the machine name, i.e. slave, slave1, slave2).
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
Try connecting from master to master and from master to the slave(s), and check that everything works.
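For example, from the master node (using the hostnames defined in /etc/hosts above; the first connection to each host may ask you to confirm its key fingerprint):

hduser@master:~$ ssh master
hduser@master:~$ ssh slave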
Configuring Hadoop
- Let us edit the conf/masters file (only on the master node)
and we enter master into the file.
By doing this we have told Hadoop to start the SecondaryNameNode of our multi-node cluster on this machine (the conf/masters file only controls where secondary NameNodes run).
The primary NameNode and the JobTracker will always be started on the machines where we run bin/start-dfs.sh and bin/start-mapred.sh, respectively.
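A minimal way to write the file, assuming Hadoop is installed under /usr/local/hadoop (adjust the path to your own installation directory):

hduser@master:/usr/local/hadoop$ echo "master" > conf/masters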
- Let us now edit conf/slaves (only on the master node) with
master
slave
This means that we also run a DataNode process on the master machine, where the NameNode is running as well. We can leave the master out of the slave role if we have enough other machines available as DataNodes.
If we have more slaves, we add one host per line, like the following:
master
slave
slave2
slave3
and so on.
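As above, one way to write conf/slaves for our two-node setup, again assuming the installation lives under /usr/local/hadoop:

hduser@master:/usr/local/hadoop$ printf "master\nslave\n" > conf/slaves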
Let's now edit two important files (on all the nodes in our cluster):
- conf/core-site.xml
- conf/hdfs-site.xml
1) conf/core-site.xml
We have to change the fs.default.name parameter, which specifies the NameNode host and port. (In our case this is the master machine.)
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  .....[Other XML values]
</property>
Create a directory into which Hadoop will store its data –
$ mkdir /app/hadoop
We have to ensure the directory is writeable by any user:
$ chmod 777 /app/hadoop
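Note that creating a directory directly under / usually requires root privileges, and this has to be done on every node. If you prefer not to make the directory world-writable, an alternative sketch is to give ownership to the hduser account (using the hadoop group created earlier):

$ sudo mkdir -p /app/hadoop
$ sudo chown hduser:hadoop /app/hadoop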
Modify core-site.xml once again to add the following property:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop</value>
</property>
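Putting the two properties together, conf/core-site.xml might look roughly like the following (any other site-specific properties are omitted here):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop</value>
  </property>
</configuration>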
2) conf/hdfs-site.xml
We have to change the dfs.replication parameter, which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If we set this to a value higher than the number of available slave nodes (more precisely, the number of DataNodes), we will start seeing a lot of “(Zero targets found, forbidden1.size=1)”-type errors in the log files.
The default value of dfs.replication is 3. However, as we have only two nodes available in our scenario, we set dfs.replication to 2.
<property>
  <name>dfs.replication</name>
  <value>2</value>
  .....[Other XML values]
</property>
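Likewise, a minimal conf/hdfs-site.xml for our two-node setup might look like this:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>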
- Let us format the HDFS filesystem via the NameNode.
Run the following command on the master:
bin/hadoop namenode -format
- Let us start the multi node cluster:
Run the command (in our case we will run it on the machine named master):
bin/start-dfs.sh
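If you also want the MapReduce layer (the JobTracker and TaskTrackers mentioned above) running, start it afterwards, again on the master:

bin/start-mapred.sh

With this running, jps on the master should additionally show a JobTracker, and the slaves a TaskTracker.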
Checking the Hadoop Status
After everything has started, run the jps command on all the nodes to check whether everything is running properly.
On the master node, the desired output will be:
$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
On the slave(s):
$ jps
15314 Jps
14880 DataNode
Of course, the process IDs will vary from machine to machine.
Troubleshooting
It might happen that the DataNode does not start on some of our nodes. In that case, check the log file
logs/hadoop-hduser-datanode-.log
on the affected nodes for the exception –
java.io.IOException: Incompatible namespaceIDs
If this exception appears, we need to do the following (a command sketch follows the list) –
- Stop the full cluster, i.e. both MapReduce and HDFS layers.
- Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml. In our case, with hadoop.tmp.dir set to /app/hadoop and dfs.data.dir left at its default, the relevant directory is /app/hadoop/dfs/data.
- Reformat the NameNode. All HDFS data will be lost during the format process.
- Restart the cluster.
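A minimal command sketch of this recovery path (run from the Hadoop installation directory; the data-directory path assumes the default layout under hadoop.tmp.dir = /app/hadoop as configured above):

bin/stop-mapred.sh
bin/stop-dfs.sh
# on the problematic DataNode: remove the old data directory
rm -rf /app/hadoop/dfs/data
# back on the master: reformat HDFS (all HDFS data is lost!)
bin/hadoop namenode -format
bin/start-dfs.sh
bin/start-mapred.sh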
Alternatively, we can manually update the namespaceID of the problematic DataNodes (see the sketch after this list):
- Stop the problematic DataNode(s).
- Edit the value of namespaceID in ${dfs.data.dir}/current/VERSION to match the corresponding value of the current NameNode in ${dfs.name.dir}/current/VERSION.
- Restart the fixed DataNode(s).
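A rough sketch of this manual fix, under the same directory assumptions as above (dfs.name.dir and dfs.data.dir in their default locations under /app/hadoop):

# on the problematic DataNode: stop the DataNode daemon
bin/hadoop-daemon.sh stop datanode
# on the master: note the NameNode's namespaceID
grep namespaceID /app/hadoop/dfs/name/current/VERSION
# on the DataNode: edit the namespaceID line to the same value
vi /app/hadoop/dfs/data/current/VERSION
# restart the DataNode
bin/hadoop-daemon.sh start datanode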
In a follow-up post, Running a MapReduce Job in Apache Hadoop (Multi-node Cluster), we will share our experience of running a MapReduce job based on the standard Apache Hadoop examples.
Resources
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- http://hadoop.apache.org/docs/current/