Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed
This post contains instructions for installing Hadoop on Ubuntu. It is a quick, step-by-step tutorial that gives you all the commands, with descriptions, required to install Hadoop in standalone mode (single node), in pseudo-distributed mode (single-node cluster), and in distributed mode (multi-node cluster).
The main goal of this tutorial is to get a "simple" Hadoop installation up and running so that you can play around with the software and learn more about it.
This Tutorial has been tested on:
- Ubuntu Linux (10.04 LTS)
- Hadoop 0.20.2
Prerequisites:
Install Java:
Java 1.6.x (either Sun Java or OpenJDK) is recommended for Hadoop.
1. Add the Canonical Partner Repository to your apt repositories:
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
2. Update the source list
$ sudo apt-get update
3. Install sun-java6-jdk
$ sudo apt-get install sun-java6-jdk
4. After installation, make a quick check whether Sun’s JDK is correctly set up:
user@ubuntu:~$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
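If you prefer OpenJDK over the Sun packages, the equivalent install is sketched below (assuming the stock Ubuntu 10.04 openjdk-6-jdk package):
$ sudo apt-get install openjdk-6-jdk
$ java -version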
Adding a dedicated Hadoop system user:
We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
$ sudo adduser hadoop_admin
Login to hadoop_admin User:
user@ubuntu:~$ su - hadoop_admin
Hadoop Installation:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R hadoop_admin /usr/local/hadoop-0.20.2
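The steps above assume the hadoop-0.20.2.tar.gz tarball is already present in /usr/local. If it is not, it can be downloaded first (a sketch, assuming the 0.20.2 release is still available from the Apache archive at this path):
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz -P /usr/local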
Define JAVA_HOME:
Edit the configuration file /usr/local/hadoop-0.20.2/conf/hadoop-env.sh and set JAVA_HOME to the root of your Java installation (e.g. /usr/lib/jvm/java-6-sun):
$ vi conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Go to your Hadoop installation directory (HADOOP_HOME, i.e. /usr/local/hadoop-0.20.2/) and run:
$ bin/hadoop
It will generate the following output:
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Hadoop setup in standalone mode is completed!
Now let's run some examples:
1. Run Classic Pi example:
$ bin/hadoop jar hadoop-*-examples.jar pi 10 100
2. Run grep example:
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
3. Run word count example:
$ mkdir inputwords
$ cp conf/*.xml inputwords
$ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords
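To inspect the result (a quick check, assuming the job wrote its output to the outputwords directory as above):
$ cat outputwords/*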
If you run into any errors, visit the Hadoop troubleshooting page.
Now that Hadoop runs in standalone mode, let's start Hadoop in pseudo-distributed mode (single-node cluster):
Configuring SSH:
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop_admin user.
user@ubuntu:~$ su - hadoop_admin
hadoop_admin@ubuntu:~$ sudo apt-get install openssh-server openssh-client
hadoop_admin@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop_admin/.ssh/id_rsa):
Created directory '/home/hadoop_admin/.ssh'.
Your identification has been saved in /home/hadoop_admin/.ssh/id_rsa.
Your public key has been saved in /home/hadoop_admin/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop_admin@ubuntu
The key's randomart image is:
[...snipp...]
hadoop_admin@ubuntu:~$
Enable SSH access to your local machine and connect using ssh:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is e7:89:26:49:ae:02:30:eb:1d:75:4f:bb:44:f9:36:29.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 30 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
$
Edit configuration files:
$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
If you set hadoop.tmp.dir to some other path, make sure the hadoop_admin user has read and write permission on that directory (sudo chown hadoop_admin /your/path).
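For example, to use a dedicated directory instead of /tmp (a sketch, using the hypothetical path /app/hadoop/tmp):
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hadoop_admin /app/hadoop/tmp
Then point hadoop.tmp.dir in conf/core-site.xml at /app/hadoop/tmp.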
$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
$ vi conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Formatting the name node:
$ bin/hadoop namenode -format
It will generate the following output:
$ bin/hadoop namenode -format
10/05/10 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/10 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop_admin,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../.../dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
$
STARTING SINGLE-NODE CLUSTER:
$ bin/start-all.sh
It will generate the following output:
hadoop_admin@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-ubuntu.out
hadoop_admin@ubuntu:/usr/local/hadoop$
Check whether the expected Hadoop processes are running with jps:
$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
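If any of these daemons is missing, its log file usually explains why. A quick way to check is to scan the most recent entries for the missing daemon (a sketch, run from HADOOP_HOME and assuming the default logs directory), e.g. for the namenode:
$ tail -n 50 logs/hadoop-*-namenode-*.log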
Hadoop setup in pseudo-distributed mode is completed!
STOPPING SINGLE-NODE CLUSTER:
$ bin/stop-all.sh
It will generate the following output:
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
$
You can run the same set of examples as in standalone mode to check whether your installation is successful.
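Note that in pseudo-distributed mode the jobs read from and write to HDFS rather than the local filesystem, so the input has to be copied into HDFS first. For example, for the grep job (a sketch, assuming the cluster is running and the conf directory is used as input):
$ bin/hadoop dfs -put conf input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ bin/hadoop dfs -cat output/*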
Web based Interface for NameNode
http://localhost:50070
Web based Interface for JobTracker
http://localhost:50030
Web based Interface for TaskTracker
http://localhost:50060
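If these pages do not load, you can confirm from the shell that each daemon's web server is listening (a sketch, assuming curl is installed):
$ curl -I http://localhost:50070/
$ curl -I http://localhost:50030/
$ curl -I http://localhost:50060/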
After running Hadoop in pseudo-distributed mode, let's start Hadoop in distributed mode (multi-node cluster).
Prerequisite: Before starting Hadoop in distributed mode you must set up Hadoop in pseudo-distributed mode, and you need at least two machines, one for the master and another for the slave (you can create more than one virtual machine on a single physical machine).
COMMAND | DESCRIPTION |
---|---|
$ bin/stop-all.sh | Before starting Hadoop in distributed mode, first stop each cluster. Run this command on all machines in the cluster (master and slave). |
$ vi /etc/hosts | Add the IP address of the master (e.g. 192.168.0.1 master) and the IP address of the slave (e.g. 192.168.0.2 slave). Run this command on all machines in the cluster (master and slave). An example layout is sketched after this table. |
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave | Set up passwordless SSH (you must log in with the same user name on all machines). Run this command on the master. |
or $ cat .ssh/id_rsa.pub and append its contents to $HOME/.ssh/authorized_keys on the slave | We can also set up passwordless SSH manually. |
$ vi conf/masters then type master | The conf/masters file defines the master node of our multi-node cluster. Run this command on the master. |
$ vi conf/slaves then type slave | The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. Run this command on all machines in the cluster (master and slave). |
$ vi conf/core-site.xml then type: <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> </property> | Edit the configuration file core-site.xml. Run this command on all machines in the cluster (master and slave). |
$ vi conf/mapred-site.xml then type: <property> <name>mapred.job.tracker</name> <value>master:54311</value> </property> | Edit the configuration file mapred-site.xml. Run this command on all machines in the cluster (master and slave). |
$ vi conf/hdfs-site.xml then type: <property> <name>dfs.replication</name> <value>2</value> </property> | Edit the configuration file hdfs-site.xml. Run this command on all machines in the cluster (master and slave). |
$ vi conf/mapred-site.xml then type: <property> <name>mapred.local.dir</name> <value>${hadoop.tmp.dir}/mapred/local</value> </property> <property> <name>mapred.map.tasks</name> <value>20</value> </property> <property> <name>mapred.reduce.tasks</name> <value>2</value> </property> | Edit the configuration file mapred-site.xml. Run this command on the master. |
$ bin/start-dfs.sh | Start the multi-node cluster. First the HDFS daemons are started: the namenode daemon on the master, and datanode daemons on all slaves. Run this command on the master. |
$ jps | It should give output like this: 14799 NameNode, 15314 Jps, 16977 SecondaryNameNode. Run this command on the master. |
$ jps | It should give output like this: 15183 DataNode, 15616 Jps. Run this command on all slaves. |
$ bin/start-mapred.sh | Next the MapReduce daemons are started: the jobtracker on the master, and tasktracker daemons on all slaves. Run this command on the master. |
$ jps | It should give output like this: 16017 Jps, 14799 NameNode, 15596 JobTracker, 14977 SecondaryNameNode. Run this command on the master. |
$ jps | It should give output like this: 15183 DataNode, 15897 TaskTracker, 16284 Jps. Run this command on all slaves. |
Congratulations, Hadoop setup is completed! | |
http://localhost:50070/ | Web-based interface for the namenode |
http://localhost:50030/ | Web-based interface for the jobtracker |
Now let's run some examples | |
$ bin/hadoop jar hadoop-*-examples.jar pi 10 100 | Run the pi example |
$ bin/hadoop dfs -mkdir input $ bin/hadoop dfs -put conf input $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' $ bin/hadoop dfs -cat output/* | Run the grep example |
$ bin/hadoop dfs -mkdir inputwords $ bin/hadoop dfs -put conf inputwords $ bin/hadoop jar hadoop-*-examples.jar wordcount inputwords outputwords $ bin/hadoop dfs -cat outputwords/* | Run the wordcount example |
$ bin/stop-mapred.sh $ bin/stop-dfs.sh | To stop the daemons, run these commands on the master. |
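As a concrete illustration of the host-related files in the table above, here is what they might look like for a two-machine cluster (a sketch, reusing the example addresses 192.168.0.1 for the master and 192.168.0.2 for the slave):
/etc/hosts (on master and slave):
192.168.0.1 master
192.168.0.2 slave
conf/masters (on the master):
master
conf/slaves (on master and slave):
slave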
Reference: Hadoop in Standalone Mode, Hadoop in Pseudo Distributed Mode & Hadoop in Distributed Mode from our JCG partner Rahul Patodi at the High Performance Computing blog.