Running a MapReduce Job in Apache Hadoop (Multi-Node Cluster)
We will describe here the process of running a MapReduce job in Apache Hadoop on a multi-node cluster. To set up Apache Hadoop on a multi-node cluster, see Setting up Apache Hadoop Multi-Node Cluster.
Before running a job, we have to configure Hadoop as follows on each machine:
- Add the following properties to conf/mapred-site.xml on all the nodes:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
N.B. The last three are optional settings, so they can be omitted.
- The Gutenberg Project
For our MapReduce demo we will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
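For reference, this is essentially what the WordCount job does: the mapper emits a (word, 1) pair for every token it reads, a combiner pre-aggregates those pairs on the map side, and the reducer sums the counts per word. The following is a minimal sketch based on the standard Hadoop WordCount tutorial code (org.apache.hadoop.mapreduce API), not the exact source shipped inside hadoop-examples-1.0.4.jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word; also reused as the combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");  // Job.getInstance(conf, ...) in newer Hadoop versions
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}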
We will use the following e-texts from Project Gutenberg as example inputs:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
- The Art of War by 6th cent. B.C. Sunzi
- The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
- The Devil’s Dictionary by Ambrose Bierce
- Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3
Search for these titles on the Project Gutenberg site, download each e-book as a Plain Text UTF-8 file, and store the files in a local temporary directory of your choice, for example /tmp/gutenberg. Check the files with the following command:
$ ls -l /tmp/gutenberg/
- Next, we start the DFS and MapReduce layers of the cluster:
$ start-dfs.sh
$ start-mapred.sh
Run the jps command on every node to check that the NameNode, DataNodes, JobTracker, and TaskTrackers are all up and running.
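If everything started correctly, the output on the master should look roughly like this (the process IDs will of course differ, and DataNode/TaskTracker appear on the master only if it also acts as a worker); each slave should show at least DataNode and TaskTracker:
$ jps
14799 NameNode
14880 DataNode
15183 SecondaryNameNode
15262 JobTracker
15897 TaskTracker
16284 Jps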
- Next, we copy the local files (here, the downloaded text files) into HDFS:
$ hadoop dfs -copyFromLocal /tmp/gutenberg /Users/hduser/gutenberg
$ hadoop dfs -ls /Users/hduser
If the files are copied successfully, we will see something like the following:
Found 2 items
drwxr-xr-x - hduser supergroup 0 2013-05-21 14:48 /Users/hduser/gutenberg
Furthermore, let us check what is in /Users/hduser/gutenberg on HDFS:
$ hadoop dfs -ls /Users/hduser/gutenberg
Found 7 items
-rw-r--r-- 2 hduser supergroup  336705 2013-05-21 14:48 /Users/hduser/gutenberg/pg132.txt
-rw-r--r-- 2 hduser supergroup  581877 2013-05-21 14:48 /Users/hduser/gutenberg/pg1661.txt
-rw-r--r-- 2 hduser supergroup 1916261 2013-05-21 14:48 /Users/hduser/gutenberg/pg19699.txt
-rw-r--r-- 2 hduser supergroup  674570 2013-05-21 14:48 /Users/hduser/gutenberg/pg20417.txt
-rw-r--r-- 2 hduser supergroup 1540091 2013-05-21 14:48 /Users/hduser/gutenberg/pg4300.txt
-rw-r--r-- 2 hduser supergroup  447582 2013-05-21 14:48 /Users/hduser/gutenberg/pg5000.txt
-rw-r--r-- 2 hduser supergroup  384408 2013-05-21 14:48 /Users/hduser/gutenberg/pg972.txt
- We start our MapReduce Job
Let us run the MapReduce WordCount example:
$ hadoop jar hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
N.B.: This assumes you are already in the HADOOP_HOME directory. If not, then:
$ hadoop jar ABSOLUTE/PATH/TO/HADOOP/DIR/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
Or, if you have installed Hadoop in /usr/local/hadoop, then:
$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
The output will look something like this:
13/05/22 13:12:13 INFO mapred.JobClient:  map 0% reduce 0%
13/05/22 13:12:59 INFO mapred.JobClient:  map 28% reduce 0%
13/05/22 13:13:05 INFO mapred.JobClient:  map 57% reduce 0%
13/05/22 13:13:11 INFO mapred.JobClient:  map 71% reduce 0%
13/05/22 13:13:20 INFO mapred.JobClient:  map 85% reduce 0%
13/05/22 13:13:26 INFO mapred.JobClient:  map 100% reduce 0%
13/05/22 13:13:43 INFO mapred.JobClient:  map 100% reduce 50%
13/05/22 13:13:55 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:13:59 INFO mapred.JobClient:  map 85% reduce 100%
13/05/22 13:14:02 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:14:07 INFO mapred.JobClient: Job complete: job_201305211616_0011
13/05/22 13:14:07 INFO mapred.JobClient: Counters: 26
13/05/22 13:14:07 INFO mapred.JobClient:   Job Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Launched reduce tasks=3
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=118920
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Launched map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     Data-local map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=54620
13/05/22 13:14:07 INFO mapred.JobClient:   File Output Format Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Written=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   FileSystemCounters
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_READ=4151123
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_READ=5882320
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6937084
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   File Input Format Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Read=5881494
13/05/22 13:14:07 INFO mapred.JobClient:   Map-Reduce Framework
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input groups=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Map output materialized bytes=2597630
13/05/22 13:14:07 INFO mapred.JobClient:     Combine output records=178795
13/05/22 13:14:07 INFO mapred.JobClient:     Map input records=115251
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce shuffle bytes=1857123
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce output records=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Spilled Records=463427
13/05/22 13:14:07 INFO mapred.JobClient:     Map output bytes=9821180
13/05/22 13:14:07 INFO mapred.JobClient:     Total committed heap usage (bytes)=1567514624
13/05/22 13:14:07 INFO mapred.JobClient:     Combine input records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     Map output records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     SPLIT_RAW_BYTES=826
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input records=178795
- Retrieving the Job Result
To read the result directly from HDFS without copying it to the local file system:
$ hadoop dfs -cat /Users/hduser/gutenberg-output/part-r-00000
Let us copy the results to the local file system, though:
$ mkdir /tmp/gutenberg-output
$ hadoop dfs -getmerge /Users/hduser/gutenberg-output /tmp/gutenberg-output
$ head /tmp/gutenberg-output/gutenberg-output
We will get output like this:
"'Ample.' 1 "'Arthur!' 1 "'As 1 "'Because 1 "'But,' 1 "'Certainly,' 1 "'Come, 1 "'DEAR 1 "'Dear 2 "'Dearest 1 "'Don't 1 "'Fritz! 1 "'From 1 "'Have 1 "'Here 1 "'How 2
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
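If you do want the result ordered by word frequency, one simple option is to sort the merged local file numerically on its second (count) column, for example (using the path from the getmerge step above):
$ sort -k 2,2 -n -r /tmp/gutenberg-output/gutenberg-output | head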
Resources:
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- http://hadoop.apache.org/docs/current/