Running a MapReduce Job in Apache Hadoop (Multi-Node Cluster)
We will describe here the process of running a MapReduce job in Apache Hadoop on a multi-node cluster. To set up Apache Hadoop on a multi-node cluster, see Setting up Apache Hadoop Multi-Node Cluster.
Before running a job, we have to configure Hadoop as follows on each machine:
- Add the following properties to conf/mapred-site.xml on all the nodes:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
N.B. The last three are optional settings, so they can be omitted.
- The Gutenberg Project
For our MapReduce demo we will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
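For reference, this is essentially what the WordCount job does: the mapper emits a (word, 1) pair for every token it reads, a combiner pre-aggregates those pairs on the map side, and the reducer sums the counts per word. The following is a minimal sketch based on the standard Hadoop WordCount tutorial code (org.apache.hadoop.mapreduce API), not the exact source shipped inside hadoop-examples-1.0.4.jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word; also reused as the combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");  // Job.getInstance(conf, ...) in newer Hadoop versions
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}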
We will use the following e-texts from Project Gutenberg as example inputs:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
- The Art of War by 6th cent. B.C. Sunzi
- The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
- The Devil’s Dictionary by Ambrose Bierce
- Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3
Search for these titles on the Project Gutenberg site, download each e-book as a Plain Text UTF-8 file, and store the files in a local temporary directory of your choice, for example /tmp/gutenberg. Check the files with the following command:
$ ls -l /tmp/gutenberg/
- Next, we start the DFS and MapReduce layers of the cluster:
$ start-dfs.sh
$ start-mapred.sh
Run the jps command on every node to check that the NameNode, DataNodes, JobTracker, and TaskTrackers are all up and running.
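If everything started correctly, the output on the master should look roughly like this (the process IDs will of course differ, and DataNode/TaskTracker appear on the master only if it also acts as a worker); each slave should show at least DataNode and TaskTracker:
$ jps
14799 NameNode
14880 DataNode
15183 SecondaryNameNode
15262 JobTracker
15897 TaskTracker
16284 Jps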
- Next, we copy the local files (here, the downloaded text files) into HDFS:
$ hadoop dfs -copyFromLocal /tmp/gutenberg /Users/hduser/gutenberg
$ hadoop dfs -ls /Users/hduser
If the files are copied successfully, we will see something like the following:
Found 2 items
drwxr-xr-x - hduser supergroup 0 2013-05-21 14:48 /Users/hduser/gutenberg
Furthermore, let us check what is in /Users/hduser/gutenberg on HDFS:
$ hadoop dfs -ls /Users/hduser/gutenberg
Found 7 items
-rw-r--r-- 2 hduser supergroup  336705 2013-05-21 14:48 /Users/hduser/gutenberg/pg132.txt
-rw-r--r-- 2 hduser supergroup  581877 2013-05-21 14:48 /Users/hduser/gutenberg/pg1661.txt
-rw-r--r-- 2 hduser supergroup 1916261 2013-05-21 14:48 /Users/hduser/gutenberg/pg19699.txt
-rw-r--r-- 2 hduser supergroup  674570 2013-05-21 14:48 /Users/hduser/gutenberg/pg20417.txt
-rw-r--r-- 2 hduser supergroup 1540091 2013-05-21 14:48 /Users/hduser/gutenberg/pg4300.txt
-rw-r--r-- 2 hduser supergroup  447582 2013-05-21 14:48 /Users/hduser/gutenberg/pg5000.txt
-rw-r--r-- 2 hduser supergroup  384408 2013-05-21 14:48 /Users/hduser/gutenberg/pg972.txt
- We start our MapReduce Job
Let us run the MapReduce WordCount example:
$ hadoop jar hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
N.B.: This assumes you are already in the HADOOP_HOME directory. If not, then:
$ hadoop jar ABSOLUTE/PATH/TO/HADOOP/DIR/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
Or, if you have installed Hadoop in /usr/local/hadoop, then:
$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
The output will look something like this:
13/05/22 13:12:13 INFO mapred.JobClient:  map 0% reduce 0%
13/05/22 13:12:59 INFO mapred.JobClient:  map 28% reduce 0%
13/05/22 13:13:05 INFO mapred.JobClient:  map 57% reduce 0%
13/05/22 13:13:11 INFO mapred.JobClient:  map 71% reduce 0%
13/05/22 13:13:20 INFO mapred.JobClient:  map 85% reduce 0%
13/05/22 13:13:26 INFO mapred.JobClient:  map 100% reduce 0%
13/05/22 13:13:43 INFO mapred.JobClient:  map 100% reduce 50%
13/05/22 13:13:55 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:13:59 INFO mapred.JobClient:  map 85% reduce 100%
13/05/22 13:14:02 INFO mapred.JobClient:  map 100% reduce 100%
13/05/22 13:14:07 INFO mapred.JobClient: Job complete: job_201305211616_0011
13/05/22 13:14:07 INFO mapred.JobClient: Counters: 26
13/05/22 13:14:07 INFO mapred.JobClient:   Job Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Launched reduce tasks=3
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=118920
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/22 13:14:07 INFO mapred.JobClient:     Launched map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     Data-local map tasks=10
13/05/22 13:14:07 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=54620
13/05/22 13:14:07 INFO mapred.JobClient:   File Output Format Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Written=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   FileSystemCounters
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_READ=4151123
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_READ=5882320
13/05/22 13:14:07 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6937084
13/05/22 13:14:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1267287
13/05/22 13:14:07 INFO mapred.JobClient:   File Input Format Counters
13/05/22 13:14:07 INFO mapred.JobClient:     Bytes Read=5881494
13/05/22 13:14:07 INFO mapred.JobClient:   Map-Reduce Framework
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input groups=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Map output materialized bytes=2597630
13/05/22 13:14:07 INFO mapred.JobClient:     Combine output records=178795
13/05/22 13:14:07 INFO mapred.JobClient:     Map input records=115251
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce shuffle bytes=1857123
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce output records=114901
13/05/22 13:14:07 INFO mapred.JobClient:     Spilled Records=463427
13/05/22 13:14:07 INFO mapred.JobClient:     Map output bytes=9821180
13/05/22 13:14:07 INFO mapred.JobClient:     Total committed heap usage (bytes)=1567514624
13/05/22 13:14:07 INFO mapred.JobClient:     Combine input records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     Map output records=1005554
13/05/22 13:14:07 INFO mapred.JobClient:     SPLIT_RAW_BYTES=826
13/05/22 13:14:07 INFO mapred.JobClient:     Reduce input records=178795
- Retrieving the Job Result
To read the result directly from HDFS without copying it to the local file system:
$ hadoop dfs -cat /Users/hduser/gutenberg-output/part-r-00000
Let us copy the results to the local file system, though:
$ mkdir /tmp/gutenberg-output
$ hadoop dfs -getmerge /Users/hduser/gutenberg-output /tmp/gutenberg-output
$ head /tmp/gutenberg-output/gutenberg-output
We will get output like this:
"'Ample.' 1 "'Arthur!' 1 "'As 1 "'Because 1 "'But,' 1 "'Certainly,' 1 "'Come, 1 "'DEAR 1 "'Dear 2 "'Dearest 1 "'Don't 1 "'Fritz! 1 "'From 1 "'Have 1 "'Here 1 "'How 2
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
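If you do want the result ordered by word frequency, one simple option is to sort the merged local file numerically on its second (count) column, for example (using the path from the getmerge step above):
$ sort -k 2,2 -n -r /tmp/gutenberg-output/gutenberg-output | head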
Resources:
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- http://hadoop.apache.org/docs/current/