Evaluation of Hadoop
Introduction
Hadoop is a distributed computing platform written in Java. This report describes an evaluation of Hadoop to determine its suitability for speeding up operations on large environmental science data sets. I used Hadoop version 0.1.0 for the evaluation, the first official release. The latest version at the time of writing is 0.5.0. The test application was run on computers in the
ESSC network, running various versions of Linux and Solaris and various different versions of Java.
According to the
Hadoop wiki,
"Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. The intent is to scale Hadoop up to handling thousands of computers."
The cluster referred to above can be any set of networked computers running Linux or Unix with Java installed. The word "cluster" is not used to mean a high performance computing (HPC) cluster like the one at ESSC (see
ClusterNotes), and does not imply the presence of cluster management software such as
Condor or
Sun Grid Engine. Hadoop includes a distributed file system, a Web portal for job monitoring and an implementation of
MapReduce, a parallel programming model developed by Google.
Unlike parallel programming models based on message passing such as
MPI, MapReduce lends itself more to trivially parallel tasks. In the Hadoop implementation of MapReduce, a large task can be parallelised by splitting it into a number of units of work, each defined by a record in a text file. Each unit of work is assigned (or
mapped) to a different computer in the cluster, which runs a map process to carry out the unit of work. The map processes run in parallel with each other, each one producing results in the form of one or more key-value pairs. These key-value pairs are collected by
reduce processes which analyse the results. There is one reduce process for each key in the output from all the map tasks, and the reduce tasks can be run in parallel on different computers, just like the map tasks.
Installation and configuration
These comments are intended to supplement the information given in the
Hadoop wiki. Installation is just a matter of expanding the zipped tar file downloaded from the
Hadoop Web site. For developing Java applications with Hadoop, I found it helpful to add the Hadoop jar file (called hadoop-0.1.0.jar in version 0.1.0) to my CLASSPATH environment variable.
To configure Hadoop for my test application, three files in the
conf directory had to be edited.
hadoop-env.sh
This bash shell script is executed once on each of the computers in the cluster. The only environment variable that must be defined for all Hadoop installations is JAVA_HOME, the path to the Java installation. If the path is different on different computers, some shell scripting is required to produce the right path to every computer running the script. For example, at ESSC it is necessary to differentiate between Linux and Unix machines when defining the Java path.
Two additional variables had to be defined for the Hadoop installation at ESSC, to override some of the default values which are described by the comments in the file. The value of HADOOP_OPTS was set to "-Dorg.mortbay.xml.XmlParser.NotValidating=true" to stop Java trying to validate various XML files using a remote schema via the Internet. HADOOP_SSH_OPTS was set to " ", because the default value contained ssh options that were not valid on some of our computers.
hadoop-site.xml
This file is used to override default values of one or more of the parameters defined in hadoop-default.xml. ESSC-specific values for the following parameters were given in hadoop-site.xml:
fs.default.name
mapred.job.tracker
mapred.local.dir
mapred.system.dir
mapred.temp.dir
dfs.name.dir
dfs.data.dir
dfs.replication
Explanations of these parameters are provided in the comments in hadoop-default.xml. One important thing to note about using Hadoop is that the distributed file system must always be used, even if common data storage directories are mounted on all the computers in the cluster as is the case at ESSC. The default value for
fs.default.name is "local", which switches off the DFS, but this must only ever be used if
mapred.job.tracker is also "local", meaning that no distributed processing will be carried out. In other words, if you are doing distributed processing you need to use the distributed file system, even if you don't think you need to. On no account try to use the DFS over NFS by specifying NFS-mounted paths for any of the DFS parameters; This will cause Hadoop to fail.
slaves
This is simply a list of host names specifying the computers in the cluster that are available to run map and reduce tasks. Before running Hadoop programs, various Hadoop daemons must be launched on all the computers in the cluster. This is done by running the
start-all.sh script in the
bin directory of the Hadoop installation.
Performance in temperature search test
The test problem I used to evaluate Hadoop was the same as the one I used for the evaluation of IBM's
Distributed Parallel Programming Environment for Java (DPPEJ); the test problem and the evaluation of DPPPEJ are described here:
EvaluationOfDPPEJ. Briefly, the test problem involved searching through a series of files in a marine data set, checking for sea surface temperatures above a threshold value. I adapted the code from the Word Count example program that comes with Hadoop. The source code for the temperature search program is attached to this document. The
gadsutils package is common to the DPPEJ and Hadoop temperature searching programs. Links to the attachments are provided at the end of this document.
In the evaluation of DPPEJ, the time taken to search through the first 250 files in the data set was measured. Hadoop results for this test are not given here because parallel processing did not reduce the run time at all. This probably indicates that Hadoop's distributed processing overheads are greater than those of DPPEJ, with the result that the benefits of parallel processing are negated by the costs of launching and monitoring the distributed computational tasks. This is not surprising, because Hadoop is a much larger and more complex system than DPPEJ. Unlike DPPEJ, Hadoop has a distributed file system, a job monitoring portal and a method of fault tolerance that ensures that jobs can be completed if one or more computers in the cluster experience problems. All these extra features incur a computational cost, which means that Hadoop is useful for applications where the time taken to do the processing is large compared to the computational overheads of Hadoop itself.
In an attempt to find a more suitable test for Hadoop, the number of files involved in the temperature search was increased to 1808, covering the whole data set. The figure shown in file Unix1808.jpg (which can be seen at the end of this document) shows the results from a series of tests involving a search through the whole set of 1808
NetCDF files. The time taken for the search was measured with parallel processing on different numbers of computers, ranging from 1 to 13. This test was repeated ten times, and each time measurement in the figure is the average of the ten separate results for each number of computers. The results show that parallel processing with Hadoop reduced the time taken to perform the task by a factor of approximately three, compared to the time taken to run the same program sequentially on a single computer. This is a smaller time reduction than was achieved by DPPEJ in the 250-file test, which indicates that bigger problems would be needed in order to see Hadoop perform well.
Another thing to note about the figure in Unix1808.jpg is the large variation in run time over the ten separate tests, indicated by the large error bars. As these run times are dominated by Hadoop's distributed processing overheads, it is likely that run times on bigger problems would be more consistent.
Conclusions
Hadoop seemed to be very stable and reliable on a range of platforms, but it was not well suited to the problem I tested it on. I expect that the performance would be much more impressive for tasks distributed over much larger numbers of computers, meaning hundreds or even thousands, as suggested by the extract from the Hadoop wiki given above. The MapReduce programming framework is quite flexible, and could be used as a way to achieve parallel processing in a wide range of applications. However, the requirement to define distributed units of work by lines in text files is potentially limiting, and would certainly result in some very cumbersome code if the method was used to parallelise a complex algorithm.
Attachments
- Unix1808.jpg: Test results for temperature search program running on UNIX machines at ESSC:
--
DanBretherton - 15 Aug 2006