Evaluation of the IBM Distributed Parallel Programming Environment for Java (DPPEJ)
Introduction
The DPPEJ is described on the
IBM AlphaWorks Web site. It uses Java
RMI to enable applications incorporating parallel algorithms to be easily implemented in Java and deployed on a mixed platform network. Parallel algorithms can be written using concepts similar to
MPI programming, but the programmer does not have to understand RMI or Java threads. A companion product, the Grid Applications Framework For Java (
GAF4J ) allows DPPEJ applications to run on
Globus grids. I performed a series of tests to determine whether DPPEJ could be used to implement fast, interactive environmental data processing applications. I used Java version 1.4.2 running on Linux and Solaris for these tests.
Running Programs With DPPEJ
DPPEJ comes with two example programs, which are included in the JAR file that contains the DPPEJ Java classes (dthread.jar). The source code for these example programs is also included, in directory
examples, but in the case of the Array Addition program, this does not correspond exactly to the compiled classes in the JAR file. The compiled version of the Array Addition program in the JAR file takes two arguments, the number of threads to be used and the size of an array, but in the source code these values are hard coded.
I wrote a simple Java program to look through a series of files in a marine data set, looking for sea surface temperatures above a threshold value. The output from the program was the number of files in the data set containing at least one temperature above the threshold, and a list of the paths to those files. The data set I used contains 1808
NetCDF files, each 22MB in size. In my tests the first 250 files in this set were searched for temperatures greater than 32 Celsius. I used the Array Addition program source code as the basis for my program because it is quite simple. The source code for my program is attached to this document.
By following the instructions in the DPPEJ documentation I managed to get the example programs working straight away, but making my own code work was not quite so straightforward. I found that all programs must be run from the $DPPEJ_HOME/bin directory, and a directory or JAR file containing the compiled classes must be present in $DPPEJ_HOME/bin. To enable the program to be run from within the
NetBeans development environment I created two symbolic links between $DPPEJ_HOME/bin and the NetBeans project directory for my program. Firstly, I created a link in $DPPEJ_HOME/bin pointing to the location of the directory containing the compiled classes (TempExceeds/build/classes/tempexceeds). Secondly, I created a link in the NetBeans project directory (TempExceeds) pointing to the daemonregistry.properties file in $DPPEJ_HOME/bin.
The distributed daemons seem to remember the classes as they are when they are first executed, so if any changes are made to the code while the daemons are running, it is the original version of the code that is executed and not the revised version. Therefore, the daemons must be stopped and restarted every time the classes are recompiled.
Performance of DPPEJ With Example Programs
The Array Addition program runs so quickly that it is difficult to gain any benefit from distributed processing. Attempting to count very large arrays results in a �out of memory� error. I tested the All Pairs Shortest Path (ASP) Algorithm by performing repeated timed runs with different numbers of threads. The results for threads running on our network of Linux machines are shown in ASP_Linux750.jpg. In the figure, the run time for each number of threads is the average from ten separate runs with the problem size parameter set to 750, and the error bars show the standard deviation. The plot shows that there was a substantial decrease in run time when three threads were used, but only a small benefit to every additional thread after that. The strange increase in run time at four threads is probably due to other processes that were running on those particular computers while those runs were in progress. Running this algorithm with multiple threads achieved a reduction in average run time of nearly a factor of four.
It is possible that even greater improvements in performance could be achieved for larger problems, where the overheads involved in scheduling and tracking the distributed threads would be smaller in comparison with the processing time. However, it was not possible to obtain any results for larger ASP problem sizes because the distributed daemons became unreliable when the ASP problem size parameter was increased to 1000 or more. After a number of runs the program usually crashed, giving a series of Java messages beginning with the following lines.
DThreadContractor::run() Errorerror marshalling arguments; nested exception is:
- java.net.SocketException
- Broken pipe
It is as if the daemons became tired after a while, and restarting the daemon registry and the all daemons was the only way to cure the problem. With a problem size of 1500, as suggested in the DPPEJ documentation, it was impossible to repeat the run more than once without restarting all the daemons. With a problem size of 1000 I was able to repeat each run successfully exactly 17 times between restarts, and with a problem size of 750 I experienced no problems at all. It seems as if the daemons fail after they have performed a certain amount of work. One possible explanation is that they fail after they have transferred a particular amount of data between the distributed threads and the master node, which might explain why more runs of the smaller problem sizes were possible between restarts. This would probably be an easy bug to track down and fix.
Performance of DPPEJ With Temperature Searching Program
The tests involved measuring the time taken to search through the first 250 of the [[http://www.unidata.ucar.edu/software/netcdf/][NetCDF]] files looking for extreme temperatures. All the files were stored on a single server, which contains much of the data used by the researchers at
ESSC. The load on this server varies throughout the day, which affects the time taken to access a particular set of files. To prevent these load variations affecting my results I ran each of my tests ten times. By splitting the task into six or more threads running in parallel on separate machines it was possible to reduce the run time of this task by a factor of five, compared to carrying out the task sequentially on one processor in the conventional way. The results are described in more detail below.
The first test involved running all the daemons on a single Linux machine called devon. A plot showing the variation of run time with increasing numbers of threads is shown in file DevonResults.jpg, which is displayed at the end of this document. The run time for each number of threads is the average from ten separate runs, and the error bars show the standard deviation. As expected, there was little benefit to dividing the work between several threads all running on the same machine. When this test was repeated with the threads running on the data server itself, the run times were only slightly shorter. These results are shown in CadfaelResults.jpg. This shows that the time taken for the data to be transferred over the local network from the server to the other machines running my program is small compared to the total run times.
The second test involved running the threads on different Linux machines. The results are shown in file LinuxResults.jpg, which shows that there was a big decrease in run time when two threads running in parallel on different machines were used, but there was no benefit to using more than five threads running in parallel. This is probably because when there are four or more threads trying to access the same disk on the server at the same time, the application becomes limited by the capacity of the server to deliver the data.
The third test involved running the threads on a set of aging Sun Sparc computers running Unix. The results are shown in file Unix_Results.jpg, which shows that there is little benefit to be gained by having more than about five threads, as expected from the tests on the Linux machines. There is a strange increase in average run time at 8 threads, which might represent the point at which the overheads associated with the distributed processing outweighed the benefits of having the extra thread, but this does not explain why the average run times for nine threads and above are similar to the average run time for seven threads. The difference between the processing speeds of the individual machines running the threads did not have a significant effect on the run times for this data intensive application, but the run times are surprisingly long compared to the Linux times. This might be explained by the deficiencies of the older hardware in the Suns, but this does not explain why the run times do not decrease much with five or more threads. The data server appeared to be causing a bottleneck at run times of about 30 seconds when it was supplying the same data to the faster Linux machines, so it is not clear why this server bottleneck would occur at 100 seconds just because the data was being processed on slower hardware.
Concluding Remarks
It is relatively easy to implement and deploy Java programs with parallel algorithms using DPPEJ, and these simple tests have shown that it can be used to reduce average run times by a factor of five or more. However, DPPEJ seems to be unreliable when used for computationally intensive tasks. If the reliability problems could be solved it would be worth considering the use of DPPEJ for interactive applications where run time is important. However, DPPEJ is proprietary technology and the licensing situation is not clear. I downloaded a free trial version of DPPEJ for evaluation. The trial license lasts for 90 days, but DPPEJ and
GAF4J are not on the list of licensable AlphaWorks products included in the AlphaWorks licensing programme. I have contacted IBM to request license terms for the use of DPPEJ and
GAF4J for research purposes. If it is possible to obtain a licence to use these tools, it would only be sensible to consider deploying them if IBM are prepared to solve the reliability problems themselves, or ideally, to release the source code.
The DPPEJ information was posted on the IBM Web site in August 2004, and it is not clear whether any development has taken place since then. It is possible that DPPEJ and
GAF4J have been incorporated into other IBM products and are no longer being developed as separate projects.
Attachments
- CadfaelResults.jpg:
- DevonResults.jpg:
- LinuxResults.jpg:
- Unix_Results.jpg:
--
Main.DanBretherton - 04 Apr 2006