Grid Remote Execution (G-Rex) is Java grid middleware that allows programs to be exposed as Web
services on remote clusters. It has two components, a client and a server.
The client is a Java command line program called grexrun, which
takes the place of the remote program in work-flow
scripts. In the case of parallel programs, grexrun typically replaces
mpirun. The server component is a Web application that runs inside a servlet
container such as Apache Tomcat. This has to be installed on the remote
cluster.

G-Rex has two key features that make it particularly suitable for running
large climate models:

1) Output files from the remote model are automatically transferred back to
the user while the model is running, and are deleted from the server when
they are no longer needed. This minimises the data footprint of the G-Rex
services, and allows users to continuously monitor the scientific progress of
the runs.

2) The G-Rex client program, grexrun, behaves in the same way as the model
would if it were running on the user's own computer. This makes it easy
to construct work-flows involving one or more remote services using shell
scripts. Output from the remote model becomes output from grexrun, and
grexrun waits until the remote job is finished before coming to an end
itself.

POLCOMS is now available on the three clusters in the NERC Cluster Grid and on HECToR via G-Rex. Before
these can all be accessed from POL and Daresbury (and other institutes) there are some firewall
issues to sort out. They are different for each cluster in the grid.

HECToR:

G-Rex services for HECToR are available from ESSC and POL at the moment. POL users can access the services via port 2006 on livlbo (James' desktop PC). If livlbo is unavailable or the services are not working, port 2006 on livgen can be used for testing, but please bear in mind that data transfer using this method goes via ESSC.

ESSC:
G-Rex is only available from POL at the moment, via an SSH tunnel to port 8080
on livgen (more on that later). To make the ESSC services available from
other institutes I would need to have a user account (or access to someone else's)
to enable me to set up an SSH tunnel.

BAS:
I've asked BAS to open up the G-Rex port to POL, but at the moment the BAS
G-Rex services are only accessible from ESSC. Please let me know if you want me to ask for the port to be
opened to any other institutes. BAS would need to know the IP address of the
relevant sub-network at the remote institute. SSH tunnels originating from BAS are not
allowed.

POL:
G-Rex is only available from ESSC at the moment, via an SSH tunnel. To make
the POL services available from another institute I would need access to a user
account to enable me to set up an SSH tunnel.

So basically, the situation at the moment is that all the services are
available from ESSC but the ESSC services can only be accessed from POL.

Here are some brief instructions for POL users on how to use the HECToR
services. The best way to start is probably to look at how I have set things
up in /work/dab/RUN_POLCOMS. The working directory is HUMB10 and the code is
in CODE. The latest version of G-Rex is in /work/dab/G-Rex.main. G-Rex needs the Sun Java
runtime environment, version 1.5 or above. GNU Java, which is present on
most Linux systems by default, probably won't work. The Java I use is
in /work/dab/jdk1.5.0_06-linux. To make G-Rex use this Java instead of GNU
Java, add /work/dab/jdk1.5.0_06-linux/bin to the start of your PATH variable.
When this has been done, check that "which java" gives the correct path,
which in my case is /work/dab/jdk1.5.0_06-linux/bin/java.
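
The PATH change described above might look like this in a Bourne-style shell. It is only a sketch: the JDK directory is the one quoted in the text, and the check at the end is an illustrative addition, not part of G-Rex.

```shell
# Assumed JDK location, taken from the text; change it to suit your system.
JAVA_BIN=/work/dab/jdk1.5.0_06-linux/bin

# Put the Sun JDK ahead of GNU Java on the search path.
PATH="$JAVA_BIN:$PATH"
export PATH

# Report which java the shell will now find.
java_path=$(command -v java || true)
if [ "$java_path" = "$JAVA_BIN/java" ]; then
    echo "java OK: $java_path"
else
    echo "warning: java resolves to '${java_path:-nothing}', expected $JAVA_BIN/java" >&2
fi
```

Putting these lines in your shell start-up file would make the change permanent for new sessions.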

Remote G-Rex services are accessed using the G-Rex client program. In Linux
this is a script called grexrun.sh in directory G-Rex/code/bin. The G-Rex
directory can be anywhere and renamed anything you like. I suggest adding
G-Rex/code/bin to your PATH. The format of a grexrun command is shown below.

grexrun.sh http://<USER_ID>:<PASSWORD>@<HOST>:<PORT>/G-Rex/<SERVICE> [<PARAM1> [<PARAM2>]... ]
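
Because every service URL has the same shape, it can be convenient to assemble it in one place. The sketch below does that with a small shell function. The function name grex_url and the environment variable GREX_PASSWORD are hypothetical helpers invented for this illustration; G-Rex itself knows nothing about them. The user name, host and port are the ones used in the examples on this page.

```shell
# Build a G-Rex service URL from its parts.
# Usage: grex_url USER_ID PASSWORD HOST PORT SERVICE
grex_url() {
    printf 'http://%s:%s@%s:%s/G-Rex/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# GREX_PASSWORD is a hypothetical variable; if it is unset, the
# placeholder word "password" is used instead.
url=$(grex_url db "${GREX_PASSWORD:-password}" livlbo 2006 make-polcoms)
echo "$url"
```

Keeping the password in a variable rather than typing it into each command also keeps it out of any scripts you share.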

The example commands below include my user name, with the word "password"
standing in for the real password.

Running any service with the single parameter --grex-help gives information
about what parameters the service expects. For example, the following
command gives information about the parameters needed by the make-polcoms
service.

grexrun.sh http://db:password@livlbo:2006/G-Rex/make-polcoms --grex-help

Here are some details about each of the services that are needed to compile
and run POLCOMS on HECToR. The example commands were executed on
livgen in directory /work/dab/RUN_POLCOMS/HUMB10.

1. make-polcoms
--------------------
The following command compiles POLCOMS on HECToR:

grexrun.sh http://db:password@livlbo:2006/G-Rex/make-polcoms source.tar.gz \
    work/dab/RUN_POLCOMS/CODE \
    /work/dab/RUN_POLCOMS/CODE/make_polcoms-HECTOR/make_gcoms_phys shelf

  • The first grexrun command line parameter is the URL of the service.
  • The second parameter is the path of a compressed tar file containing the code.
  • The third parameter is the path to the code inside that tar file. The third parameter is needed because the script that
    executes the service at HECToR needs to know where to go after it has unpacked
    the source archive. If you don't understand why that is, try
    unpacking /work/dab/RUN_POLCOMS/HUMB10/source.tar.gz; you will get a
    directory called "work".
  • The fourth parameter is the path to what I've been calling the "options script".
  • The output from the service is the POLCOMS
    executable file. The name of this file (usually "shelf") is specified in the fifth grexrun parameter.
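
One way to produce a source archive that unpacks to a directory called "work", as described for the third parameter, is to run tar relative to the root directory. This is a sketch under that assumption; it is not necessarily how source.tar.gz was actually made, and the function names pack_source and archive_top_dir are invented for the illustration.

```shell
# Pack a code directory so that the archive unpacks to its full relative path.
# Usage: pack_source ROOT REL_PATH ARCHIVE
#   ROOT      directory the path is relative to ("/" for /work/dab/...)
#   REL_PATH  path to the code, without a leading slash
#   ARCHIVE   name of the compressed tar file to create
pack_source() {
    tar -czf "$3" -C "$1" "$2"
}

# Print the top-level directory stored in an archive.
# Usage: archive_top_dir ARCHIVE
archive_top_dir() {
    tar -tzf "$1" | head -n 1 | cut -d/ -f1
}

# Example with the paths from the text (commented out because it needs
# the real code directory to exist):
# pack_source / work/dab/RUN_POLCOMS/CODE source.tar.gz
# archive_top_dir source.tar.gz
```

The REL_PATH given to pack_source is then the same string that is passed to the service as the third parameter.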

The version of the machine_list file in /work/dab/RUN_POLCOMS/CODE/v6.4/ has
some extra sections for compiling at ESSC and BAS. The section I used to
compile at ESSC is called shelf-essc-mpich-mx-pgi. There are equivalent
sections in the file for gigabit ethernet too.

2. polcoms
-------------
The example below is a command that runs POLCOMS on HECToR:

grexrun.sh http://db:password@livlbo:2006/G-Rex/polcoms input.tar.gz HUMB10.run001.01.1989 \
    --drm-name HUMB10 --drm-walltime 1:00:00 --pbs-mppwidth 12 --pbs-mppnppn 2 \
    --tdur 744.0 --tchk 744.0 --nens 1

The parameters needed by the service are described below. First the unflagged
parameters, in order of position on the command line (after the service URL):

1. Path of a compressed tar file containing all the input data required for
the run. In the model configuration I've been using, my input archive
contains the directory named "data" and the following files: filenames.dat,
logicaloptions.inp, metparams.dat, parameters.dat, scoord_params.dat,
openbcpoints.dat and shelf. In principle this archive could contain
anything, which makes the service more flexible. For example, if a new
version of POLCOMS requires an extra input file then that can just be added
to the archive.
2. The run identifier, which is used by G-Rex to identify the output files
when they are produced. G-Rex doesn't look inside filenames.dat to find the output file names.

Next are the flagged parameters. Most of them control the behaviour of the
Distributed Resource Manager (DRM), which is PBS in the case of HECToR; the rest are passed on to POLCOMS. All the DRM related parameters have
flags beginning with "--drm-", or "--pbs-" for flags that are specific to PBS. The "--drm-" parameters also work for services linked to clusters using Sun Grid Engine as the DRM.

  • Name of DRM job (--drm-name in the example above). This is optional. A default name is used if a name is
    not specified.
  • The maximum wall-clock time (--drm-walltime). This is optional. The default value is 12:00:00.
  • PBS mppwidth and mppnppn parameters (--pbs-mppwidth and --pbs-mppnppn).
  • Name of file for batch job standard output and error streams. These parameters are
    optional. If a filename is not specified the DRM generates a default name based
    on the batch job number. In G-Rex the names "stdout" and "stderr" are special. Output
    files of those names become the standard output and error streams of grexrun.
  • POLCOMS parameters (--tdur, --tchk and --nens in the example above). Note that the flagged options are preceded by "--" instead of the "-" used by the POLCOMS executable (usually called "shelf"). G-Rex converts these flagged options to the single hyphen format before launching POLCOMS on the remote cluster.
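
To make the double-hyphen to single-hyphen mapping concrete, the sketch below applies it to the POLCOMS options from the example command. It only illustrates the mapping described above; it is not G-Rex's own code, and the function name to_single_hyphen is invented for the illustration.

```shell
# Illustration only: rewrite each "--flag" word as "-flag" and leave
# all other words (values such as 744.0) unchanged.
to_single_hyphen() {
    out=""
    for arg in "$@"; do
        case "$arg" in
            --*) arg=${arg#-} ;;   # drop one of the two leading hyphens
        esac
        out="${out}${out:+ }${arg}"
    done
    printf '%s\n' "$out"
}

to_single_hyphen --tdur 744.0 --tchk 744.0 --nens 1
# prints: -tdur 744.0 -tchk 744.0 -nens 1
```

The printed line shows the form in which the options would reach the "shelf" executable on the remote cluster.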

At the last GCOMS meeting we discussed the possibility of G-Rex services that
do several runs in one batch job, to make best use of a slot after waiting a
long time to run on a system like HECToR. I haven't implemented any services
like that yet because I thought it would be better to keep things simple to
begin with. If someone can give me a specific example of the type of
composite batch job we are likely to need I will have a go at setting
something up.

3. qstat
---------
Here's the command:
grexrun.sh http://db:password@livlbo:2006/G-Rex/qstat

It isn't clever enough to accept all the PBS options at the moment.

4. qdel
--------
The service takes one parameter, the name of the batch job to be deleted. For
example:
grexrun.sh http://db:password@livlbo:2006/G-Rex/qdel 1234

I hope this will provide enough information to get you going. It is worth
pointing out that I use a couple of scripts to make executing the long-winded
grexrun commands a bit easier. If you are interested, have a look at /work/dab/grex/scripts/run-polcoms_HECTOR.sh, and the script that calls it for the HUMB10 example used above, /work/dab/RUN_POLCOMS/HUMB10/run-polcoms_HECTOR-WikiExample.sh.
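
If you would rather write your own wrapper, a minimal sketch might look like the one below. It is not a copy of the scripts mentioned above. The names GREXRUN, GREX_BASE and run_polcoms are hypothetical, and the default DRM settings are simply those of the HUMB10 example on this page.

```shell
# Client command; set GREXRUN=echo for a dry run that only prints the arguments.
GREXRUN=${GREXRUN:-grexrun.sh}
# Base URL of the HECToR services, as used in the examples on this page.
GREX_BASE=${GREX_BASE:-http://db:password@livlbo:2006/G-Rex}

# Usage: run_polcoms INPUT_ARCHIVE RUN_ID [extra flagged parameters...]
run_polcoms() {
    input=$1
    run_id=$2
    shift 2
    "$GREXRUN" "$GREX_BASE/polcoms" "$input" "$run_id" \
        --drm-name HUMB10 --drm-walltime 1:00:00 \
        --pbs-mppwidth 12 --pbs-mppnppn 2 "$@"
}

# Example (commented out so that sourcing this file does not start a run):
# run_polcoms input.tar.gz HUMB10.run001.01.1989 --tdur 744.0 --tchk 744.0 --nens 1
```

Running with GREXRUN=echo first is a cheap way to check the argument list before submitting a real job.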

Feel free to give me a ring if you need any help while you are trying out the services.

-- DanBretherton - 19-Apr-2009

Topic revision: r3 - 19 Apr 2009 - 22:27:30 - DanBretherton