Running UM jobs on the BAS Cluster

To run UM jobs on the BAS cluster from ESSC, you will need the following.

Initial setup.

You will need an account on the BAS cluster. The cluster is loosely referred to as "bslhades", but that is really the name of the head node, which you cannot log into. From ESSC you can log into bslcene.nerc-bas.ac.uk. That machine is not part of the cluster, but it can see the cluster disks, and you can also see the queues from there (if you first type "setup v5 sge"). From there you can also log into the cluster workstation nodes, which are not directly accessible from ESSC. These are bslhades-ws1 to bslhades-ws5.

From the workstation nodes, you can see the whole cluster, including the execution nodes (quad001 to quad004 and node001 to node???). It is possible to log into the execution nodes, but do so only for very limited purposes, e.g. to list your processes if there is a problem; otherwise you risk disrupting other people's jobs. Any jobs you submit on the workstation queue will run on the workstation nodes; the model is set up to do this for compile jobs. Other jobs will run on the main execution nodes.

The main thing to be aware of is that your home directory on the workstation nodes (and bslcene) is different from the one on the execution nodes. Everywhere it is called e.g. /users/chuuab, but this points to different places on different nodes. The area /data/hades-users/ is visible on all nodes, but only on the execution nodes is it the home directory. You need paths to your UM files that work in both compilation jobs (workstation nodes) and run jobs (execution nodes). I suggest you create a directory /data/hades-users/username/um (this is $HOME/um on execution nodes), then create a symlink $HOME/um on the workstation nodes that points to it. $HOME/um will then always see the same files, so put everything you will use for the model in ~/um and call that directory $MY_UMHOME in your UMUI jobs.
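The directory and symlink described above could be set up along the following lines. This is only a sketch: the function name link_um_home is made up for illustration, and both paths are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# link_um_home SHARED_DIR HOME_DIR
#   SHARED_DIR  e.g. /data/hades-users/<username>/um (visible on all nodes)
#   HOME_DIR    the home directory on the node where this is run
link_um_home() {
    shared=$1
    home=$2
    mkdir -p "$shared" || return 1
    # Nothing to do if HOME_DIR/um already is the shared directory, as on the
    # execution nodes, or if the symlink was made on an earlier run.
    if [ -d "$home/um" ] &&
       [ "$(cd "$home/um" && pwd -P)" = "$(cd "$shared" && pwd -P)" ]; then
        return 0
    fi
    # Refuse to clobber an unrelated file or directory.
    if [ -e "$home/um" ] || [ -L "$home/um" ]; then
        echo "link_um_home: $home/um already exists" >&2
        return 1
    fi
    ln -s "$shared" "$home/um"
}
```

On a workstation node you would call it as: link_um_home /data/hades-users/username/um "$HOME"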

There are, however, a few files that need to be directly in your home directory, because the model expects to find them there. For each of these, you can put it under $HOME/um and then create symlinks to it in both home directories. They are:

  1. A "setvars" file. Copy the one from ~chuuab
  2. A "umui_runs" directory.
  3. A .profile file with the one line ". $HOME/um/setvars"
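One way to create the three links listed above, sketched under the assumption that $HOME/um already exists. The function name link_um_dotfiles is made up for illustration. Anything already present in the home directory is left untouched, and setvars itself still has to be copied in by hand.

```shell
#!/bin/sh
# link_um_dotfiles HOME_DIR
# Keep umui_runs and .profile under HOME_DIR/um and link them, together
# with setvars, into HOME_DIR itself.
link_um_dotfiles() {
    home=$1
    um=$home/um
    if [ ! -d "$um" ]; then
        echo "link_um_dotfiles: $um does not exist" >&2
        return 1
    fi
    mkdir -p "$um/umui_runs" || return 1
    # The one-line .profile; single quotes keep $HOME unexpanded in the file.
    [ -f "$um/.profile" ] || printf '%s\n' '. $HOME/um/setvars' > "$um/.profile"
    [ -f "$um/setvars" ] || echo "note: copy a setvars file into $um" >&2
    for name in setvars umui_runs .profile; do
        # Relative links, so they resolve the same way from either home.
        if [ ! -e "$home/$name" ] && [ ! -L "$home/$name" ]; then
            ln -s "um/$name" "$home/$name" || return 1
        fi
    done
}
```

Run it once on a workstation node and once on an execution node, since the two home directories are separate.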

You also want ssh keys set up so that you can ssh freely between nodes without a password, or at the very least you need to be able to ssh from the execution nodes into bslcene, because of how job output is transferred to ESSC. I assume you know how this is done; otherwise ask and I'll document it.
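For reference, the usual OpenSSH approach is to generate a key pair with ssh-keygen and append the public half to ~/.ssh/authorized_keys in each home directory you want to log into. Below is a minimal sketch of the second step, assuming the standard OpenSSH file layout; the function name authorise_key is made up for illustration.

```shell
#!/bin/sh
# authorise_key PUBKEY_FILE SSH_DIR
# Append a public key to SSH_DIR/authorized_keys unless it is already there,
# and tighten the permissions, which sshd normally insists on.
authorise_key() {
    pub=$1
    dir=$2
    mkdir -p "$dir" || return 1
    chmod 700 "$dir" || return 1
    touch "$dir/authorized_keys" || return 1
    chmod 600 "$dir/authorized_keys" || return 1
    # -x: match whole lines only, -F: treat the key as a fixed string.
    grep -qxF -- "$(cat "$pub")" "$dir/authorized_keys" ||
        cat "$pub" >> "$dir/authorized_keys"
}
```

A typical sequence would be "ssh-keygen -t rsa -f ~/.ssh/id_rsa" followed by "authorise_key ~/.ssh/id_rsa.pub ~/.ssh". Because the workstation and execution nodes have different home directories, the second step probably has to be repeated in each.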

You will also want a ".cshrc" file in your home directory on bslcene, with the line: "setenv SGE_ROOT /packages/sge"
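If you already have a .cshrc there, it is safer to append the line only when it is missing. A small sketch (the helper name ensure_line is made up for illustration):

```shell
#!/bin/sh
# ensure_line FILE LINE
# Append LINE to FILE unless an identical line is already present.
ensure_line() {
    touch "$1" || return 1
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}
```

On bslcene this would be used as: ensure_line "$HOME/.cshrc" 'setenv SGE_ROOT /packages/sge'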

In the UMUI job.

NOTE: see job xabsd as an example.

As a minimum, you will want the following differences from the corresponding job on gorgon:

  • -> Model Selection
    • -> User Information and Target Machine
      • -> General details
        • Entry box: Running the Job on the following machine
        • Entry is set to 'bslcene.nerc-bas.ac.uk'
        • Entry box: Target Machine user-id
        • Entry is set to your username
Note here: the target machine is the one which is visible from outside BAS, through which jobs can be submitted to the queues, rather than the cluster that it will actually run on.

  • -> Model Selection
    • -> Sub-Model Independent
      • -> File & Directory Naming. Time Convention & Envirmnt Vars.

Set DATAM and DATAW to something under $MY_UMHOME, for example:

  • $MY_UMHOME/datam/$RUNID and $MY_UMHOME/dataw/$RUNID

Also set these environment variables:

  • CLL_UMHOME /users/chuuab/um
  • LOCAL_MODS $CLL_UMHOME/mods
  • LOCAL_SCRIPTMODS $CLL_UMHOME/script_mods
  • LOCAL_OVERRIDES $CLL_UMHOME/comp_overrides
  • DUMPS $CLL_UMHOME/dumps/HadCM3
  • ANCIL $CLL_UMHOME/ancil
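Since /users/chuuab points to different places on different nodes, it may be worth checking that the directories named above are visible from both a workstation node and an execution node before submitting anything. A rough sketch, with a made-up function name check_um_dirs; the subdirectory names are taken from the variable list above.

```shell
#!/bin/sh
# check_um_dirs CLL_UMHOME
# Report which of the directories referred to by LOCAL_MODS,
# LOCAL_SCRIPTMODS, LOCAL_OVERRIDES, DUMPS and ANCIL are missing.
# Returns non-zero if any of them is.
check_um_dirs() {
    cll=$1
    missing=0
    for dir in mods script_mods comp_overrides dumps/HadCM3 ancil; do
        if [ ! -d "$cll/$dir" ]; then
            echo "missing: $cll/$dir"
            missing=1
        fi
    done
    return "$missing"
}
```

For example: check_um_dirs /users/chuuab/um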

And here:

  • -> Model Selection
    • -> Sub-Model Independent
      • -> Compilation and Modifications
        • -> User-defined compile option overrides

You want:

  • $LOCAL_OVERRIDES/mygcom
and
  • $LOCAL_OVERRIDES/O3

NOTE: using the compile option override (-O3) will force a recompilation of the whole model. However, the model installation has been set up with every optimisation under the sun, and the resulting executable will crash when you try to run it, so this compile option override will make it use "only" -O3. To avoid lots of recompilation time, run from an existing executable where possible, but it should still be an executable that you have compiled on the cluster, not one from ESSC.

To get the output back, you need to have enabled the archiving script. Make sure that in this window:

  • -> Model Selection
    • -> Sub-Model Independent
      • -> Post Processing
        • -> Main Switch + General Questions

you have the main switch at the top set to yes, and also have both "delete superseded PP files" and "delete superseded restart dumps" enabled.

You also need to set up archiving in

  • -> Model Selection
    • -> Sub-Model Independent
      • -> Post Processing
        • -> initialisation and processing of standard PP files

but be aware that this will be ignored without the "delete superseded PP files" option enabled.

To submit the job.

Setting up rsync

Before running the job, you need to set up an rsync server listening over an ssh port forward. The output archiving system at BAS (the equivalent of the one which on pegasus merely copies files to ~/um_archive) will then run an rsync client on bslcene to connect to this server and transfer the files back to ~/um_archive on pegasus. NB this happens via the "putfile" script (~chuuab/um/bin/putfile).

Here's how to do it.

Initial setup

at ESSC (on pegasus):

  • make yourself a copy of the script ~iwi/bas/rsync/setup-daemon
  • edit this to change the path of the config file (in variable $conf_file) to something of your choice
  • make a copy of the config file ~iwi/bas/rsync/conf, with the path name that you specified
  • edit this rsync config file, to change the path names of the log file, lock file, and secrets file
  • create empty lock and log files, and create a secrets file (with restricted permissions) containing the line:
    • umwrite:password
  • here "password" stands for something arbitrary that you make up
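To give an idea of what the files in the steps above end up looking like, here is a sketch that writes a minimal config, the empty log and lock files, and a secrets file. It is not a copy of ~iwi/bas/rsync/conf, whose real contents may differ: the function name make_rsync_conf and the file names log, lock, secrets and conf are made up for illustration. The module name umarchive and the user umwrite are taken from the rsync URL that setup-daemon prints, and the option names are standard rsyncd.conf parameters.

```shell
#!/bin/sh
# make_rsync_conf DIR ARCHIVE_DIR PASSWORD
#   DIR          where the config, log, lock and secrets files go
#   ARCHIVE_DIR  directory the server should write into (e.g. ~/um_archive)
make_rsync_conf() {
    dir=$1
    archive=$2
    password=$3
    mkdir -p "$dir" || return 1
    touch "$dir/log" "$dir/lock" || return 1
    # Secrets file: one "user:password" pair per line, readable by you only.
    ( umask 077 && printf '%s\n' "umwrite:$password" > "$dir/secrets" ) || return 1
    chmod 600 "$dir/secrets" || return 1
    cat > "$dir/conf" <<EOF
log file = $dir/log
lock file = $dir/lock
use chroot = false

[umarchive]
    path = $archive
    read only = false
    auth users = umwrite
    secrets file = $dir/secrets
EOF
}
```

The "use chroot = false" line is there because the daemon runs as an ordinary user, not root.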

at BAS (on bslcene):

  • Create yourself directories ~/.rsync and ~/.rsync/default
  • Then create file ~/.rsync/default/password, containing only a single line, which consists only of the password that you chose above. Restrict the permissions on this file.
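The two steps above amount to something like the following sketch. The function name store_rsync_password is made up for illustration, and the optional second argument exists only so that it can be tried out somewhere other than your real home directory.

```shell
#!/bin/sh
# store_rsync_password PASSWORD [HOME_DIR]
# Create HOME_DIR/.rsync/default/password holding just the password,
# readable by you only. HOME_DIR defaults to $HOME.
store_rsync_password() {
    dir=${2:-$HOME}/.rsync/default
    mkdir -p "$dir" || return 1
    # umask 077 so that a newly created file is never world-readable.
    ( umask 077 && printf '%s\n' "$1" > "$dir/password" ) || return 1
    # chmod as well, in case the file already existed with looser permissions.
    chmod 600 "$dir/password"
}
```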

To start the server (each time if not already running)

Run your copy of the "setup-daemon" script. If all goes well, you should see something like this (but with different port numbers):

Trying to launch and test rsync server on port 15490:
success
Trying to set up and test port forward (bslcene.nerc-bas.ac.uk:13120 -> localhost:15490):
launched ssh
success
rsync://umwrite@localhost:13120/umarchive/ will now work at bslcene.nerc-bas.ac.uk
(saved in $HOME/.rsync/default/path)
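Judging by the output above, the script does roughly two things: it starts an rsync daemon on a local port, and it opens a reverse ssh port forward so that a port on bslcene leads back to that daemon. The sketch below only prints equivalent commands rather than running them. It is a guess at the general shape, not the contents of setup-daemon; the function name show_rsync_tunnel is made up, and the real script picks its own port numbers.

```shell
#!/bin/sh
# show_rsync_tunnel CONF_FILE LOCAL_PORT REMOTE_HOST REMOTE_PORT
# Print (do not run) an rsync daemon command and the matching ssh command.
show_rsync_tunnel() {
    conf=$1
    local_port=$2
    remote_host=$3
    remote_port=$4
    # 1. rsync daemon on this machine, reading the given config file.
    echo "rsync --daemon --config=$conf --port=$local_port"
    # 2. Reverse forward: connections to REMOTE_PORT on REMOTE_HOST are
    #    carried back over ssh to the daemon's port here. -N runs no remote
    #    command and -f puts ssh into the background.
    echo "ssh -f -N -R $remote_port:localhost:$local_port $remote_host"
}
```

With the port numbers from the example output, this would be: show_rsync_tunnel /path/to/conf 15490 bslcene.nerc-bas.ac.uk 13120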

You can also test that the server is running and the port forward is set up by running the "putfile" script on bslcene (usage: "putfile filename subdirectory", where subdirectory is relative to your ~/um_archive on pegasus).

You may also find that the ssh connection eventually times out if no data is transferred through it. So it is recommended to launch a script such as the following on bslcene, just to keep the connection alive without consuming too many resources:

#!/bin/sh
# Keep the ssh port forward alive by sending a tiny file every five minutes.
echo hello > my_small_file
while true
do
    ~chuuab/um/bin/putfile my_small_file testdir
    sleep 300
done

Actually submitting the job

The easiest way to submit the job is to use the "umsubmit" script on pegasus. This is provided as one of the helper scripts used by the UMensemble package (/usr/local/Cluster-Apps/UMensemble/helpers/umsubmit), although you may also have a copy elsewhere. If you use this script, check the comment lines for its usage, but you probably want something like:
  • umsubmit -h bslcene.nerc-bas.ac.uk -u user_id job_name
Alternatively, you can invoke umsubmit via the ens_sub.py script in the UMensemble package, whether or not you actually have an ensemble. If your job is an ensemble template, then invoke it with:
  • ens_sub.py template_id
or otherwise, for a single run ID (either an ensemble member or a non-ensemble setup), add the "-s" flag:
  • ens_sub.py -s run_id

Note regarding ensembles

Running an ensemble job is in principle no more complex. The ensemble generation utility (see GeneratingEnsembles) running on pegasus will create a set of individual runs, each of which is submitted to the BAS cluster via ens_sub.py, i.e. the BAS cluster does not need to know that these are ensemble members. For an example ensemble template job, see xabsf, which is designed to be used with differing ocean start dumps (with the "-o" flag on ens_gen.py); this has been tested. A copy of the tools for generating the set of start dumps is now installed in ~chuuab/um/UMensemble/, or alternatively the dumps could be generated on pegasus and then copied across.

Topic revision: r5 - 22 Apr 2008 - 14:31:23 - AlanIwi
 