Generating ensembles
Old development stuff has been moved to GeneratingEnsemblesDevel. Usage instructions follow.
A tool for making UM initial condition ensembles has been added to gorgon / pegasus, and is available for other systems also.
Initial set up
- On pegasus / gorgon, type module add UMensemble. This will add the directory /usr/local/Cluster-Apps/UMensemble to your executable search PATH. (Alternatively, run the executables from this directory explicitly.)
- To install on another machine, copy the tarball (attached) to a directory of your choice, extract it, cd to the top directory and run the Compile script. (Note - the Compile script compiles the C code and also byte-compiles the Python code for efficiency; byte-compilation might happen at run time anyway, but only if the invoking user has write permission to the installation.) If required, also edit common_functions.py as described under "Job naming" below.
Usage overview
These utilities are designed to be used in the following way:
- You create a template job in the UMUI. It is identical to the ensemble members that will be run, except that the filename of the atmos and/or ocean start dump is a template on which the names of the actual start dumps will be based.
- You process the template job in the UMUI, producing job files in ~/umui_jobs for the template job.
- You run a utility to generate the job files for the ensemble members in ~/umui_jobs. Note that these jobs do not exist in the UMUI database.
- You run a utility to generate the necessary start dumps.
- You run a utility to submit the ensemble members.
(Note, in your template job, it is probably sufficient to specify running from an existing executable, rather than have every ensemble member do compilation.)
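The command-line part of this sequence can be sketched as a small dry-run driver. This is only an illustration under stated assumptions: the run ID, dump name and ensemble size are the example values used elsewhere on this page, the ocean-IC case with perturb_sst is chosen arbitrarily, and workflow_commands and run_all are hypothetical helpers that are not part of UMensemble. The template job is assumed to have been processed in the UMUI already.

```python
import subprocess

def workflow_commands(template_run_id, size, ocean_dump):
    """Return the command lines for an ocean-IC ensemble, in the order they are run."""
    n = str(size)
    return [
        # 1. Generate member job files in ~/umui_jobs (ocean dump varies: -o).
        ["ens_gen.py", "-o", template_run_id, n],
        # 2. Generate the start dumps from a single input dump.
        ["ens_dumps_from_single.py", ocean_dump, n, "perturb_sst"],
        # 3. Submit the first member as a test, then the remainder.
        ["ens_sub.py", "-t", "1", template_run_id],
        ["ens_sub.py", "-f", "2", template_run_id],
    ]

def run_all(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# Dry run: prints the four commands without executing anything.
run_all(workflow_commands("xabcd", 5, "xaaaao.da11110"))
```

Keeping the default as a dry run makes it easy to check the command lines before anything is submitted.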
Job naming
Because of pros and cons of different systems of job naming, the tool currently supports two different options.
The default (on pegasus) is to use a scheme I've called upperlast4, in which the last four letters of the run ID of the template job are turned into upper case and used as the experiment ID part (i.e. first 4 characters) of the run ID for the ensemble members. The 1-character job ID for the ensemble members is then a, b, c, etc, although this character set also goes on to upper case letters and numbers in the event of large ensembles. For example, if your template job is xabcd, then the ensemble members are ABCDa, ABCDb, etc. This should guarantee unique run IDs for all the jobs, provided that the first character of the experiment ID in your UMUI database is always the same (e.g. "x"), but might not do so if you also have e.g. a template yabcd.
If your UMUI database does not always have the same initial letter for experiment IDs, then you may prefer instead to use the following scheme (which I've called first4). In this case, the experiment ID for the ensemble members is the same as for the template -- the first 4 characters of the template run ID. The template is only allowed to be job ID "a" in the experiment, and in this case the ensemble members will have job IDs: b, c, etc. It is then up to you to avoid creating other UMUI jobs in the experiment which will conflict with the ensemble members. For example, if your template job is xabca then the members will be xabcb, xabcc etc. These will not exist in the database; you must remember not to create these same IDs in the database.
The default naming scheme can be set by editing the variable _defaultNameScheme at the top of common_functions.py. However, the command-line utilities will also all accept the "-N" flag to override this (see the usage messages for its argument). This may be useful if you are generating the start dumps on a different system from where you are running the UMUI, and the two UMensemble installations have been set to use different defaults.
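The two naming schemes can be summarised in a few lines of Python. This is a minimal sketch, not the code in common_functions.py: member_run_ids is a hypothetical function, and the ordering of the job ID character set (lower case, then upper case, then digits) is an assumption based on the description above.

```python
import string

# Assumed ordering of job ID characters: lower case, upper case, then digits.
JOB_ID_CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits

def member_run_ids(template, size, scheme="upperlast4"):
    """Return the run IDs of the ensemble members for a 5-character template run ID."""
    if len(template) != 5:
        raise ValueError("run ID must be 5 characters: %r" % template)
    if scheme == "upperlast4":
        # Last four characters, upper-cased, become the experiment ID.
        expt_id = template[1:].upper()
        job_ids = JOB_ID_CHARS
    elif scheme == "first4":
        # Template must be job "a"; members start at job "b".
        if template[4] != "a":
            raise ValueError("first4 requires the template to be job 'a'")
        expt_id = template[:4]
        job_ids = JOB_ID_CHARS[1:]
    else:
        raise ValueError("unknown naming scheme: %r" % scheme)
    if size > len(job_ids):
        raise ValueError("ensemble too large for this naming scheme")
    return [expt_id + job_id for job_id in job_ids[:size]]

print(member_run_ids("xabcd", 3))                   # ['ABCDa', 'ABCDb', 'ABCDc']
print(member_run_ids("xabca", 3, scheme="first4"))  # ['xabcb', 'xabcc', 'xabcd']
```

The sketch also shows the caveat with upperlast4: member_run_ids("yabcd", 3) returns the same IDs as member_run_ids("xabcd", 3), because the first character of the template is discarded.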
File naming
Where a start dump for a job is generated from a template filename, the corresponding filename used for the ensemble member has an underscore appended, followed by the one-character job ID of the ensemble member. For example, if the template filename is xaaaaa.da11110 then the members will use start dumps starting with xaaaaa.da11110_a (except with the "first4" scheme, where they start with xaaaaa.da11110_b).
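As a small illustration of this rule, the sketch below lists the member dump filenames for a template filename. The function member_dump_names is hypothetical, and the ordering of job ID characters (lower case, upper case, digits) is an assumption.

```python
import string

# Assumed ordering of job ID characters: lower case, upper case, then digits.
JOB_ID_CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits

def member_dump_names(template_filename, size, scheme="upperlast4"):
    """Append '_' plus the one-character job ID of each member to the template filename."""
    # With first4 the template itself is job "a", so members start at "b".
    start = 1 if scheme == "first4" else 0
    job_ids = JOB_ID_CHARS[start:start + size]
    return [template_filename + "_" + job_id for job_id in job_ids]

print(member_dump_names("xaaaaa.da11110", 2))
# ['xaaaaa.da11110_a', 'xaaaaa.da11110_b']
print(member_dump_names("xaaaaa.da11110", 2, scheme="first4"))
# ['xaaaaa.da11110_b', 'xaaaaa.da11110_c']
```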
Displaying usage messages
All of the utilities described below will give a usage message if invoked with an invalid command line. The simplest way to do this is to invoke them without any arguments at all. The description below is intended to point you to the relevant commands, but does not include a full list of supported flags for each command, because you can get that, as described.
Note also that some of the examples given in the usage messages assume the "upperlast4" naming scheme.
Generating ensemble members
The tool for generating ensemble members in ~/umui_jobs is ens_gen.py. You need to give it the template run ID and the ensemble size, and also the "-a" and/or "-o" flag, to specify that the atmos and/or ocean start dump vary between ensemble members according to a pattern, and/or the "-C" flag to specify that member information is to be read in from a config file (see below). If for example your template job (say xabcd) uses start dumps xaaaaa.da11110 and xaaaao.da11110, and you supply (only) the "-a" flag, then the atmos dump name will be treated as a template (giving actual start dumps xaaaaa.da11110_a etc), but the ocean dump name will be used as the literal name of the start dump in the ensemble members.
For example, run ens_gen.py -a xabcd 5 to generate a 5-member atmos-IC ensemble.
If your job has dump reconfiguration enabled for a dump which is to be varied between ensemble members, then the initial dump (input to the reconfiguration) is treated as a template filename, but the filename of the reconfigured dump is used as a literal filename.
Member config file
If you want to specify differences between ensemble members other than simply letting the start dumps vary according to a pattern generated from the template, you can do so using a config file, which you include with the option -C config_filename on the ens_gen.py command line.
This file should be in ".ini" file format, with sections [member1], [member2] etc, and variable=value pairs within each section. (See example below.)
The currently supported variables are:
- adump: Atmosphere start dump (full path).
- odump: Likewise for ocean.
- syear, smonth, sday, shour, sminute, ssecond: Start time of run.
- lyear, lmonth, lday, lhour, lminute, lsecond: Length of run.
(Note -- the date fields can all be abbreviated to the first four characters: syea etc.)
You must specify a section for each ensemble member, but it is only necessary to specify the fields which should be changed from the template job. For example, you do not need to specify the start second, minute etc if only the start year differs from the template.
It is permissible to use the "-a" and/or "-o" flags in addition to using a config file (e.g. if the start dump filenames are generated from a pattern but the start dates are to be specified), but where the "adump" / "odump" options are used these will take precedence.
Here is an example config file for a three-member ensemble with different start years and lengths, but the same finish date. It is permissible to use this config file to generate a two-member ensemble ([member3] section ignored), but not to generate a four-member ensemble ([member4] section missing).
[member1]
syear=1991
lyear=21
adump=/full/path/to/xaaaa.daj1c10
odump=/full/path/to/xaaao.daj1c10
[member2]
syear=1992
lyear=20
adump=/full/path/to/xaaaa.daj2c10
odump=/full/path/to/xaaao.daj2c10
[member3]
syear=1993
lyear=19
adump=/full/path/to/xaaaa.daj3c10
odump=/full/path/to/xaaao.daj3c10
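To check a config file before passing it to ens_gen.py, something like the following sketch could be used. It relies only on Python's standard configparser module; read_members and normalise_key are hypothetical helpers, not part of UMensemble, and the handling of four-character abbreviations is an assumption based on the note above rather than the tool's actual parsing code. The dump paths are omitted from the sample text for brevity.

```python
import configparser

EXAMPLE = """\
[member1]
syear=1991
lyear=21
[member2]
syea=1992
lyea=20
[member3]
syear=1993
lyear=19
"""

DATE_FIELDS = [prefix + unit
               for prefix in ("s", "l")
               for unit in ("year", "month", "day", "hour", "minute", "second")]

def normalise_key(key):
    """Map a full or four-character abbreviated field name to its full name."""
    if key in ("adump", "odump"):
        return key
    for full in DATE_FIELDS:
        if key == full or key == full[:4]:
            return full
    raise KeyError("unsupported variable: %r" % key)

def read_members(text, size):
    """Return one dict of overrides per member, requiring sections member1..memberN."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    members = []
    for n in range(1, size + 1):
        section = "member%d" % n
        if not parser.has_section(section):
            raise ValueError("missing section [%s]" % section)
        members.append({normalise_key(k): v for k, v in parser.items(section)})
    return members

# A two-member ensemble is fine: the [member3] section is simply ignored.
print(read_members(EXAMPLE, 2))
```

Asking for four members from this text raises ValueError, mirroring the rule that every member needs its own section.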
Submitting ensemble members
(Note - before you can usefully do this, ensure that you have generated the start dumps, as shown below.)
To submit all the members of an ensemble that you have generated with ens_gen.py, use ens_sub.py. At the simplest, you give it a single command-line argument which is the template runID (e.g. ens_sub.py xabcd), and it submits the whole ensemble consisting of whichever jobs matching the template it finds in your ~/umui_jobs.
You can also specify the range of jobs to be submitted with the "-f" (from) and "-t" (to) flags, the defaults for these being from 1 and to the last member. For example, you might test submitting the first job with ens_sub.py -t 1 xabcd, and then if that works, submit the remainder with ens_sub.py -f 2 xabcd. You can also submit a single member by putting its run ID in place of the template ID; any run ID that is not valid as a template will be treated as a single run ID to submit.
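The effect of the "-f" and "-t" flags can be pictured as selecting a 1-based, inclusive range from the list of members. The sketch below is illustrative only: select_members is a hypothetical function, and the member run IDs assume the "upperlast4" naming scheme.

```python
def select_members(run_ids, first=1, last=None):
    """Return members first..last (1-based, inclusive); last defaults to the final member."""
    if last is None:
        last = len(run_ids)
    return run_ids[first - 1:last]

members = ["ABCDa", "ABCDb", "ABCDc", "ABCDd", "ABCDe"]
print(select_members(members, last=1))   # like "-t 1": ['ABCDa']
print(select_members(members, first=2))  # like "-f 2": ['ABCDb', 'ABCDc', 'ABCDd', 'ABCDe']
```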
Generating start dumps
At this point, there are two utilities you can run, either ens_dumps_from_single.py or ens_dumps_from_multi.py, depending on whether you are using a single start dump or multiple start dumps to generate the set of start dumps. Each of these scripts is a generic wrapper that takes care of file naming etc, and calls other (helper) programs that do the real work. A few helper programs are currently provided for common cases, but this provides a framework to add more later.
Single input dump
A situation where you would use a single start dump is where different random SST perturbations are added to the same initial dump to produce the start dump for each member. The "real work" command underneath ens_dumps_from_single.py is perturb_sst, a script based on code supplied by Chunlei which adds a perturbation within +0.05 / -0.05C (hard-coded). To use it, you would type for example:
ens_dumps_from_single.py xaaaao.da11110 5 perturb_sst
for a 5-member ensemble. In that case you would use the "-o" flag on "ens_gen.py". (Also in this case you can add the "-m" flag on ens_dumps_from_single.py for efficiency; this makes the perturb_sst program generate all the output dumps in a single invocation, so that it only extracts the input SST once; see the usage message for an explanation.)
Note: the number for each member (1,2,...) is passed to perturb_sst, and is used as the initial random seed for the SST perturbation. This ensures that members will differ, but start dumps can be regenerated repeatably if required.
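The idea of seeding by member number can be illustrated with a few lines of Python. This is not the perturb_sst code: perturb_sst_sketch is a hypothetical stand-in that works on a plain list of temperatures, and the uniform distribution is an assumption (the source only states the +0.05 / -0.05 C bound).

```python
import random

def perturb_sst_sketch(sst_values, member, amplitude=0.05):
    """Add a random perturbation within +/- amplitude, seeded by the member number."""
    rng = random.Random(member)  # member number (1, 2, ...) is the seed
    return [t + rng.uniform(-amplitude, amplitude) for t in sst_values]

sst = [15.0, 15.5, 16.0]
first = perturb_sst_sketch(sst, 1)
again = perturb_sst_sketch(sst, 1)
other = perturb_sst_sketch(sst, 2)

print(first == again)  # same member number: identical dump can be regenerated
print(first != other)  # different member number: different perturbation
```

Using a separate random.Random instance per member keeps the result independent of any other random numbers drawn by the program.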
Multiple input dumps
You would use multiple start dumps if you already have a set of start dumps, but they need some operation performed in order to make the start dumps for the ensemble members. A particular situation is if you want to use atmospheric conditions which are dumps from successive days of another integration; these dumps all need the validity time set back to the start time in the template job. In this case, e.g. if you have dumps matching pattern xaaaaa.daj0c?0, you might run something like
ens_dumps_from_multi.py xabcda.daj0c10 xaaaaa.daj0c?0 : reset_time 1990 12 01
where xabcda.daj0c10 is the template for the output filenames, and reset_time is the program doing the real work. Note the ":" on the command line; this is necessary to show the end of the list of filenames, which you could also have listed rather than use a wildcard. Note also that (unlike with ens_dumps_from_single.py) the ensemble size is not specified, because it is implied by the number of input filenames matched. The "1990 12 01" are arguments to reset_time (with the obvious meanings). Of course in this case, you would use the "-a" flag on "ens_gen.py".
(Aside: if doing a short run in order to generate a number of daily start dumps, it may be advisable to turn off climate meaning in that run, so as to make the dumps smaller. Alternatively see http://home.badc.rl.ac.uk/iwi/um/utils.html#subset)
Another situation where you might use ens_dumps_from_multi.py is if you have multiple start dumps that need no further processing, but are not named systematically. Here you could use the simple_link helper program, which just creates symbolic links. For example:
ens_dumps_from_multi.py mydump foo bar baz : simple_link
This creates the symlinks mydump_a => foo, mydump_b => bar and mydump_c => baz.
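The command-line layout of ens_dumps_from_multi.py (output template, input filenames, ":" separator, then the helper command) and the resulting mapping of inputs to outputs can be sketched as follows. Both functions are hypothetical illustrations rather than the tool's own code. The sketch assumes that any wildcard has already been expanded by the shell, that inputs are taken in the order given, and that the "upperlast4" scheme is in use, so suffixes start at "a".

```python
import string

# Assumed ordering of job ID characters: lower case, upper case, then digits.
JOB_ID_CHARS = string.ascii_lowercase + string.ascii_uppercase + string.digits

def split_command_line(args):
    """Split arguments into (output template, input filenames, helper command)."""
    sep = args.index(":")  # raises ValueError if the ":" separator is missing
    template, inputs, helper = args[0], args[1:sep], args[sep + 1:]
    if not inputs or not helper:
        raise ValueError("need at least one input file and a helper command")
    return template, inputs, helper

def output_names(template, inputs):
    """Pair each input file with its output name; ensemble size = number of inputs."""
    return [(template + "_" + job_id, src)
            for job_id, src in zip(JOB_ID_CHARS, inputs)]

template, inputs, helper = split_command_line(
    ["mydump", "foo", "bar", "baz", ":", "simple_link"])
for out, src in output_names(template, inputs):
    print(out, "=>", src)
```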
Feature wish-list
Feel free to add to this, but also email me so that I am aware of it (AlanIwi)
- Currently all members have to have the same start date. Allow for start dates to be read in from a file. Also perhaps allow for corresponding dump filenames to be read in from a file, rather than generated from the template filename. This is for the sake of Doug's hindcast ensembles. In this case, maybe the runs need to have the same end date, and therefore we need to vary the run length similarly between ensemble members. DONE
- Features already added, noted here because they still need to be documented above:
- naming scheme for more jobs per ensemble
- support for ozone and sulphate ancil differences between ensemble members