Backups

-- AlanIwi - 22 Oct 2008

The quota-ed disk area on perseus may be useful to you for backing up certain files from gorgon / pegasus.

There is no cron job to do all the backups; you are responsible for your own user area. But to assist you, two utilities have been added, which you can use as the basis of making your own cron job.

  • do_backup
  • select_files

These work as follows. select_files will generate a list of files which should be backed up, by scanning your directories and interpreting files called .backups which give directions about what sort of files should be included or excluded. The utility do_backup takes such a list as input, and then actually writes the backups using rsync. It is a good idea to run select_files by itself first, directing the output to a file, to check that the file list produced by your .backups files looks reasonable (e.g. will fit within your quota on perseus).

select_files

-- AlanIwi - 22 Oct 2008

To run select_files on gorgon or pegasus, execute the following commands:

module add backups

select_files source_directory

The select_files utility will parse files called .backups in the top-level source directory and each sub-directory, and will write to standard output a list of files which are selected according to the rules specified in the .backups files.

To capture the output in a file for checking, execute a command like the following:

select_files source_directory > file_list.txt

Note that the output list will contain non-regular files (directories, symbolic links, etc.) if they match the rules, although symlinks will not be followed.
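
As a quick check of whether a saved file list is likely to fit within your quota, the sizes of the regular files in it can be added up. The following is only a rough sketch: total_bytes is a hypothetical helper name (it is not part of the backups module), and it assumes the list holds one path per line and that those paths are readable on the machine where you run it. A dry run of do_backup remains the more reliable check.

```shell
#!/bin/sh
# Hypothetical helper, not part of the backups module.
# Sum the sizes (in bytes) of the regular files named in a file list,
# one path per line, read from standard input.
total_bytes() {
    total=0
    while IFS= read -r path; do
        # Skip directories, symbolic links and other non-regular entries.
        if [ -f "$path" ] && [ ! -L "$path" ]; then
            size=$(wc -c < "$path")
            total=$((total + size))
        fi
    done
    echo "$total"
}

# Example use: total_bytes < file_list.txt
```

The result is in bytes, so it may need converting before you compare it with the figures reported by quota.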

do_backup

-- DanBretherton - 22 Oct 2008

To do backups of a particular disk area, execute the following commands on gorgon or pegasus.

module add backups

select_files source_directory | do_backup

e.g. select_files /home/users/foo/ | do_backup

This will back up those files within source_directory which are selected by select_files. It is important that source_directory is an absolute path (i.e. beginning with /) rather than a relative path. This is because of the way that do_backup arranges files in your backup directory. The paths to files in the backup partition are relative to /backup/perseus/users/$USER, where $USER is your user name. For example, the file /data/gorgon/users/$USER/.cshrc will be backed up to /backup/perseus/users/$USER/data/gorgon/users/$USER/.cshrc
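
The path mapping described above can be sketched as a small shell function. backup_path is a hypothetical name used only for illustration; it simply prepends the backup root to an absolute source path and refuses relative paths, which is the reason source_directory has to be absolute.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Print where a file is expected to land in the backup partition:
#   /backup/perseus/users/$USER + <absolute source path>
backup_path() {
    case "$1" in
        /*) printf '%s\n' "/backup/perseus/users/${USER}$1" ;;
        *)  echo "backup_path: need an absolute path: $1" >&2
            return 1 ;;
    esac
}

# Example use: backup_path "/data/gorgon/users/$USER/.cshrc"
```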

The behaviour of do_backup can be altered by supplying extra rsync options as command line arguments. Two extra arguments you might find useful are -v, which produces a verbose output listing every file that needs to be transferred, and --dry-run (or -n) which works out which files will be transferred and their total size without actually doing the transfer. To run do_backup with these options execute the following command.

select_files source_directory | do_backup -v --dry-run

The --dry-run option is useful for checking the size of your proposed backup operation before you do it, to make sure it will not exceed your disk quota on /backup/perseus. To find out what your quota is, simply execute the command quota. Note that the extra rsync options you supply as arguments to do_backup are added to the options that are already there by default, which are -a and --stats (see man rsync for details).

Once you have set up your .backups files (see next section) and performed a dry run of the backup, it is a good idea to set up a cron job to automatically back up your data every day. This is explained in the last section (Usage suggestions).

If the behaviour of do_backup doesn't suit your needs then you don't have to use it at all; it's just there for convenience. It is actually an embarrassingly simple Perl script, which simply pipes its standard input into rsync with some default rsync options. The default rsync options are all defined at the start of the script. This is the rsync command that is executed by default:

rsync -a --stats --password-file=/users/$USER/.rsync --files-from=- / perseus.nerc-essc.ac.uk::$USER

If you don't want to use the -a or --stats options you can either create your own version of do_backup with different default options or do away with it altogether. It is perfectly possible to execute a command like the following, without using do_backup at all.

select_files source_directory | rsync -a --stats --password-file=/users/$USER/.rsync --files-from=- / perseus.nerc-essc.ac.uk::$USER

This gives you the freedom to change all the rsync options as well as the source path (which is / by default in do_backup) and the destination. The paths in the file list supplied to rsync are all relative to the source path, which is why the source_directory argument to select_files must be an absolute path (i.e. relative to /) if you want to use do_backup. The destination used by do_backup specifies an rsync daemon module. The rsync daemon on perseus has a module for every cluster user, each corresponding to the directory /backup/perseus/users/$USER. It is important to use an rsync daemon for backing up large files because it greatly reduces the amount of network traffic involved. Please ask Dan to set up additional daemons if you would like to back up to a different partition or a different server.

Format of .backups files

-- AlanIwi - 22 Oct 2008

Basic layout

.backups should be an XML file, containing a single top-level element called ruleset. The ruleset element may contain a number of rule elements. Each rule element specifies the following (details below):

  • what files it matches
  • whether to select those files for backup or not

For each file found, the rules will be tested in sequence, and the first matching rule will determine whether the file is selected or not. If no matching rule is found, then the default is not to select the file.

Here are two equivalent examples of an XML file containing an empty ruleset. In both cases, no files in this directory will be selected, because of the default "don't select" behaviour.

  • <ruleset></ruleset>
    

  • <ruleset />
    

Rule syntax

Each rule element can have an attribute called select, which determines whether files which match the rule will be selected or not. This should have a boolean value (true or false, but other keywords such as "yes" and "no", "on" and "off", "1" and "0" are also accepted), and defaults to true.

An empty rule will match all files, so here is an example of a .backups file which will select all files:

  • <ruleset>
        <rule select="true" />
    </ruleset>
    

or, given that select defaults to true, this will also select all files:

  • <ruleset>
        <rule />
    </ruleset>
    

And here is one which will select no files (now with an explicit "don't select" rule matching all files, rather than just an empty ruleset):

  • <ruleset>
         <rule select="false" />
    </ruleset>
    

Selecting files by size

Rules can have attributes max_size and min_size, which restrict what files they match. The following units are accepted (case-insensitive):

  • b / byte
  • k / kb / kilobyte
  • m / mb / megabyte
  • g / gb / gigabyte

So here is a .backups file which would select all files below 10Mb.

  • <ruleset>
         <rule select="true" max_size="10Mb" />
         <rule select="false" />
    </ruleset>
    

(The second rule above could be omitted, because a file that matches no rule is not selected by default.)

or pretty much equivalently

  • <ruleset>
         <rule select="false" min_size="10Mb" />
         <rule select="true" />
    </ruleset>
    

All of the attributes of rule are also allowed to be sub-elements if you prefer, so for the above you would also be allowed:

  • <ruleset>
         <rule>
              <select>false</select>
              <min_size>10Mb</min_size>
         </rule>
         <rule>
              <select>true</select>
         </rule>
    </ruleset>
    

Selecting files by regular expression

Rules can also contain sub-elements called match containing regular expressions of files to match. Although an empty rule will match all files, as soon as it has any match elements, one of these regexps must match or the rule won't match. So here is an example which will select everything except for ".pyc" and ".o" files:

  • <ruleset>
         <rule select="false">
            <match>\.pyc$</match>
            <match>\.o$</match>
         </rule>
         <rule select="true" />
    </ruleset>
    

(match is allowed to be an attribute rather than a sub-element if you only have one regexp, but this is not recommended.)

Rules can also contain no_match sub-elements. Any of these regexps matching will stop a rule from matching a file. So the following does the same as the above example:

  • <ruleset>
         <rule select="true">
            <no_match>\.pyc$</no_match>
            <no_match>\.o$</no_match>
         </rule>
         <rule select="false" />
    </ruleset>
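
Before relying on a match or no_match regexp, it can help to try it against a saved file list. The sketch below uses grep -E for this. It rests on two assumptions that this page does not confirm: that the regexps are tested against the file name rather than the full path (suggested by patterns such as ^core), and that the pattern behaves the same way in grep's extended regexp syntax as it does in select_files, which should hold for simple patterns like the ones above. preview_match is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Read a file list (one path per line) from standard input and print the
# paths whose base name matches the given extended regular expression.
preview_match() {
    regex=$1
    while IFS= read -r path; do
        name=${path##*/}    # strip everything up to the last slash
        if printf '%s\n' "$name" | grep -Eq -- "$regex"; then
            printf '%s\n' "$path"
        fi
    done
}

# Example use: preview_match '\.pyc$' < file_list.txt
```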
    

Subdirectories

Ruleset propagation

When a valid .backups file is found in a given directory (e.g. your top-level directory), the default behaviour with subdirectories is to apply the same ruleset also to files in any subdirectory.

(Note that select_files must actually have been run on the higher-level directory. It will never scan directories above the given top-level directory looking for .backups files to apply.)

Ruleset overriding

Propagation of the ruleset from a parent directory can normally be overridden by putting another .backups file in the sub-directory, containing the ruleset to use for files in the sub-directory. So for example, if you have:

  • Parent directory:
    <ruleset>
         <rule select="true" max_size="10MB" />
         <rule select="false" />
    </ruleset>
    

  • Sub-directory:
    <ruleset>
         <rule select="true" />
    </ruleset>
    

then only files under 10MB will be selected, except in the subdirectory (and its subdirectories), where all files will be selected.

Preventing overriding

A ruleset can have an attribute called allow_override, which defaults to true, but if set to false will disable processing of any .backups files in sub-directories. A useful special case of this is to totally exclude all files under a certain directory:

  • <ruleset allow_override="false">
         <rule select="false" />
    </ruleset>
    

  • or just
    <ruleset allow_override="false" />
    

In this special case, select_files does not even bother to scan below this directory, because there can never be any files to select. This gives efficiency savings compared to the same ruleset without allow_override="false", where select_files must still recursively scan subdirectories for .backups files which might override it, even if it doesn't find any.

Ruleset merging

When a .backups file is found in a subdirectory, its ruleset will normally completely replace the ruleset used in the parent directory. It is, however, possible to insert the rules from the parent ruleset into the current ruleset by means of the inherit element. This should be placed as an empty element at the point in the ruleset where the parent rules are to be inserted. This is normally near the end.

If for example you have:

  • <ruleset>
         <rule select="true">
              <match>\.f$</match>
         </rule>
         <rule select="false">
              <match>^core</match>
         </rule>
         <inherit />
    </ruleset>
    

this would ensure that all ".f" files are selected and no core files are, but otherwise apply whatever ruleset has been specified in the parent directory.

(Note that if no higher-level .backups files have been encountered, inherit will still be considered valid but will do nothing.)

Usage Suggestions

-- DanBretherton - 22 Oct 2008

Once you have set up your .backups files and performed a dry run to make sure your backup will not exceed your disk quota, it is a good idea to set up a cron job to run the backup once a day. To set up a cron job you must edit your crontab file, using the command crontab -e. Unfortunately this means using the vi editor, unless you specify your favourite editor in the environment variable VISUAL. In my crontab file on pegasus I have the following lines, which back up my data on gorgon, pegasus and perseus every evening, at 18:01, 20:02 and 22:03 respectively.

1 18 * * * /usr/local/Cluster-Apps/backups/bin/select_files /data/gorgon/users/dab | /usr/local/Cluster-Apps/backups/bin/do_backup -v > /home/users/dab/do_backup-gorgon.log
2 20 * * * /usr/local/Cluster-Apps/backups/bin/select_files /data/pegasus/users/dab | /usr/local/Cluster-Apps/backups/bin/do_backup -v > /home/users/dab/do_backup-pegasus.log
3 22 * * * /usr/local/Cluster-Apps/backups/bin/select_files /data/perseus/users/dab | /usr/local/Cluster-Apps/backups/bin/do_backup -v > /home/users/dab/do_backup-perseus.log

Notice that the output of each backup command is directed to a log file in my home directory. This allows me to check occasionally that the backup is being performed correctly. Notice also that all the paths are absolute paths. This is because the cron daemon does not execute a user's .cshrc or .bashrc script before executing the commands in their crontab file. In other words, your crontab commands cannot make use of your environment variables and aliases.
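
If you would rather keep your crontab lines short, the pipeline can be moved into a small wrapper script that cron calls by its absolute path. The following is only a sketch under stated assumptions: run_backup and the variables SELECT_FILES and DO_BACKUP are hypothetical names, and the default paths are the ones used in the crontab lines above. Unlike a plain > redirection, it also sends error messages to the log and records the exit status of do_backup.

```shell
#!/bin/sh
# Hypothetical cron wrapper, for illustration only.
# The defaults are the absolute paths used in the crontab lines above.
SELECT_FILES=${SELECT_FILES:-/usr/local/Cluster-Apps/backups/bin/select_files}
DO_BACKUP=${DO_BACKUP:-/usr/local/Cluster-Apps/backups/bin/do_backup}

# run_backup SOURCE_DIR LOG_FILE
# Back up SOURCE_DIR, writing both standard output and standard error
# to LOG_FILE (the log is overwritten on each run).
run_backup() {
    source_dir=$1
    log_file=$2
    {
        echo "Backup of $source_dir started: $(date)"
        "$SELECT_FILES" "$source_dir" | "$DO_BACKUP" -v
        # The status of a pipeline is that of its last command.
        echo "do_backup exit status: $?"
    } > "$log_file" 2>&1
}

# Example use: run_backup /data/gorgon/users/dab /home/users/dab/do_backup-gorgon.log
```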

I thought it might be useful to show another couple of .backups file examples. The .backups file I use in each of my cluster homes (i.e. the users/dab directories) is shown below.

<ruleset>
     <rule select="false">
        <match>\.iso$</match>
        <match>\.dimg$</match>
        <match>\.dimgproc.+</match>
        <match>^enact.*\.nc$</match>
        <match>^ENACT.*\.nc$</match>
     </rule>
     <rule select="true" />
</ruleset>

This selects all files except those ending in ".iso" or ".dimg", files ending in ".dimgproc" plus one or more other characters, and netCDF files beginning with "enact" or "ENACT". In each of the directories I use for model testing output I have a .backups file containing the following lines, which prevent any files in those directories and their sub-directories from being selected.

<ruleset>
     <rule select="false" />
</ruleset>

You will notice that the amount of data in your backup area increases with time, because files that are deleted in your working directories are not deleted from your backup directories. This safeguards against accidental deletion, but it does mean that your backup area will need periodic maintenance in order to clear out files that are no longer needed.
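
As a starting point for this maintenance, the following sketch lists backup copies whose original file no longer exists. list_stale is a hypothetical name, and the function assumes that both the backup area and the original directories are visible on the machine where you run it. It only prints candidates and deletes nothing. Files that still exist but are no longer selected by your .backups rules are not reported.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# list_stale BACKUP_ROOT
# Print every regular file under BACKUP_ROOT whose original is gone.
# The original path is the backup path with the root prefix removed,
# following the layout /backup/perseus/users/$USER/<absolute source path>.
list_stale() {
    backup_root=${1%/}    # drop a trailing slash, if any
    find "$backup_root" -type f | while IFS= read -r copy; do
        original=${copy#"$backup_root"}
        [ -e "$original" ] || printf '%s\n' "$copy"
    done
}

# Example use: list_stale "/backup/perseus/users/$USER" > stale_list.txt
```

Reviewing the list by hand before removing anything keeps the safeguard against accidental deletion intact.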

Topic revision: r9 - 23 Oct 2008 - 11:22:30 - AlanIwi
 