Data storage and transfer

Avoiding I/O performance problems

To reduce the impact of unresponsive filesystems, interactive work, including filesystem browsing and editing files, should be done using applications running on your office computer wherever possible, or on the Suns in the computer lab (e.g. atlas and hydra). This avoids the effects of high disk I/O and CPU load on the data servers, and takes advantage of NFS client metadata caching to improve performance. Only large data processing and transfer tasks should be performed when logged on to the servers themselves, and they should follow the guidelines given in this article.

It is also important to keep all your editable files in your ESSC home directory, and definitely not on the same storage volumes as the large binary data sets (e.g. /data/perseus). The ESSC home directories are accessible under /users on all the cluster-related servers (gorgon, pegasus etc.). If you are used to finding certain files in the /data and /nemo directories you can create symbolic links to their new locations in /users.

To avoid the effects of heavy server load altogether it is a good idea to edit files on ESSC computers (the office desktop PCs and the Suns in the computer lab) instead of on gorgon, pegasus etc. All the ESSC Linux desktop PCs should have vi, gedit and kwrite installed; if not, please ask. MATLAB can be run on the ESSC computers too, although they are not all powerful enough to run scripts for processing large data sets. All the cluster servers' data is mounted on the ESSC computers as /data/pegasus etc., and the NEMO filesystem on romulus (i.e. /nemo/romulus on the cluster servers) is mounted as /data/nemo on ESSC computers. Remember that you will not get the full benefit of migrating to ESSC computers for editing unless you actually move your files into your ESSC home directory.

Another advantage of moving editable files off the heavily loaded cluster servers is that the ESSC homes are backed up regularly. It is important to remember that files on gorgon, pegasus, perseus, romulus and remus are not backed up unless you do it yourself. You can set up automatic backups by following the instructions on the wiki at http://www.resc.rdg.ac.uk/twiki/bin/view/GCEP/Backups.


To minimise the effects on other users please use ionice (see "Running data processing applications and utilities") for all large data processing and transfer tasks. Using ionice really does make a difference, but it does not eliminate slow response times completely. This is mainly because interactive processes that have been idle for some time can take a minute or two to wake up and assert their higher disk priority, but after that they keep their high priority as long as they remain in frequent use. An example of this is "ls -l" taking a minute or two to respond; once it has responded you can usually enter the same command again without delay. A similar effect can be seen when using an editor that saves to disk only occasionally. I understand that waiting two minutes for a response from the filesystem can be frustrating, but without ionice the delay could be more like twenty minutes or even longer.
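
As a rough illustration of the advice above, the sketch below wraps an arbitrary command in ionice. The function name run_low_io and the script name in the usage comment are hypothetical, and the scheduling class and level (-c2 -n7, the lowest best-effort priority) are an assumption rather than the settings recommended in "Running data processing applications and utilities", which should take precedence.

```shell
# Run a command at low disk I/O priority when ionice is usable, and run it
# unchanged otherwise. Class 2 (best-effort) with level 7 (lowest) is an
# assumed setting, not necessarily the one recommended for the cluster.
run_low_io() {
    if command -v ionice >/dev/null 2>&1 && ionice -c2 -n7 true 2>/dev/null; then
        ionice -c2 -n7 "$@"
    else
        "$@"
    fi
}

# Example (hypothetical script and file names):
#   run_low_io ./process_data.sh input.nc
```

The probe with "true" is there so that the wrapper still runs the command on a machine where ionice is missing or not permitted.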

Even if interactive response times are not excessive, the time taken to run data processing tasks can vary a lot from day to day. This is the result of variations in load, and is inevitable when only one server has a significant amount of free space (i.e. it is often the case that many people need to use that particular server at the same time). When a server's disks are busy, delays to data processing tasks are more likely to be caused by write operations than by read operations, because write operations are usually much slower. If you would like to understand why a particular server seems slower than usual on a particular day, the "top" utility can be used to show how many processes are sharing the resource. To show waiting processes, launch top, highlight the sort field by pressing 'x' then 'b', move the sort field left to 'S' (for "state") by pressing the '<' character and press 'R' (capital R not lower case) to reverse the sort. This will show you all the processes that are spending most of their time in the 'D' state, i.e. waiting for I/O. The more processes that are in state 'D', the slower each one will be.
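
For a quick, non-interactive view of the same information, the output of ps can be filtered for processes in state 'D'. This is only a sketch: the function name filter_waiting is made up for illustration, and it assumes a Linux ps that accepts the state, pid, user and comm output fields.

```shell
# Keep only the lines whose first field (the process state) starts with D,
# i.e. processes in uninterruptible sleep, which usually means waiting for I/O.
filter_waiting() {
    awk '$1 ~ /^D/'
}

# Example: list the waiting processes, then count them.
#   ps -eo state,pid,user,comm | filter_waiting
#   ps -eo state,pid,user,comm | filter_waiting | wc -l
```

Unlike top, this gives a single snapshot, so it is worth repeating the command a few times before drawing conclusions.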

Storage volumes

The tables below describe the servers and storage volumes associated with the cluster.

| Host name | Description | Rsync access from outside Reading firewall (host:port) |
| gorgon | Cluster head node. Use for file browsing and editing and submitting and monitoring batch jobs. Do not run CPU or disk intensive applications. | Via pegasus |
| pegasus | | NOC - nautilusa:3000, PML - medusa1:3000, POL - livgen:3000 |
| perseus | | NOC - nautilusa:3100, PML - medusa1:3100, POL - livgen:3100 |
| romulus | | NOC - nautilusa:3200, PML - medusa1:3200, POL - livgen:3200 |
| remus | | NOC - nautilusa:3300, PML - medusa1:3300, POL - livgen:3300 |
| behemoth | Dedicated server for Godiva2 data visualisation service | |

| Host name | Storage volume | Local path | Network path | Size | Comment |
| gorgon | First | /home | /data/gorgon | 3.0 TB | Home directories |
| pegasus | First | /local | /data/pegasus | 5.5 TB | |
| perseus | First | /local | /data/perseus | 7.8 TB | |
| perseus | Second | /local2 | /backup/perseus | 5.0 TB | Backups |
| romulus | First | /local | /data/romulus | 3.3 TB | |
| romulus | Second | /local2 | /nemo/romulus | 15.0 TB | NEMO users only. No quotas. |
| remus | First | /local | /data/remus | 16.0 TB | |
| romulus and remus | First | | /backup/cluster | 3.0 TB | A storage volume spanning both servers (work in progress). Uses 0.5 TB on romulus and 2.5 TB on remus. Will initially be used to replace the backup partition on perseus. |
| behemoth | First | /local | Not exported via NFS | 3.5 TB | Only accessible via FTP or rsync |

Please note the following:

  • Only the paths beginning with /data are visible on the main ESSC network filesystem.
  • The path /home refers to the same partition on gorgon as /data/gorgon. This is true whether you are logged on to gorgon or to one of the other servers.
  • On all servers except gorgon, if you are logged on to a server named ${SERVER} the path /local refers to the same partition as /data/${SERVER}.
  • If you are logged on to perseus the path /local2 refers to the same partition as /backup/perseus.
  • If you are logged on to romulus the path /local2 refers to the same partition as /nemo/romulus.

Quotas

To find your user disk quotas for all partitions on a particular machine, log on to that machine and enter the command "quota -slu". Here is an example of the output from this command, entered by user dab on pegasus.

Disk quotas for user dab (uid 28001):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
      /dev/sdb1  19756M       0  21611M            1647       0       0

On filesystems where no quotas are enforced you can check your usage by using the repquota command instead of quota. For example, the following command lists filesystem usage for all users on the filesystem mounted as /local.

/usr/sbin/repquota -s -u /local

When using repquota, the local paths (i.e. /local, /local2 etc.) must be used instead of the network paths (i.e. /data/pegasus etc.). To find which local path corresponds to a network path on a particular machine you can use a command like the following.

[dab@pegasus ~]$ mount | grep "/data/pegasus"
/local on /data/pegasus type none (rw,bind)

The output of that particular command on pegasus shows that the local path "/local" corresponds to network path "/data/pegasus".
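
The lookup can also be wrapped in a small helper that prints just the local path, so that it can be passed straight to repquota. This is a sketch only: the function name local_path_for is hypothetical, and it assumes mount output in the "X on Y type ..." form shown above.

```shell
# Read "mount" output on standard input and print the local path that is
# mounted on the given network path (the third field is the mount point).
local_path_for() {
    awk -v target="$1" '$2 == "on" && $3 == target { print $1 }'
}

# Example:
#   mount | local_path_for /data/pegasus
#   /usr/sbin/repquota -s -u "$(mount | local_path_for /data/pegasus)"
```

Matching on the whole mount point field avoids the partial matches that a plain grep could return for similarly named paths.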

There are several important fields in each record in the output from quota:

  • Filesystem: This is identified by the device name not the mount point. To find the mount point corresponding to a device name enter the command "mount".
  • blocks: This is the amount of data you have in the partition (i.e. on the file system). If you use the -s option as in the example above, the amount of data will be in sensible (i.e. human readable) units such as megabytes (identified by 'M') or gigabytes ('G').
  • limit: There are two of these, the first for amount of data and the second for number of files. These columns give your hard limits, the maximum amount of data and number of files you are allowed to keep on the filesystem. Note that a zero in these columns means unlimited.
  • files: The number of files that belong to you on the filesystem. Please do not let the number of files get too large. If there are more than about half a million separate files in a filesystem the impact on performance can be serious. There are no quotas on numbers of files at the moment but please try to keep the number belonging to you below 100,000.
  • The "quota" (soft limit) and "grace" columns are not used at the moment because we are not using soft quotas.
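
The "files" column gives only a total per filesystem. To see which of your directories holds most of the files, something like the sketch below could be used. The function name count_files and the path in the usage comment are hypothetical. Walking a large directory tree is itself a metadata-heavy operation, so it is best run only on your own directories and not while the server is busy.

```shell
# Print the number of regular files below a directory, without crossing
# into other filesystems (-xdev).
count_files() {
    find "$1" -xdev -type f | wc -l | tr -d ' '
}

# Example (hypothetical path):
#   count_files /local/users/dab/WORK
```

File names that contain newline characters would be counted more than once, which should not matter for a rough check against the 100,000 guideline.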

The output from repquota is slightly different. Here is an example command showing the first few lines of output (which then continued to list all the other users).

[dab@romulus ~]$ /usr/sbin/repquota -s -u /local2
*** Report for user quotas on device /dev/mapper/volgroup1-logvol2
Block grace time: 7days; Inode grace time: 7days
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
nobody    --     345       0       0              3     0     0
root      --    4783       0       0             22     0     0

There are two sets of columns: "Block limits" on the left shows the amount of data used ("used" column) and the quota ("hard" column), and "File limits" on the right shows the same information for the number of files.

To find your group quotas for all partitions on a particular machine, log on to that machine and enter the command "quota -slg". This lists quotas for all the groups you are a member of. To list quotas for a specific group (e.g. nemo) enter the command "quota -slg ${GROUP}", where ${GROUP} represents the group name. Alternatively, you can use repquota with the -g option instead of -u to show group information.

If you leave out the -l option in the quota command it will attempt to get quota information from all the NFS servers with data mounted on the machine you are logged on to. This can be useful, but it can take a long time if one or more machines (which you may not be interested in) are slow to respond. Also bear in mind that the remote quota daemons may not all be running.

Data transfer guidelines

Disk intensive processes, performing tasks such as data transfer and processing, can have a significant effect on the performance of computer systems. When a storage volume is being heavily used the effects are most noticeable to users running interactive processes such as text editors and command shells. On the cluster's storage servers, responsiveness can be reduced to the extent that interactive work is impossible when disk intensive processes are running. The high input-output (I/O) load on the disks associated with the cluster's batch jobs (when they are reading and writing data via NFS) is unavoidable, but other tasks can be performed in a way that significantly lessens their impact on interactive users. A common type of disk intensive task is data transfer, including transfer between servers, transfer between partitions on the same RAID volume and copy operations between directories on the same partition. Some guidelines for performing data transfer in a disk friendly way are given below. Data processing operations are dealt with in a separate topic.

Please don't use cp or mv for large transfer operations, "large" being more than a few gigabytes. The mv command should be avoided in any case, unless you are moving data to a different directory in the same file system on the same partition, because the results of an unexpected interruption (a power cut, for example) can be unpredictable. Running cp with ionice (see "Running data processing applications and utilities") helps only if you are copying data to a different location on the same host; if you are copying between two different hosts, the NFS traffic will still have a significant impact on the remote host. It would not be wise to run NFS with ionice because this would also affect interactive work such as editing and shell commands, and could also slow down batch jobs running on the cluster's slave nodes.

The best alternative to cp is rsync, which will be described later, but FTP can also be used. A popular FTP client is ncftp, the only FTP client I know about that supports recursive directory transfer. See "man ncftp" for usage instructions. The ncftp client does transfer symbolic links, but only if they point to locations contained within the directory tree being transferred. If you want to transfer a directory tree containing links to other locations then rsync is a better option. All rsync and FTP transfers are run with ionice by default, so it is not necessary to include "ionice" in data transfer commands.

Introduction to rsync

The powerful rsync utility is my preferred method of data transfer. Not only is it useful as a disk friendly alternative to cp over NFS, but it can also greatly assist data transfer to and from remote hosts outside the Reading University campus firewall. To those new to rsync it can seem rather cumbersome compared to cp or FTP, but once you are familiar with the syntax and the important options it can be used just as easily. Apart from reducing the impact of data transfers on other users, the benefits of using rsync include increased reliability (compared to cp or scp) and the ability to resume an interrupted transfer. Full usage instructions can be found in the man pages (i.e. by entering the command man rsync).

To avoid using NFS for rsync transfers always specify files and directories on a remote host using an rsync URL rather than as an NFS path. The only rsync daemon that can't be used by ordinary users is gorgon's, because it is set up exclusively for the cluster's slave node booting process. To transfer data to or from gorgon using rsync you must be logged on to gorgon. The other servers' rsync daemons have a set of modules named ${USER}_local, where ${USER} represents the user name. There is a ${USER}_local module for each user, which corresponds to path /local/users/${USER}. In addition to the ${USER}_local modules, romulus has a ${USER}_nemo module corresponding to the /nemo/romulus/users/${USER} directory. Perseus also has a module named ${USER} allocated to each user, for use by the backup system. To list the rsync modules available on the rsync daemon running on machine ${HOST_NAME}, simply enter the following command.

rsync rsync://${HOST_NAME}
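
Given the module naming scheme described above, the rsync URL for a path can be assembled mechanically. The helper below is a sketch for illustration only; the function name rsync_url and its argument order are made up here and are not part of rsync.

```shell
# Build an rsync URL for a per-user module.
#   $1 host name, $2 user name, $3 module suffix (local or nemo),
#   $4 path inside the module
rsync_url() {
    printf 'rsync://%s/%s_%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example: user dab's NEMO area on romulus.
#   rsync_url romulus dab nemo NEMO/DEFAULT_ORCA025
```

The result can be used as either the source or the destination of an rsync command.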

Some examples of rsync usage specific to the ESSC cluster are given below.

Using rsync for data transfer within ESSC (on the same server and between two different servers)

Here is an example of a command for transferring some netCDF files from a directory on gorgon to the /nemo/romulus partition on romulus. The command was executed by user dab on gorgon.

rsync -avh --password-file=$HOME/.rsync --progress *.nc rsync://romulus/dab_nemo/NEMO/DEFAULT_ORCA025

This assumes that the rsync password is in a file called .rsync in the home directory. Please let me know if you do not have a .rsync file in your home directory. If a command like the above is interrupted, whether accidentally or deliberately, the transfer can be resumed simply by issuing the same command again. You can also do this to check that a large transfer was successful, even if it appears to have finished normally the first time. Adding the "--dry-run" option lists all the files that will be transferred and their total size, without actually transferring any data. It is a good idea to do a dry run before a large transfer to make sure the destination directory has space.
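
Because an interrupted transfer is resumed by issuing the same command again, the re-issuing can be automated with a small retry loop. This is a sketch under stated assumptions: the function name retry_transfer and the RETRY_DELAY variable are hypothetical, and the default delay of 60 seconds is arbitrary.

```shell
# Run a command repeatedly until it succeeds or the attempt limit is reached.
#   $1 maximum number of attempts; remaining arguments are the command.
# RETRY_DELAY (seconds between attempts, default 60) is a made-up setting.
retry_transfer() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt of $max_attempts failed" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "${RETRY_DELAY:-60}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example, using the transfer command shown above:
#   retry_transfer 5 rsync -avh --password-file=$HOME/.rsync --progress *.nc \
#       rsync://romulus/dab_nemo/NEMO/DEFAULT_ORCA025
```

A loop like this only helps with transient failures such as dropped connections; a wrong password or a full destination will fail on every attempt.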

Note that the -h option, which gives the post transfer report in human readable units, is not available with older versions of the rsync client such as those found on many of the ESSC desktop PCs.

Here is another example. The command below transfers the contents of directory /data/romulus/users/rim/WORK/ORCA025 on romulus to /nemo/romulus/users/rim/WORK/ORCA025, a directory on another partition on romulus. This command was executed by user rim on romulus.

rsync --password-file=$HOME/.rsync -avh --progress rsync://romulus/rim_local/WORK/ORCA025/ /nemo/romulus/users/rim/WORK/ORCA025

In this example, the source is specified by an rsync URL instead of the destination.

Using rsync for data transfer between the ESSC cluster and remote hosts outside the Reading University firewall

I have been using rsync instead of scp or sftp for transfers between ESSC and other NERC centres. So far rsync seems to be more reliable, with the added advantage of being able to resume interrupted transfers or repeat a transfer command to make sure all files were transferred correctly the first time. I have set up rsync-only accounts for various external groups including CPOM at UCL (part of NCEO), POL, PML and NOC. It is safe to share rsync-only account details with groups of external colleagues because the rsync accounts do not correspond to usable shell accounts at ESSC (and the rsync passwords are different anyway). Please ask if you would like to set up a data transfer arrangement with an external group that is not mentioned above.

There are two ways to access rsync at ESSC through the Reading University firewall.

  1. Via SSH tunnels. I have set up SSH tunnels between the cluster's servers and the following NERC centres: POL, PML and NOC. This allows direct access to rsync on pegasus, perseus, romulus and remus from any of these sites. Host names and port numbers to be used at these sites are given in the table at the start of this article. One advantage of using tunnels is that it allows direct transfers between servers that are not accessible via SSH from outside. For example, I can transfer data using rsync between romulus and one of the NOC cluster head nodes. The only other way to do this using scp or sftp would be via hydra using NFS, but external SSH access to hydra may not always be available. I can set up tunnels to any external system where I have a shell account; please ask me if you want me to do this for another site not listed above, as long as someone there is willing to set up an account for me. Alternatively I can give instructions for setting up and maintaining SSH tunnels yourself using your own user accounts. Please ask me for more details.
  2. Through an encrypted firewall port open to behemoth. IT Services have agreed to open a firewall port for a single rsync daemon as long as the network traffic is encrypted. Unfortunately rsync does not support encryption, so the only way to access behemoth's rsync daemon from outside without an SSH tunnel involves an encryption utility called Stunnel. The Stunnel daemon on behemoth decrypts the encrypted rsync traffic before sending it on to pegasus. This option can be used to give rsync access to external collaborators who do not have user accounts at ESSC, and where ESSC users do not have accounts on the collaborators' system (which would enable SSH tunnels to be set up). The CPOM group at UCL is an example. Stunnel is not difficult to install and set up; I will write some instructions on the wiki when I get time. The rsync daemon on pegasus has ${USER}_local modules for read and write access to each user's local storage on pegasus, but also gives read-only access to most of the other data at ESSC via NFS. The read-only modules are called data, home and nemo. The use of NFS means that data transfer through pegasus' rsync daemon is not disk friendly for transfers involving data on servers other than pegasus. If this facility starts to be used frequently I may have to think again about how to allow data transfer from outside Reading University in cases where SSH tunnels cannot be used.
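
For readers curious about what the first option involves, the sketch below prints one common form of SSH reverse tunnel command. It is not a description of the tunnels actually in use at ESSC: the function name tunnel_command is hypothetical, the account in the usage comment is made up, and port 873 is simply the default rsync daemon port, assumed here because the article does not say which port the daemons listen on.

```shell
# Print (do not run) an ssh command that would make an rsync daemon on the
# local network reachable through a port on a remote host.
#   $1 listening port on the remote host
#   $2 host running the rsync daemon (assumed to use the default port 873)
#   $3 account on the remote host, as user@host
tunnel_command() {
    printf 'ssh -N -R %s:%s:873 %s\n' "$1" "$2" "$3"
}

# Example (hypothetical account name):
#   tunnel_command 3200 romulus dab@nautilusa
```

Note that by default sshd accepts connections to a forwarded port only from the remote host itself; other machines at the remote site can use the port only if the GatewayPorts option is enabled there.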

It is possible to browse the file systems at ESSC via rsync, although browsing is not as easy as it is with sftp. The contents of a directory can be listed by issuing an rsync command without any file selection options (e.g. -avh) or destination. The following example, with its corresponding output, is a directory listing command that was issued by a user at NOC using the "dab" rsync account.

nautilusb:/fibre/dab/NEMO/FORCING_ORCA025/ERAINT $ rsync --password-file=$HOME/.rsync rsync://dab@nautilusa:3200/dab_nemo/grex/main/
drwxrwsr-x 96 2009/05/10 21:47:10 .
drwxrwsr-x 144 2008/09/25 17:43:37 NEMO
drwxrwsr-x 480 2009/05/10 21:50:02 wd

Note the trailing forward slash at the end of the path. Without it the command just returns the directory name rather than its contents. The host at NOC from which rsync daemons can be accessed is called nautilusa, and the port number for romulus is 3200. After several more commands like this, it is possible to find the right directory and then proceed to transfer some or all of the files in it. The examples below show the final directory listing command followed by a file transfer command.

nautilusb:/fibre/dab/NEMO/FORCING_ORCA025/ERAINT $ rsync --password-file=$HOME/.rsync rsync://dab@nautilusa:3200/dab_nemo/grex/test/NEMO/PWD_ORCA025/forcing/*_1992.nc
-rw-rw-r-- 94226844 2009/05/10 23:05:25 precip14_ERAINT-ORCA025_1992.nc
-rw-r--r-- 82448620 2009/05/10 22:29:53 precip_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 31 2009/05/10 23:05:25 precip_core_1992.nc
-rw-r--r-- 8633468280 2009/05/11 01:33:44 q2_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 25 2009/05/11 01:33:45 q2_core_1992.nc
-rw-r--r-- 2167201408 2009/05/10 23:25:31 qlw_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 26 2009/05/10 23:25:34 qlw_core_1992.nc
-rw-r--r-- 2167201404 2009/05/10 23:29:22 qsw_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 26 2009/05/10 23:29:32 qsw_core_1992.nc
-rw-rw-r-- 94226832 2009/05/11 13:04:35 snow14_ERAINT-ORCA025_1992.nc
-rw-r--r-- 82448608 2009/05/11 12:24:16 snow_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 29 2009/05/11 13:04:35 snow_core_1992.nc
-rw-r--r-- 8633468272 2009/05/11 02:25:39 t2_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 25 2009/05/11 02:25:40 t2_core_1992.nc
-rw-r--r-- 8633468228 2009/05/11 22:20:22 u10_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 26 2009/05/12 11:46:57 u10_core_1992.nc
-rw-r--r-- 8633468228 2009/05/11 22:20:24 v10_ERAINT-ORCA025_1992.nc
lrwxrwxrwx 26 2009/05/12 11:49:59 v10_core_1992.nc

sent 64 bytes received 759 bytes 1.65K bytes/sec
total size is 39.22G speedup is 47656897.86

nautilusb:/fibre/dab/NEMO/FORCING_ORCA025/ERAINT $ rsync -avh --password-file=$HOME/.rsync rsync://dab@nautilusa:3200/dab_nemo/grex/test/NEMO/PWD_ORCA025/forcing/*_1992.nc .

Another useful rsync feature is the ability to compare source and destination copies of files by calculating their checksums. This can be achieved by using the -c option instead of, or in addition to, other file selection options such as -a. In the example below, which was issued by a user at NOC, some netCDF files at ESSC are compared to copies that were previously transferred to NOC. The --dry-run option is used so that the command only lists files that would have been transferred (i.e. those whose source and destination checksums differ).

rsync --dry-run -c -v --password-file=$HOME/.rsync rsync://dab@nautilusa:3200/dab_nemo/grex/test/NEMO/PWD_ORCA025/forcing/*_ERAINT-ORCA025_1990.nc .

In this example the file sizes and modification times are ignored, because -c is the only file selection option used. Had the -a option been used (as in all the previous examples) the file q2_ERAINT-ORCA025_1990.nc would have been selected for transfer because the modification time of the NOC version is more recent. However, with the -c option the file was not selected for transfer because the ESSC and NOC versions of the file are identical. This checksum facility is useful in situations where a corrupted file is suspected of causing a model to crash, even though all the files used by the model appear to have been transferred correctly based on their file sizes and modification times. Using rsync with the -c option can identify which file or files (if any) were not transferred correctly the first time.

Note that using the -c option can be very time consuming, particularly if the files being compared are large. It is not necessary to use the -c option in order to prevent file modification times from being used in the file selection algorithm. Using the --size-only option in conjunction with other file selection options such as -a removes modification time from the file selection algorithm. This can be useful when using rsync to compare source and destination copies of a directory that was originally transferred with scp (without the -p option) or another file transfer utility that does not preserve modification times. Without the --size-only option, rsync would decide that every file in the destination directory needed to be transferred again, even if only a few files in the source directory were different or new.
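
The --size-only comparison described above can be sketched as a small helper. The function name list_size_differences and the RSYNC variable (which only exists so that a different rsync binary can be substituted) are hypothetical; the rsync options themselves are the ones discussed in this article.

```shell
# Name of the rsync program; override to use a different binary.
RSYNC=${RSYNC:-rsync}

# List files that differ in size between source and destination, ignoring
# modification times, without transferring anything.
#   $1 source, $2 destination; further arguments are passed on to rsync.
list_size_differences() {
    src=$1
    dest=$2
    shift 2
    "$RSYNC" -av --size-only --dry-run "$@" "$src" "$dest"
}

# Example, based on the earlier romulus transfer:
#   list_size_differences rsync://romulus/rim_local/WORK/ORCA025/ \
#       /nemo/romulus/users/rim/WORK/ORCA025 --password-file=$HOME/.rsync
```

Removing --dry-run from the command would perform the transfer of the listed files.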

-- DanBretherton - 06 Jul 2009

Topic revision: r17 - 07 May 2010 - 12:11:13 - DanBretherton
 