Running a memory test on a node
Two popular memory testing utilities are installed on the cluster: Memtest86 and Memtest86+. To run a memory test with either of these, the node has to boot with the corresponding image. All the boot image files available for the nodes are in directory /slave/images/default-image/boot on gorgon. The two we are interested in are memtest_3.2 and memtestplus_1.65.
The nodes are set up to boot over the network from gorgon using PXELINUX, an implementation of the Intel PXE network booting specification. The choice of image used to boot the nodes is controlled by the files in directory /slave/images/default-image/boot/pxelinux.cfg. Usually there will only be one file in this directory, called "default", which is used by all the nodes if there are no node-specific files in the directory. Below are the contents of "default" at the time of writing.
DISPLAY bootoptions.txt
TIMEOUT 4000
#SERIAL 0 115200
LABEL linux
KERNEL vmlinuz
IPAPPEND 1
#APPEND init=/linuxrc initrd=initrd console=tty0 console=ttyS0,115200
APPEND init=/linuxrc initrd=initrd
LABEL H8DAE
KERNEL memdisk
APPEND initrd=sm_h8dae_50135_3.img
LABEL MEMTEST
KERNEL memtest_3.2
LABEL MEMPLUS
KERNEL memtestplus_1.65
DEFAULT linux
This file is structured in a similar way to the boot menu files in Lilo and Grub, with a list of named options and a default selection at the bottom. In the above example the default option is "linux", which boots the current Linux kernel. The easiest way to make a node boot with memtest_3.2 or memtestplus_1.65 is to edit the file "default" and change the bottom line to either "DEFAULT MEMTEST" or "DEFAULT MEMPLUS". When the node is restarted it will start running the memory tests automatically.
The only way to see the results of the tests is to plug a monitor into the node and view the output. The test utilities do not write any information to the hard disk, and no information is available over the network. Note that (at the time of writing) there is a ten-minute delay in the boot process caused by an interesting feature in the BIOS code. Just after the power is switched on, the BIOS performs its usual cursory memory check and then appears to go into suspended animation for ten minutes for no apparent reason.
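The edit to the DEFAULT line described above could also be scripted. The following is only a minimal sketch, not a tool that exists on the cluster: the function name set_default_label and the shortened SAMPLE text are made up for illustration, and it assumes the keywords and label names appear in the file exactly as in the listing above.

```python
def set_default_label(config_text, label):
    """Return config_text with its DEFAULT line pointing at `label`.

    Hypothetical helper: it works on the text of a pxelinux.cfg file
    and does not touch any file on disk.
    """
    lines = config_text.splitlines()

    # Collect the names defined by "LABEL <name>" lines so that a
    # mistyped label is rejected instead of producing an unbootable menu.
    known = set()
    for line in lines:
        words = line.split()
        if len(words) >= 2 and words[0].upper() == "LABEL":
            known.add(words[1])
    if label not in known:
        raise ValueError(f"no LABEL {label!r} in config")

    # Rewrite the existing DEFAULT line, or append one if it is missing.
    out = []
    replaced = False
    for line in lines:
        words = line.split()
        if words and words[0].upper() == "DEFAULT":
            out.append(f"DEFAULT {label}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"DEFAULT {label}")
    return "\n".join(out) + "\n"


# Shortened copy of the "default" file shown above, for demonstration only.
SAMPLE = """\
LABEL linux
KERNEL vmlinuz
LABEL MEMTEST
KERNEL memtest_3.2
LABEL MEMPLUS
KERNEL memtestplus_1.65
DEFAULT linux
"""

print(set_default_label(SAMPLE, "MEMTEST").splitlines()[-1])  # DEFAULT MEMTEST
```

Calling the same function with "linux" restores the normal boot selection once the tests have started.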
It is important to remember to change the default boot image setting back to "linux" after the node or nodes have started their tests. To avoid changing (and perhaps forgetting to change back) the "default" file, it is possible to select one of the options from the above menu using the keyboard before the boot process starts. To go into manual selection mode, the Caps-Lock key has to be pressed at a certain moment just after power-on, usually while the screen is still blank. If the BIOS starts its memory check before you have pressed Caps-Lock then you are probably too late, but there is no way of knowing until after the ten-minute sleep. If you were successful with the Caps-Lock moment then the menu options in "default" can be selected using the up and down arrow keys, but the time-out period is only about a minute, so you can't go away and do something useful during the ten-minute pause. This can be very tedious indeed, especially if you find out that you didn't press Caps-Lock at the right time.
Another way to select the boot image for a particular node is to put a node-specific file in directory /slave/images/default-image/boot/pxelinux.cfg. The name of this file is the hexadecimal representation of the node's IP address, which can be obtained using the gethostip command. For example, the command "gethostip node004" produces the following output.
node004.beowulf.cluster 10.141.0.4 0A8D0004
The last field in this line of output is what we want. To select the boot image for node004, create a file called 0A8D0004 in /slave/images/default-image/boot/pxelinux.cfg (with "touch 0A8D0004" for example) and then copy the contents of "default" to this file. Change the bottom line of 0A8D0004 to select the desired option from the menu (e.g. "DEFAULT MEMTEST" if you want to boot the memtest_3.2 image) and then restart the node.
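The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than an installed script: ip_to_pxelinux_name and write_node_config are hypothetical names, the DEFAULT keyword is assumed to be written in upper case as in the listing above, and the demonstration runs in a temporary directory instead of the real pxelinux.cfg.

```python
import ipaddress
import tempfile
from pathlib import Path


def ip_to_pxelinux_name(ip):
    """Return an IPv4 address as 8 upper-case hex digits.

    This is the same value as the last field printed by gethostip.
    """
    return f"{int(ipaddress.IPv4Address(ip)):08X}"


def write_node_config(cfg_dir, ip, label):
    """Copy "default" to a node-specific file with DEFAULT set to `label`.

    Hypothetical helper: cfg_dir stands in for the pxelinux.cfg directory.
    """
    cfg_dir = Path(cfg_dir)
    lines = (cfg_dir / "default").read_text().splitlines()
    # Only the DEFAULT line changes; every other line is copied as-is.
    new_lines = [
        f"DEFAULT {label}" if line.split()[:1] == ["DEFAULT"] else line
        for line in lines
    ]
    target = cfg_dir / ip_to_pxelinux_name(ip)
    target.write_text("\n".join(new_lines) + "\n")
    return target


# Demonstration in a throwaway directory with a cut-down "default" file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "default").write_text(
        "LABEL linux\nKERNEL vmlinuz\nLABEL MEMTEST\nKERNEL memtest_3.2\nDEFAULT linux\n"
    )
    node_file = write_node_config(tmp, "10.141.0.4", "MEMTEST")
    print(node_file.name)                          # 0A8D0004
    print(node_file.read_text().splitlines()[-1])  # DEFAULT MEMTEST
```

Because "default" is only used when no node-specific file exists, deleting the node-specific file after the tests should return that node to the normal boot selection without touching "default".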
--
DanBretherton - 17 Jul 2009