Setting up Postfix on the cluster
This section concerns e-mail communication between gorgon and the slave nodes relating to Sun Grid Engine (SGE) batch jobs. The -M and -m options for qsub can be used to tell SGE to e-mail one or more users about a batch job when it changes state. The e-mail is sent by the node controlling the job, not by gorgon, so the slave nodes have to be able to send messages to mail accounts on gorgon. The address to use is ${user}@master.beowulf.cluster, or simply ${user}@master. Neither the host name "gorgon" nor the nerc-essc.ac.uk domain can be used in addresses, because both relate to the network interface on gorgon that is connected to the outside world, not the interface connected to the cluster's internal network. External (i.e. non-cluster) e-mail accounts cannot be reached.
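As an illustration of the -M and -m options, here is a minimal sketch of a job script that uses SGE's embedded "#$" directives instead of passing the options on the qsub command line. The user name "dab" and the choice of "bea" (mail at the beginning, end and abort of the job) are only examples, and the echo line stands in for the real work of the job.

```shell
#!/bin/sh
# Embedded SGE directives: lines starting with "#$" are read by qsub
# and are treated as ordinary comments by the shell.
# Send job mail to the cluster-internal address (example user "dab").
#$ -M dab@master.beowulf.cluster
# Mail at job begin (b), end (e) and abort (a).
#$ -m bea

# Placeholder for the real work of the job.
MESSAGE="Job running on $(uname -n)"
echo "$MESSAGE"
```

The same effect should be obtained with "qsub -M dab@master -m bea" followed by the name of the script.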
IMPORTANT WARNING: Do not be tempted to set up Postfix on gorgon to use mercury as the relay host for external e-mail, for example to enable SGE messages to be sent to users' external addresses. The main reason not to do this is that Postfix sends all non-local messages via the relay host, including mail intended for the slave nodes (which is usually SGE related). The slave nodes are on the cluster's internal network and cannot be reached from mercury. Until recently, Postfix on gorgon was actually set up to use mercury as a relay host, and this worked for a while without any trouble. The slave nodes were able to send messages to external addresses using gorgon as their relay host, but at some point something changed in the mail server configuration on mercury that prevented these messages from being sent. This probably happened when odin (ESSC's new main server) became the new mercury. Since the change, mercury no longer accepts incoming messages from hosts in the beowulf.cluster domain, and messages sent from hosts in this domain are bounced.
With mercury as gorgon's relay host, SGE related messages from slave nodes now fail to reach their destinations, and this can cause a more serious problem that affects mercury. The real problem is that the bounce notification messages sent back to the slave nodes by mercury cannot be delivered because of their unrecognised "From" addresses (e.g. dab@node001.beowulf.cluster), which results in a large number of messages accumulating in the mail file belonging to user "operator" on mercury and a lot of associated CPU load. This is suspected of having contributed to one or more of mercury's recent crashes.
Changing the domain used for e-mail on the cluster to nerc-essc.ac.uk didn't help, because some system generated e-mail addresses are derived from the Linux domain (beowulf.cluster). Changing the Linux domain as well as the e-mail domain for gorgon and the slave nodes might work but would be a lot of hassle, and there might still be e-mail trouble because mercury doesn't recognise host names such as node001.nerc-essc.ac.uk. There are other plausible solutions too, such as forwarding to pegasus (which is able to send via mercury), transport tables, or modifying senders' addresses to something recognisable by mercury (by putting entries in the server_canonical.db file on each machine). However, after trying in vain for several hours to enable SGE messages to be sent externally via mercury without generating bounces, I decided to return Postfix to its original, cluster only configuration.
The original Postfix configuration (as delivered by ClusterVision) didn't seem to work. The solution I eventually found involved changing the Postfix configuration on the slave nodes as well as on gorgon. The only file that needed to be changed on each machine was main.cf. Some brief notes on the key parameters in this file are given below.
- inet_protocols: Should be set to "ipv4" to avoid frequent warning messages in /var/log/mail.
- inet_interfaces: Set to "all"
- myhostname: Set to a fully qualified host name using the internal cluster name and domain of the machine in question, using "master" rather than "gorgon" and "beowulf.cluster" instead of "nerc-essc.ac.uk".
- mydomain: Set to "beowulf.cluster"
- relayhost: Leave blank, i.e. "relayhost=", or comment out the line completely. The original main.cf on the slave nodes used gorgon as the relayhost.
- disable_dns_lookups: Set to "yes" to avoid unsuccessful searches for internal cluster host names.
- mynetworks: I think this only comes into play if the host is being used as someone else's relay host. Just in case, the current working configuration has it set to "10.141.0.0/16, 127.0.0.0/8" on gorgon, undefined on the slave nodes.
- mynetworks_style: This parameter is overridden by the mynetworks parameter and is only used if the machine is a relay host. Just for the record, the current working configuration has it set to "subnet" on gorgon and "host" on the slave nodes.
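Putting the notes above together, the relevant part of main.cf on gorgon might look something like the sketch below. This is reconstructed from the notes rather than copied from the live file, and the location /etc/postfix/main.cf is an assumption. Note that main.cf does not allow a comment on the same line as a setting, so comments are on lines of their own.

```
# /etc/postfix/main.cf on gorgon (sketch; file location is an assumption)
inet_protocols = ipv4
inet_interfaces = all
# Internal cluster name and domain, not gorgon.nerc-essc.ac.uk
myhostname = master.beowulf.cluster
mydomain = beowulf.cluster
# No relay host: mail stays within the cluster
relayhost =
disable_dns_lookups = yes
mynetworks = 10.141.0.0/16, 127.0.0.0/8
mynetworks_style = subnet
```

On a slave node the differences, according to the notes above, would be a myhostname such as node001.beowulf.cluster, no mynetworks line, and mynetworks_style set to "host".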
Here are some other points of interest.
- Some of the parameters in main.cf are defined twice, because the files appear to have been automatically modified by a SuSE configuration utility at some stage.
- Messages sent to root on the slave nodes are automatically forwarded to root on gorgon (but I can't work out how this is done). The same is not true for ordinary users, but as far as I can gather, the only messages ever sent to ordinary users on slave nodes are bounce notifications. Therefore, provided Postfix and SGE are set up correctly, there should never be a need for users to see mail sent to slave nodes.
- Mail can be sent to external e-mail addresses from pegasus, but unfortunately this facility can not be used for SGE messages because these only go between gorgon and the slave nodes. Users can arrange for all e-mail sent to themselves on pegasus to be automatically forwarded to one or more external addresses, by putting the external addresses in a file called .forward in their home directory.
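As a rough sketch of the .forward arrangement on pegasus mentioned in the last point, the commands below create the file with two forwarding addresses, one per line. The addresses are placeholders, and FORWARD_FILE is just a shell variable used here for convenience. Note that the redirection overwrites any existing .forward file.

```shell
# Location of the forward file (defaults to the user's home directory).
FORWARD_FILE="${FORWARD_FILE:-$HOME/.forward}"

# Write one external address per line (placeholder addresses).
# ">" replaces any existing contents of the file.
printf '%s\n' 'someone@example.org' 'someone.else@example.com' > "$FORWARD_FILE"

# Show the result.
cat "$FORWARD_FILE"
```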
--
DanBretherton - 17 Jul 2009