Troubleshooting rogue processes

SGE jobs sometimes leave behind rogue processes, which continue to run outside the control of SGE after the job has finished. This can happen when a job crashes, or after a job has been deleted by the user or by SGE (for example, after exceeding its time limit). Rogue processes can be suspected in the following situations:
  • Ganglia shows that a node has had a load level of 1 for more than a few minutes.
  • A job crashes because there aren't enough Myrinet endpoints on a particular node.
  • A job is terminated by SGE because it has exceeded its run time limit, even though similar jobs normally finish comfortably within that time.
If you suspect that there are rogue processes on a node, the first thing to do is look at the full process listing with "ps aux". Don't rely on top alone: by default it sorts by CPU usage and shows only the first screenful, so idle processes are easily missed. Rogue processes can be holding onto Myrinet endpoints without consuming any CPU time. The command "qhost -j" gives detailed information about each host, including the jobs it is running, if any. Comparing the two should enable you to tell what should and should not be running on any given node.
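As a rough illustration of the check described above, the following Python sketch filters "ps aux" output for one user's processes, including those using no CPU at all. It is not one of the cluster's scripts; the function names and the user name "alice" in the usage note are made up for the example, and it assumes the usual 11-column layout of "ps aux".

```python
import subprocess


def user_processes(ps_output, user):
    """Return (pid, cpu_percent, command) for each process owned by `user`.

    `ps_output` is the text printed by `ps aux`, whose columns are assumed
    to be USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    """
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the command itself may contain spaces
        if len(fields) < 11 or fields[0] != user:
            continue
        found.append((int(fields[1]), float(fields[2]), fields[10]))
    return found


def list_my_processes(user):
    """Run `ps aux` on the local node and filter the output for `user`."""
    output = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return user_processes(output, user)
```

Calling list_my_processes("alice") on a node returns every process owned by that (hypothetical) user, whether or not it is consuming CPU time. Anything in the list that does not belong to a job shown by "qhost -j" is a candidate rogue process.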

If you suspect that one of your jobs has left rogue processes behind, it is a good idea to check all the nodes it was running on. Some of the shell scripts described under "Additional command utilities" in the Usage Guidelines section of this wiki are useful for finding and deleting rogue processes. Use which-nodes.sh and check-nodes.sh to find out where your job was running and which of your processes are still running on those nodes. If you have forgotten the SGE job number of a job that has finished, you can find it in the output from "qstat -s z", which gives details of all jobs that ran in the past two days. The terminate_method.sh script can be used to delete all your processes on all the nodes where your suspect job was running.

The terminate_method.sh script deletes jobs using a three-stage process. Firstly, it attacks the user's processes on the main node controlling the job, killing them with SIGTERM via kill-my-procs.sh. This approximates what SGE normally does. Next, terminate_method.sh runs kill-my-procs.sh with SIGTERM on the other nodes where the job is running. By this stage the Myrinet signal passing mechanism has usually already cleaned up the other nodes, but there are occasionally processes left to kill. Finally, after a pause to allow the SIGTERM to take effect, terminate_method.sh goes round again using SIGKILL instead of SIGTERM. Nothing could survive this; I'm thinking of changing the name of the script to domestos.sh ("... kills all known germs, dead."). A log file containing informative output is copied to the user's home directory.
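The SIGTERM, pause, SIGKILL sequence can be sketched in Python as follows. This is not the terminate_method.sh script itself, whose contents are not reproduced here; it only shows the escalation on a single node for a given list of PIDs, and the function name and default grace period are assumptions made for the example.

```python
import os
import signal
import time


def terminate_then_kill(pids, grace_seconds=5.0):
    """Send SIGTERM to every PID, wait, then SIGKILL whatever is left.

    Returns the PIDs that still existed after the grace period and were
    therefore sent SIGKILL.
    """
    def send(pid, sig):
        try:
            os.kill(pid, sig)
            return True
        except ProcessLookupError:
            return False  # the process has already gone

    for pid in pids:
        send(pid, signal.SIGTERM)

    # Give well-behaved processes time to trap SIGTERM and exit cleanly.
    time.sleep(grace_seconds)

    # Signal 0 only checks that the PID still exists; nothing is delivered.
    survivors = [pid for pid in pids if send(pid, 0)]
    for pid in survivors:
        send(pid, signal.SIGKILL)
    return survivors
```

The sketch only makes sense for your own processes, since signalling another user's PID raises a permission error. Covering the other nodes of a job would mean running the same thing on each of them in turn, as the script does with kill-my-procs.sh.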

Myrinet is supposed to take care of job deletion itself, without the need for tight integration with SGE. The Myrinet processes that control a parallel job (mpirun.ch_mx.pl, I think) are responsible for passing on any signals they receive to the other processes associated with the job on all the nodes where the job is running. There are usually two instances of mpirun.ch_mx.pl associated with every job, both of which run on the main node, i.e. the controlling node that is specified in the output of qstat. In theory, all that needs to be done to delete a job is to send a kill signal like SIGTERM (No. 15) to the mpirun.ch_mx.pl processes on the main node. This is what SGE does, I believe. This mechanism does seem to be working; our problems might be caused by processes that do not respond to the SIGTERM signal properly. If this is the case, rogue processes might still occur even if we had tight integration with SGE. Incidentally, I tried using SIGKILL (No. 9) instead of SIGTERM but it didn't work; qdel always left some processes running. This might be because SIGKILL cannot be trapped, so mpirun.ch_mx.pl would die before it could pass the signal on; SIGTERM can be trapped by processes, allowing them to exit gracefully.
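The difference between the two signals can be checked directly. The short Python sketch below (the function name is made up for the example) tries to install a handler for a signal and reports whether the operating system allows it. It says nothing about how mpirun.ch_mx.pl is actually written; it only shows the general rule that SIGTERM can be trapped and SIGKILL cannot.

```python
import signal


def can_be_trapped(sig):
    """Return True if this process is allowed to install a handler for `sig`."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
    except OSError:
        return False  # the kernel refuses handlers for SIGKILL and SIGSTOP
    if previous is not None:
        signal.signal(sig, previous)  # restore whatever was installed before
    return True
```

On Linux, can_be_trapped(signal.SIGTERM) returns True and can_be_trapped(signal.SIGKILL) returns False. A controlling process that receives SIGKILL therefore never gets the chance to forward anything to the processes it manages.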

So what can we do to prevent rogue processes being left behind when jobs are deleted? One approach is to change the way that SGE deletes jobs. This behaviour is configurable via the terminate_method queue attribute. It can be set to a value that represents a signal such as SIGTERM, as it is at the moment. Alternatively, it is possible to specify a command to be executed instead of SGE's own delete mechanism. I have set up a test queue called SGE_Testing.q, which uses terminate_method.sh for job deletion. This means that the qdel command actually invokes terminate_method.sh. To use this queue, add "-l sge_t" to the qsub command. I think that terminate_method.sh needs a lot more testing before being applied to all the queues. There are bound to be situations I haven't thought of that cause it not to work properly. In the meantime, I suggest that we delete jobs with a terminate_method.sh command instead of using qdel. Alternatively, the script could be used to clean up after a qdel or after a time limit deletion.

-- DanBretherton - 17 Jul 2009

Topic revision: r1 - 17 Jul 2009 - 17:33:30 - DanBretherton
 