Correcting Myrinet mapper problems
If you suspect that Myrinet problems are causing jobs to crash on a particular node, the first thing to do is to check that the Myrinet card in that node can see all the other nodes. The output of the command mx_info should look like this:
MX Version: 1.1.1
MX Build: root@gorgon:/usr/src/packages/BUILD/mx-1.1.1 Wed Mar 1 12:45:12 CET 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 132.9 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address: 00:60:dd:48:0e:6b
Product code: M3F-PCIXD-2 V2.2
Part number: 09-03034
Serial number: 267207
Mapper: 00:60:dd:48:13:6f, version = 0x003873b8, configured
                                 ROUTE COUNT
INDEX  MAC ADDRESS        HOST NAME  P0
-----  -----------        ---------  ---
   0)  00:60:dd:48:0e:6b  node005:0  1,1
   1)  00:60:dd:48:0e:c8  node001:0  1,1
   2)  00:60:dd:48:0e:9b  node003:0  1,1
   3)  00:60:dd:48:0e:e2  node007:0  1,1
   4)  00:60:dd:48:0f:07  node009:0  1,1
   5)  00:60:dd:48:0e:d8  node011:0  1,1
   6)  00:60:dd:48:13:6b  node013:0  1,1
   7)  00:60:dd:48:0e:89  node015:0  1,1
   8)  00:60:dd:48:0e:e6  node002:0  5,2
   9)  00:60:dd:48:0e:ca  node006:0  6,2
  10)  00:60:dd:48:0e:81  node010:0  6,2
  11)  00:60:dd:48:0e:d1  node004:0  6,2
  12)  00:60:dd:48:0e:e3  node012:0  4,2
  13)  00:60:dd:48:0e:9a  node014:0  6,2
  14)  00:60:dd:48:0e:9f  node016:0  5,2
  15)  00:60:dd:48:13:6f  node008:0  5,2
If one or more of the nodes are missing from the list in this output then something might have gone wrong with the FMA mapper (whatever that is). This often happens when nodes are restarted. Being able to ping other nodes tells you nothing about Myrinet connectivity: ping traffic goes over Ethernet, not Myrinet, so the only way to test Myrinet connectivity is with mx_info.
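Comparing the mx_info list against the full set of nodes by eye is error-prone on a larger cluster, so the check can be scripted. The following is a minimal Python sketch, assuming the route table rows look like the sample above; the function names mapped_hosts and missing_hosts and the list of expected nodes are illustrative and not part of MX.

```python
import re

# Matches route-table rows such as "8) 00:60:dd:48:0e:e6 node002:0 5,2".
# The row layout is assumed from the sample mx_info output shown above.
ROW = re.compile(
    r"^\s*\d+\)\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+(\S+?):\d+\b"
)


def mapped_hosts(mx_info_output):
    """Return the set of host names listed in the mx_info route table."""
    hosts = set()
    for line in mx_info_output.splitlines():
        match = ROW.match(line)
        if match:
            hosts.add(match.group(2))
    return hosts


def missing_hosts(mx_info_output, expected):
    """Return the expected host names that mx_info does not list, sorted."""
    return sorted(set(expected) - mapped_hosts(mx_info_output))


# Small demonstration with a shortened table in which node003 is absent.
sample = """\
0) 00:60:dd:48:0e:6b node005:0 1,1
1) 00:60:dd:48:0e:c8 node001:0 1,1
8) 00:60:dd:48:0e:e6 node002:0 5,2
"""
print(missing_hosts(sample, ["node001", "node002", "node003", "node005"]))
```

In practice the text passed in would be the captured output of mx_info on the node under suspicion, and the expected list would name every node on the Myrinet fabric.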
The Myrinet mapping problem can usually be rectified as follows.
- Log in to the node as root with rsh.
- Start qmon and remove the node you are working on from the host list @allhosts. To do this select "Host Configuration", go to the "Host Groups" tab, select @allhosts and click the "Modify" button. Select the offending node from the list and delete it using the bin-shaped button. When you have finished press "Done".
- Check to see if any jobs are running on the node using the command "qhost -j". If a job is running on the node you must wait for it to finish before proceeding any further.
- Run the mx_start_mapper script and wait for a few minutes. Try mx_info again to see if all the nodes are listed.
- If not, the next thing to try is explicitly stopping the mapper processes by running the mx_stop_mapper script.
- Do "ps aux | grep mx" to make sure that no Myrinet-related processes are still running; if there are any, kill them.
- Run the mx_start_mapper script, then wait a minute or so for it to do its stuff.
- Run mx_info again to check that all the nodes are listed.
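The mapper part of the steps above (start, wait, check, then stop and start again if needed) can be sketched in Python as follows. This is only an illustration under several assumptions: it is run as root on the affected node after the node has been taken out of @allhosts and its jobs have finished, the mx_ scripts are on root's PATH, and a 120 second pause is long enough. The helper names run_command, all_mapped and restore_mapping are made up for this sketch, and the "ps aux" check for leftover processes is deliberately left as a manual step.

```python
import subprocess
import time


def run_command(name):
    """Run a command found on PATH and return its standard output as text."""
    result = subprocess.run([name], capture_output=True, text=True, check=False)
    return result.stdout


def all_mapped(mx_info_output, expected):
    """True if every expected host appears as ' name:' in the mx_info output."""
    return all(f" {name}:" in mx_info_output for name in expected)


def restore_mapping(expected, run=run_command, wait=time.sleep, pause=120):
    """Try the two mapper restarts described above; return True on success.

    `pause` (seconds) is an assumed value standing in for "a few minutes".
    """
    # First attempt: start the mapper and give it time to settle.
    run("mx_start_mapper")
    wait(pause)
    if all_mapped(run("mx_info"), expected):
        return True
    # Second attempt: stop the mapper explicitly, then start it again.
    # (Checking for and killing stray mx processes is not automated here.)
    run("mx_stop_mapper")
    run("mx_start_mapper")
    wait(pause)
    return all_mapped(run("mx_info"), expected)
```

Calling restore_mapping(["node001", "node002"]) would run the real commands, so the run and wait arguments exist mainly to allow a dry run with stand-in functions first.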
If mx_start_mapper doesn't find all the other nodes, another option is to run mx_stop_mapper again, and then launch mx_mapper in the background with the command "mx_mapper &". The mx_info and mx_mapper commands are in the executables path for root on all the nodes; if you want to run mx_info as an ordinary user you can find it in directory /usr/local/Cluster-Apps/mx/mx-current/bin/.
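For scripts that may run either as root or as an ordinary user, the lookup of mx_info can be wrapped in a small helper. This Python sketch simply tries PATH first and otherwise falls back to the directory named above; find_mx_info is a hypothetical name, and the sketch does not check that the fallback file actually exists.

```python
import shutil

# Directory given in the text above for users without mx_info on their PATH.
MX_BIN_DIR = "/usr/local/Cluster-Apps/mx/mx-current/bin"


def find_mx_info(which=shutil.which):
    """Return the path to mx_info: PATH first, then the MX bin directory."""
    found = which("mx_info")
    if found:
        return found
    return f"{MX_BIN_DIR}/mx_info"
```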
If none of these procedures fixes the problem, it is likely to be caused by one of the other nodes. It is important to remember that Myrinet mapping works in both directions, and the two directions can fail independently. For example, if node001 can see node002 through Myrinet, it does not necessarily follow that node002 can see node001. If the node you are working on consistently fails to map to one particular node, then try performing the above procedures on the other node too, even if mx_info on the other node shows full connectivity. Remember to take nodes out of the @allhosts host list before running any "mx_" commands.
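Because the mapping is not necessarily symmetric, it can help to collect the mx_info host list from several nodes and look for pairs where only one side sees the other. A minimal Python sketch of that comparison, assuming the host lists have already been gathered into a dictionary (the name one_way_links and the data layout are illustrative only):

```python
def one_way_links(visible):
    """Return sorted (a, b) pairs where a lists b but b does not list a.

    `visible` maps each node name to the set of host names that mx_info
    reported on that node. Nodes with no collected list are skipped.
    """
    pairs = []
    for a, seen in visible.items():
        for b in seen:
            if a != b and b in visible and a not in visible[b]:
                pairs.append((a, b))
    return sorted(pairs)


# node001 lists node002, but node002 lists only itself.
visible = {
    "node001": {"node001", "node002"},
    "node002": {"node002"},
}
print(one_way_links(visible))  # [('node001', 'node002')]
```

Each pair reported this way points at a node (the second one in the pair) that is worth putting through the mapper restart procedure.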
If none of these tricks solves the problem, a suite of Myrinet test utilities can be found in directory /usr/local/Cluster-Apps/mx/mx-current/bin/tests, which contains an informative README file. A good one to start with is mx_peek_test, which tests communication between two nodes. To avoid affecting other jobs while these communication tests are being carried out, check that neither of the two nodes involved has anything running on it, and remove both from the SGE host list to prevent new jobs from being submitted to them. This is necessary because the tests take up Myrinet endpoints.
When the Myrinet mapping has been restored, remember to add all the nodes back to the @allhosts host list.
--
DanBretherton - 17 Jul 2009