Tuesday, April 13, 2010

mpirun and "You may set your LD_LIBRARY_PATH to have the location of the shared libraries ......" issues

The Scenario:
I encountered the error below while running mpirun. Running "pbsnodes -l" suggested that every node was online, so I first suspected my $LD_LIBRARY_PATH. After some exhaustive checking, I realised that communication with one of our nodes was the real problem. Here are the steps I took to solve the issue.

A daemon (pid 16704) died unexpectedly with status 127 while attempting to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.

The error message points at LD_LIBRARY_PATH, but that may or may not be the real cause.

Step 1: Check whether it is an LD_LIBRARY_PATH issue on your head node and compute nodes
First, check whether LD_LIBRARY_PATH is blank or filled with the correct paths on both your head node and your compute nodes.
$ echo $LD_LIBRARY_PATH
/usr/local/lib:/opt/intel/Compiler/11.1/069/lib/intel64 .....
If everything looks normal, proceed to Step 2.
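
To make the Step 1 check a little less manual, here is a minimal sketch of a helper that walks a colon-separated library path and flags entries that are missing. The function name check_lib_path is my own invention for illustration and is not part of Open MPI or Torque.

```shell
#!/bin/sh
# Print each entry of a colon-separated library path and flag
# entries that do not exist as directories.
# check_lib_path is a hypothetical helper name used for this sketch.
check_lib_path() {
    path_value=$1
    if [ -z "$path_value" ]; then
        echo "EMPTY: LD_LIBRARY_PATH is not set"
        return 1
    fi
    status=0
    old_ifs=$IFS
    IFS=:                      # split the value on colons
    for dir in $path_value; do
        if [ -d "$dir" ]; then
            echo "ok: $dir"
        else
            echo "MISSING: $dir"
            status=1
        fi
    done
    IFS=$old_ifs               # restore normal word splitting
    return $status
}
```

On the head node you would call it as check_lib_path "$LD_LIBRARY_PATH". For a compute node, ssh node01 'echo $LD_LIBRARY_PATH' (node01 is a placeholder host name) prints the value that a non-interactive ssh session sees, which may differ from what you get in a login shell because different startup files can be read.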

Step 2: Check whether mpirun can be executed cleanly
$ mpirun -np 32 -hostfile hostfilename openmpi-with-intel-hello-world
  1. hostfilename lists the host names of all the compute nodes
  2. openmpi-with-intel-hello-world is the compiled MPI program

Step 3: If the error still remains
Edit hostfilename so that it contains one compute node at a time and rerun mpirun. You should quickly be able to see that the problem is not $LD_LIBRARY_PATH but one problematic compute node. In my case, the cause was a broken generated ssh key on one node, even though Torque was showing all nodes as healthy.
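
Since the culprit in my case was ssh rather than the libraries, the one-node-at-a-time check in Step 3 can also be sketched as a loop that probes each host in the hostfile over ssh. This is only a sketch under assumptions: check_nodes and the SSH_PROBE variable are hypothetical names of my own, and the loop assumes the host name is the first field on each hostfile line.

```shell
#!/bin/sh
# Probe each host listed in a hostfile and report the ones that
# cannot be reached over passwordless ssh.
# SSH_PROBE is a hypothetical override so the loop can be tried
# without a real cluster; by default it is plain ssh.
# BatchMode=yes makes ssh fail instead of prompting for a password.
SSH_PROBE=${SSH_PROBE:-ssh -n -o BatchMode=yes -o ConnectTimeout=5}

check_nodes() {
    hostfile=$1
    failed=0
    # Read only the first field of each line, so entries such as
    # "node01 slots=8" are handled too.
    while read -r host _rest || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if $SSH_PROBE "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$hostfile"
    return $failed
}
```

Calling check_nodes hostfilename prints one line per node. Any node reported as FAILED is worth testing by hand with ssh before you look any further at LD_LIBRARY_PATH, since mpirun cannot start its daemon on a node it cannot log in to.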

1 comment:

Asma_hosna said...

What is a broken ssh-generated-key, and how did you solve it? I have the same error as yours and couldn't figure out the exact cause. Do your server and all nodes work with ssh without a password? My server can connect to the nodes through ssh without a password, but it isn't working vice versa.