Monday, November 2, 2009

Jobs submitted and placed on the queue but not able to run on HPC

Sometimes, when our users submit jobs via MOAB with Torque as the resource manager, the jobs are not able to run even though there appear to be enough resources to accommodate them. There are many possible causes and many ways to troubleshoot. Here is how I troubleshot one of them.

Step 1: Use the command checkjob on the job ID of the job that is stuck.
# checkjob 100001
--------------------------------------
Partition List:      xxxxxxxx
Flags:                 RESTARTABLE
Attr:                   checkpoint
StartPriority:       94
rejected for CPU              - (null)
rejected for State              - (null)
NOTE: job req cannot run in partition xxxxx (available procs do not meet requirements : 0 of 32 procs found)
---------------------------------------
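The key hint is under NOTE: the scheduler found 0 of the 32 procs the job requires, so no node in the partition is suitable for this job. But why?

Step 2: Use the command qstat -f on the same job ID to see what was actually requested.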
# qstat -f 100001
----------------------------------
euser = xxxxx
egroup = xxxxx
queue_rank = xxxxx
queue_type = xxxxx
etime = xxxxxxxx
submit_args = -l nodes=2:ppn=16 ./run.sh
-----------------------------------
The key issue is that the user submitted a job requesting 2 nodes with 16 cores each, which is the 32 procs reported in the checkjob NOTE. No such node layout exists in our cluster configuration, so the request can never be satisfied and the job stays in the queue.
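The comparison above can be sketched as a small shell function. This is only an illustration, not part of MOAB or Torque: check_request is a hypothetical helper, it handles only the simple "nodes=N:ppn=M" form of a request, and the per-node core count is a value you pass in yourself (on a Torque cluster, the np values listed by pbsnodes -a are one place to look it up).

```shell
# Sketch: does a simple Torque "nodes=N:ppn=M" request fit on our nodes?
# check_request is a hypothetical helper; max_ppn (cores on the largest
# node) is supplied by the caller and is an assumption, not read from Torque.
check_request() {
    req="$1"
    max_ppn="$2"

    # Accept only the plain "nodes=N:ppn=M" form.
    case "$req" in
        nodes=*:ppn=*) ;;
        *) echo "unparsable request: $req"; return 2 ;;
    esac

    rest="${req#nodes=}"       # e.g. "2:ppn=16"
    nodes="${rest%%:*}"        # text before the first ":"  -> "2"
    ppn="${rest##*:ppn=}"      # text after ":ppn="         -> "16"

    # Both fields must be non-empty and purely numeric.
    case "$nodes" in ''|*[!0-9]*) echo "unparsable request: $req"; return 2 ;; esac
    case "$ppn"   in ''|*[!0-9]*) echo "unparsable request: $req"; return 2 ;; esac

    # A request can never start if it wants more cores per node than any node has.
    if [ "$ppn" -gt "$max_ppn" ]; then
        echo "cannot run: ppn=$ppn exceeds $max_ppn cores per node"
        return 1
    fi

    echo "ok: $nodes node(s) x $ppn = $((nodes * ppn)) procs"
}
```

For example, check_request "nodes=2:ppn=16" 8 reports that the request cannot run, where 8 is a made-up core count used only for illustration. The sketch checks cores per node only; it does not check how many nodes exist or whether they are currently free.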

Using MOAB and Torque commands together to analyse a problem like this is very useful.
