Monday, March 14, 2011

Dealing with stuck jobs in Torque and MAUI

This is an add-on to the blog entry "Manually Deleting Torque and PBS jobs using MAUI"

1. Force the MOM to send an obituary for the job to the Torque server
# qsig -s 0 job_id

2. Clear the stale job with the momctl command on each compute node where the job is still listed. You can use tracejob to check which nodes the job was sent to
# momctl -c job_id -h compute_node_1
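
When a job spans several nodes, step 2 has to be repeated for each of them. The sketch below is one possible way to get the node list from the job's exec_host attribute (for example "node1/0+node1/1+node2/0"). The function name exec_host_nodes is hypothetical, and the commented usage assumes the job still appears in qstat and that your qstat supports "-f -1" to keep attributes on one line.

```shell
#!/bin/sh
# Hypothetical helper: turn an exec_host value such as
# "node1/0+node1/1+node2/0" into one unique host name per line.
exec_host_nodes() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Usage sketch (assumption: qstat still lists the job):
#   hosts=$(qstat -f -1 job_id | sed -n 's/^ *exec_host = //p')
#   for node in $(exec_host_nodes "$hosts"); do
#       momctl -c job_id -h "$node"
#   done
```

If the job no longer shows up in qstat, take the node names from the tracejob output instead.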

3i. Setting the qmgr server setting mom_job_sync to True might help prevent jobs from hanging.
# qmgr -c "set server mom_job_sync = True"

3ii. To verify that the setting from 3i has been applied, print the server configuration and look for the mom_job_sync line
# qmgr -c "p s"
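
The server printout can be long, so it may help to filter it. This is a minimal sketch that assumes the printout lists the setting as a line of the form "set server mom_job_sync = True"; the function name mom_job_sync_enabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `qmgr -c "p s"` output on stdin and succeeds
# only if mom_job_sync is explicitly set to True.
mom_job_sync_enabled() {
    grep -Eiq '^set server mom_job_sync = true[[:space:]]*$'
}

# Usage sketch:
#   qmgr -c "p s" | mom_job_sync_enabled && echo "mom_job_sync is True"
```

A failed check only means the line was not found in the printout, so inspect the full output before concluding that the setting is off.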

4. The final option. If all else fails, forcibly purge the job from the server. Use this only after the steps above have not worked.
# qdel -p job_id
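
The steps above can be put together as a small checklist generator. This is only a sketch: the function name print_stuck_job_steps is hypothetical, and it prints the commands for review instead of running them, so you can check after each one whether the job has already gone away.

```shell
#!/bin/sh
# Hypothetical helper: print the escalation commands for one stuck job,
# mildest first. Arguments: job ID, then the compute nodes where the
# job is still listed. Nothing is executed.
print_stuck_job_steps() {
    job_id=$1
    shift
    printf 'qsig -s 0 %s\n' "$job_id"
    for node in "$@"; do
        printf 'momctl -c %s -h %s\n' "$job_id" "$node"
    done
    # Last resort only: forcibly purges the job from the server.
    printf 'qdel -p %s\n' "$job_id"
}

# Usage sketch:
#   print_stuck_job_steps job_id compute_node_1 compute_node_2
```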

For more information, see Section 11.1.7, "Stuck Jobs", on the Adaptive Computing website.
