1. Force the MOM to send an obituary of the job ID to the Torque server
# qsig -s 0 job_id
2. Use the momctl command on the compute nodes where the job is still listed to clear the stale job. You can use tracejob to check which nodes the job has been sent to
# momctl -c job_id -h compute_node_1
3i. Setting the qmgr server setting mom_job_sync to True might help prevent jobs from hanging.
# qmgr -c "set server mom_job_sync = True"
3ii. To verify that the setting in 3i is in place, print the server configuration with the command
# qmgr -c "p s"
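4. The final option. If all else fails, purge the job from the server. Use this only when the job will not exit any other way, since it removes the job without the usual cleanup on the nodes
# qdel -p job_id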
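The steps above can be strung together in a small wrapper. This is only a rough sketch, not something from the Torque documentation: the function names (run, still_queued, clear_stuck_job) and the DRY_RUN and PURGE variables are made up for illustration, and it assumes qstat exits non-zero once the job is gone. By default it only prints the commands it would run. A real script would probably also wait a little between steps to give the server time to process the obituary.

```shell
#!/bin/sh
# Hypothetical helper: escalate through steps 1, 2 and 4 for one job.
# DRY_RUN=1 (the default) prints the commands instead of running them.
# PURGE=1 allows the last-resort "qdel -p" step.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

still_queued() {
    # In dry-run mode, pretend the job is still there so every step is shown.
    # Otherwise rely on qstat failing for a job the server no longer knows.
    [ "${DRY_RUN:-1}" = "1" ] || qstat "$1" >/dev/null 2>&1
}

clear_stuck_job() {
    job_id=$1
    shift

    # Step 1: have the MOM send the job's obituary to the server.
    run qsig -s 0 "$job_id"
    still_queued "$job_id" || return 0

    # Step 2: clear the stale job on every node passed on the command line.
    for node in "$@"; do
        run momctl -c "$job_id" -h "$node"
    done
    still_queued "$job_id" || return 0

    # Step 4: purge the job, but only when explicitly requested.
    if [ "${PURGE:-0}" = "1" ]; then
        run qdel -p "$job_id"
    fi
    return 0
}
```

After sourcing the file, `clear_stuck_job job_id compute_node_1` prints the qsig and momctl commands for review. Setting DRY_RUN=0 runs them for real, and PURGE=1 adds the qdel -p step at the end.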
For more information, see the Adaptive Computing website, Section 11.1.7 Stuck Jobs