pbs_mom.29384;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
This error was quite misleading. At first I looked at my network protocols, which were IB and Ethernet.
When I did a pbsnodes -l, all the compute nodes were down.
# pbsnodes -l
node-c00 down
node-c01 down
.....
.....
After some troubleshooting, I realised that the error was caused by inconsistent use of the short and long hostnames. In my /etc/hosts, I listed the long hostname of the compute node first (which is the name the Torque server picks up).
192.168.1.2 node-c00.cluster.com node-c00
......
......
But on each of the client nodes, in /etc/sysconfig/network, I used the short hostname. This confused the Torque server.
HOSTNAME=node-c00
To correct this, change HOSTNAME to the long name:
HOSTNAME=node-c00.cluster.com
Restart pbs_mom on the client node and the nodes should come back up:
# service pbs_mom restart
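To catch the same mismatch on other nodes, the hostname comparison can be scripted. The following is only a rough sketch, not part of Torque: the function names canonical_name and check_node_name are made up for illustration, and it assumes the first name after the IP address in the hosts file is the one the server uses, as described above.

```shell
#!/bin/sh
# Hypothetical helper: print the first (canonical) name that a hosts file
# lists for the given IP address.
canonical_name() {
    # $1 = IP address, $2 = hosts file (defaults to /etc/hosts)
    awk -v ip="$1" '$1 == ip { print $2; exit }' "${2:-/etc/hosts}"
}

# Hypothetical helper: compare the hostname configured on a node with the
# canonical name from the hosts file and report whether they agree.
check_node_name() {
    # $1 = IP address, $2 = hostname set on the node, $3 = hosts file
    expected=$(canonical_name "$1" "$3")
    if [ "$expected" = "$2" ]; then
        echo "OK: $2"
    else
        echo "MISMATCH: node uses '$2', hosts file lists '$expected' first"
    fi
}
```

With the functions loaded on a compute node, something like check_node_name 192.168.1.2 "$(hostname)" would report a mismatch for the short name node-c00 and OK once HOSTNAME is set to node-c00.cluster.com.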