Thursday, March 6, 2014

PBS_MOM Error Mismatching protocols. Expected protocol 4 but read reply for 0

I encountered this error on my compute nodes using Torque 4.2.5.
pbs_mom.29384;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
This error message is quite misleading. It led me to look at my network protocols, which were InfiniBand (IB) and Ethernet.

When I ran pbsnodes -l, all the compute nodes were reported as down.
# pbsnodes -l
node-c00 down
node-c01 down
.....
..... 

After some troubleshooting, I realised that the error was caused by inconsistent use of the short and long hostnames. In my /etc/hosts, I listed the long hostname for each compute node first (which is the name the Torque server picks up).

192.168.1.2     node-c00.cluster.com    node-c00
......
...... 

But on each of the client nodes, in /etc/sysconfig/network, I used the short hostname. This mismatch confuses the Torque server.

HOSTNAME=node-c00
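
A quick way to spot this kind of mismatch is to compare the name a node is configured with against the first name listed for it in the hosts file. The following is only a minimal sketch under that assumption; canonical_name and check_hostname are hypothetical helper names I made up for illustration and are not part of Torque.

```shell
# Hypothetical helpers (not part of Torque): compare a node's configured
# host name against the first name listed for it in a hosts-format file.

# Print the first (canonical) name on the hosts-file line that mentions a name.
# $1 = hosts file, $2 = name to look up
canonical_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                 # strip comments
        NF >= 2 {
            for (i = 2; i <= NF; i++)      # field 1 is the IP address
                if ($i == name) { print $2; exit }
        }
    ' "$1"
}

# Report whether the configured name is the one listed first.
# $1 = hosts file, $2 = configured host name
# Returns 0 on match, 1 on mismatch, 2 if the name is not listed at all.
check_hostname() {
    canon=$(canonical_name "$1" "$2")
    if [ -z "$canon" ]; then
        echo "no entry for $2 in $1"
        return 2
    elif [ "$canon" = "$2" ]; then
        echo "OK: $2"
        return 0
    else
        echo "MISMATCH: node is named $2, but $1 lists $canon first"
        return 1
    fi
}

# Example use on a compute node:
#   check_hostname /etc/hosts "$(hostname)"
```

With the /etc/hosts entry shown above, checking the short name node-c00 would report a mismatch, while node-c00.cluster.com would pass.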

To correct this, change HOSTNAME to the long name:
HOSTNAME=node-c00.cluster.com

Restart pbs_mom on the client node and your nodes should come back up:
# service pbs_mom restart



