- Disk Lease Expiration - GPFS uses a mechanism referred to as a disk lease to prevent file system data corruption by a failing node. A disk lease grants a node the right to submit I/O to a file system. File system disk leases are managed by the Cluster Manager of the file system's home cluster. A node must periodically renew its disk lease with the Cluster Manager to maintain its right to submit I/O to the file system. When a node fails to renew its disk lease, the Cluster Manager marks the node as failed, revokes the node's right to submit I/O to the file system, expels the node from the cluster, and initiates recovery processing for the failed node.
- Node Expel Request - GPFS uses a mechanism referred to as a node expel request to prevent file system resource deadlocks. Nodes in the cluster require reliable communication among themselves to coordinate the sharing of file system resources. If a node fails while owning a file system resource, a deadlock may ensue. If a node in the cluster detects that another node owning a shared file system resource may have failed, it sends a message to the file system Cluster Manager requesting that the failed node be expelled from the cluster to prevent a shared file system resource deadlock. When the Cluster Manager receives a node expel request, it determines which of the two nodes should be expelled from the cluster and takes action similar to that described for disk lease expiration.
Fri May 27 16:34:53.249 2016: Expel 172.16.20.5 (goldsvr1) request from 192.168.104.34 (compute186). Expelling: 192.168.104.34 (compute186)
Fri May 27 16:34:53.259 2016: Recovering nodes: 192.168.104.34
Fri May 27 16:34:53.311 2016: Recovered 1 nodes for file system gpfs3.
Fri May 27 16:34:55.636 2016: Accepted and connected to 10.0.104.34 compute186 <c0n135>
Fri May 27 16:39:13.333 2016: Expel 172.16.20.5 (goldsvr1) request from 192.168.104.45 (compute197). Expelling: 192.168.104.45 (compute197)
Fri May 27 16:39:13.334 2016: VERBS RDMA closed connection to 192.168.104.45 compute197 on mlx4_0 port 1
Fri May 27 16:39:13.344 2016: Recovering nodes: 192.168.104.45
Fri May 27 16:39:13.393 2016: Recovered 1 nodes for file system gpfs3.
Fri May 27 16:39:15.725 2016: Accepted and connected to 10.0.104.45 compute197 <c0n141>
Fri May 27 16:40:18.570 2016: VERBS RDMA accepted and connected to 192.168.104.45 on mlx4_0 port 1
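
When reviewing many such log entries, it can help to extract the expel events programmatically. The following is a minimal Python sketch, not an IBM-provided tool. The regular expression is inferred only from the sample messages above and may need adjusting for other GPFS releases; the names `EXPEL_RE`, `ExpelEvent`, and `parse_expel` are hypothetical and chosen for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the sample log lines above (an assumption); other
# GPFS releases may word or lay out these messages differently.
EXPEL_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ [\d:.]+ \d{4}): "
    r"Expel (?P<target_ip>\S+) \((?P<target>[^)]+)\) "
    r"request from (?P<requester_ip>\S+) \((?P<requester>[^)]+)\)\. "
    r"Expelling: (?P<expelled_ip>\S+) \((?P<expelled>[^)]+)\)"
)


class ExpelEvent(NamedTuple):
    when: datetime   # timestamp of the log entry
    target: str      # node the request asked to expel
    requester: str   # node that sent the expel request
    expelled: str    # node the Cluster Manager actually expelled


def parse_expel(line: str) -> Optional[ExpelEvent]:
    """Return an ExpelEvent for an expel log line, or None for any other line."""
    m = EXPEL_RE.match(line.strip())
    if m is None:
        return None
    # %a and %b are locale dependent; this assumes English day and month names.
    when = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S.%f %Y")
    return ExpelEvent(when, m["target"], m["requester"], m["expelled"])


if __name__ == "__main__":
    sample = (
        "Fri May 27 16:34:53.249 2016: Expel 172.16.20.5 (goldsvr1) request "
        "from 192.168.104.34 (compute186). Expelling: 192.168.104.34 (compute186)"
    )
    print(parse_expel(sample))
```

Comparing `requester` with `expelled` shows which of the two nodes the Cluster Manager chose. In both expel events in the excerpt above, the expelled node is the requester (compute186, then compute197) rather than the node named in the request (goldsvr1).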