WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: node02
Registerable memory: 32768 MiB
Total memory: 65476 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
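For context, the Open MPI FAQ item linked in the warning explains that on Mellanox HCAs driven by mlx4_core the registerable memory is governed by the log_num_mtt and log_mtts_per_seg kernel module parameters together with the page size. On systems where the driver exposes these parameters (an assumption; newer drivers size the MTT dynamically), the current values can be checked like this:

    # Current MTT-related module parameters (values are log2)
    cat /sys/module/mlx4_core/parameters/log_num_mtt
    cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
    getconf PAGE_SIZE

    # Registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
    # e.g. log_num_mtt=20, log_mtts_per_seg=3, 4 KiB pages:
    #   2^20 * 2^3 * 4096 bytes = 32 GiB, matching the 32768 MiB reported above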
The explanation and solution can be found at How to increase MTT Size in Mellanox HCA. In summary, when an application consumes a large amount of memory, it may fail because not enough memory can be registered with RDMA, so the MTT size needs to be increased (a sketch of the module-parameter change is shown below). The downside of increasing the MTT size is a higher number of "cache misses", which increases latency.
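As a rough sketch (illustrative values, not a one-size-fits-all recommendation), the limit is raised by passing larger values to mlx4_core via a modprobe configuration file and then reloading the driver or rebooting:

    # /etc/modprobe.d/mlx4_core.conf -- illustrative values
    # 2^24 MTT entries * 2^3 entries per segment * 4 KiB pages = 512 GiB registerable,
    # comfortably covering the 65476 MiB of RAM on node02 above
    options mlx4_core log_num_mtt=24 log_mtts_per_seg=3

The warning asks for enough to cover all physical memory on the machine; the Open MPI FAQ goes further and suggests roughly twice the physical RAM. Choose values accordingly, keeping in mind the cache-miss/latency trade-off mentioned above.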
For a more detailed writeup, see Registering sufficient memory for OpenIB when using Mellanox HCA (linuxcluster.wordpress.com).
