I would like to add to this entry. In Open MPI FAQ item 17, "I'm still getting errors about 'error registering openib memory'; what do I do?", the FAQ mentions the scheduler:
Make sure that the resource manager daemons are started with unlimited memlock limits (which may involve editing the resource manager daemon startup script, or some other system-wide location that allows the resource manager daemon to get an unlimited limit of locked memory). Otherwise, jobs that are started under that resource manager will get the default locked memory limits, which are far too small for Open MPI.
The files in limits.d (or the limits.conf file) do not usually apply to resource manager daemons! The limits.d files usually only apply to rsh- or ssh-based logins. Hence, daemons usually inherit the system default of a maximum of 32 KB of locked memory (which then gets passed down to the MPI processes that they start). To increase this limit, you typically need to modify the daemons' startup scripts to raise the limit before they drop root privileges.
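As a rough sketch of what that change looks like (assuming a SysV-style init script; the exact path, script layout, and daemon name vary by resource manager and distro), it amounts to one ulimit call made while the script is still running as root:

# Hypothetical excerpt from a resource manager daemon's init script.
# Raise the locked-memory limit before launching the daemon, so every
# job it starts inherits "unlimited" rather than the ~32 KB default.
ulimit -l unlimited
# ... the script's existing logic then starts the daemon as usual.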
Some resource managers can limit the amount of locked memory that is made available to jobs. For example, SLURM has some fine-grained controls that allow raising the locked memory limit for SLURM jobs only (i.e., the system's default locked memory limit stays low, but SLURM jobs can get high limits). See these FAQ items on the SLURM web site for more details: propagating limits and using PAM.
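As a minimal sketch of the SLURM route (assuming the slurmd daemons themselves already run with an unlimited memlock limit), you tell SLURM not to propagate the submitting shell's low MEMLOCK limit to the job, so tasks inherit slurmd's high limit instead. The line below goes in slurm.conf:

# slurm.conf: propagate the user's resource limits to the job,
# except MEMLOCK, which tasks then inherit from slurmd itself.
PropagateResourceLimitsExcept=MEMLOCK

Restart (or reconfigure) the SLURM daemons afterwards so the new setting takes effect.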
Other Related Issues
1. For Torque, you may want to tweak /etc/init.d/pbs_mom; see the blog entry Default ulimit setting in torque overide ulimit setting. After editing, restart the service (a quick check of the daemon's new limit is sketched after this list):
# service pbs_mom restart
2. See also Encountering Segmentation Fault, Bus Error or No output. In that blog, you have to edit /etc/security/limits.conf (a way to verify the change is sketched after this list):
* soft memlock unlimited
* hard memlock unlimited
3. If you still have memory issues and are using Mellanox IB cards, do take a look at Registering sufficent memory for OpenIB when using Mellanox HCA; a module-parameter sketch follows below.
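For item 1, once pbs_mom has been restarted, a quick way to confirm the daemon really got the higher limit is to read its /proc entry (the pgrep invocation is just one way to find the PID; both the soft and hard values should report unlimited):

# grep "Max locked memory" /proc/$(pgrep -o pbs_mom)/limits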
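For item 2, note that entries in /etc/security/limits.conf only take effect for new login sessions. After logging in again, a regular user can confirm the change with ulimit; it should print unlimited:

$ ulimit -l
unlimited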
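For item 3, on older Mellanox ConnectX HCAs driven by mlx4_core, the amount of memory the HCA can register is governed by the log_num_mtt and log_mtts_per_seg module parameters. A minimal sketch, assuming a 4 KB page size (the file name and values are illustrative; size the result to at least your physical RAM, ideally twice that):

# /etc/modprobe.d/mlx4_core.conf (file name is illustrative)
# Registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
# Here: 2^24 * 2^1 * 4096 bytes = 128 GB
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1

The mlx4_core module must be reloaded (or the node rebooted) before the new values take effect.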