Friday, August 30, 2013

Diagnostic Tools to diagnose Infiniband Fabric Information

There are a few diagnostic tools to diagnose InfiniBand fabric information. Use man to see the parameters for each of them:
  1. ibnodes - (Show InfiniBand nodes in topology)
  2. ibhosts - (Show InfiniBand host nodes in topology)
  3. ibswitches - (Show InfiniBand switch nodes in topology)
  4. ibnetdiscover - (Discover InfiniBand topology)
  5. ibchecknet - (Validate IB subnet and report errors)
  6. ibdiagnet - (Scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices)
  7. perfquery - (Find errors on one or a number of HCA and switch ports)
For more information, do look at Diagnostic Tools to diagnose Infiniband Fabric Information
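As a quick illustration, the topology tools can be chained into a simple health-check script. The tool names come from the OFED infiniband-diags package; the guard keeps the script usable on nodes where they are not installed:

```shell
#!/bin/sh
# Walk through the topology tools and run whichever are installed.
for tool in ibnodes ibhosts ibswitches; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool =="
        "$tool"
    else
        echo "$tool not installed (see the infiniband-diags package)"
    fi
done

# perfquery with no arguments reports the counters of the local port
command -v perfquery >/dev/null 2>&1 && perfquery || echo "perfquery not available"
```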

Thursday, August 29, 2013

New Intel® Enterprise Edition for Lustre* Software Designed to Simplify Big Data Management, Storage

  1. Intel® Enterprise Edition for Lustre* software helps simplify configuration, monitoring, management and storage of high volumes of data.
  2. With Intel® Manager for Lustre* software, Intel is able to extend the reach of Lustre into new markets such as financial services, data analytics, pharmaceuticals, and oil and gas.
  3. When combined with the Intel® Distribution for Apache Hadoop* software, Hadoop users can access Lustre data files directly, saving time and resources.
  4. The new software offering furthers Intel's commitment to drive new levels of performance and features through continuing contributions to the open source community.
For more information, see Intel Expands Software Portfolio for Big Data Solutions

Wednesday, August 28, 2013

Issues when compiling OpenMPI 1.6 and Intel 2013 XE on CentOS 5

I encountered the following error when compiling OpenMPI 1.6 with Intel 2013 XE on CentOS 5.3:

opal_wrapper.c:(............): undefined reference to `__intel_sse2_strdup'
opal_wrapper.c:(............): undefined reference to `__intel_sse2_strncmp'
opal_wrapper.o:opal_wrapper.c:(.......): more undefined references to 
`__intel_sse2_strncmp' follow

There is an interesting forum discussion from Intel (Problem with Openmpi+c++ compiler)
that explains how someone solved it. Basically, it can be caused by mismatched libraries and binaries, especially if you have multiple versions of the Intel compilers installed. Do look at your $PATH and $LD_LIBRARY_PATH
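A quick way to audit the environment before reconfiguring (the compilervars path below is an assumption; adjust it to your install):

```shell
#!/bin/sh
# Show which Intel tools the build will actually pick up from PATH.
for c in icc icpc ifort; do
    command -v "$c" >/dev/null 2>&1 && echo "$c -> $(command -v $c)" \
        || echo "$c not in PATH"
done

# Count how many Intel runtime directories are on the library path;
# more than one compiler version here is a red flag.
intel_dirs=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -ci intel || true)
echo "Intel entries in LD_LIBRARY_PATH: $intel_dirs"

# Typical fix: source exactly one compilervars script before configuring,
# e.g. (path is an assumption):
#   source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
```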

Even with correct editing of the $PATH and $LD_LIBRARY_PATH, I was not able to get away from the error. Only when I used CentOS 6 with OpenMPI 1.6 and Intel XE 2013 was I spared from this error.

Monday, August 26, 2013

Announcing the Release of MVAPICH2 2.0a, MVAPICH2-X 2.0a and OSU Micro-Benchmarks (OMB) 4.1

The MVAPICH team is pleased to announce the release of MVAPICH2 2.0a, MVAPICH2-X 2.0a (Hybrid MPI+PGAS (OpenSHMEM) with Unified Communication
Runtime) and OSU Micro-Benchmarks (OMB) 4.1.

Features, Enhancements, and Bug Fixes for MVAPICH2 2.0a (since MVAPICH2 1.9GA release) are listed here.

* Features and Enhancements (since 1.9GA):
     - Based on MPICH-3.0.4
     - Dynamic CUDA initialization. Support GPU device selection after MPI_Init
     - Support for running on heterogeneous clusters with GPU and non-GPU nodes
     - Supporting MPI-3 RMA atomic operations and flush operations
       with CH3-Gen2 interface
     - Exposing internal performance variables to MPI-3 Tools information
       interface (MPIT)
     - Enhanced MPI_Bcast performance
     - Enhanced performance for large message MPI_Scatter and MPI_Gather
     - Enhanced intra-node SMP performance
     - Tuned SMP eager threshold parameters
     - Reduced memory footprint
     - Improved job-startup performance
     - Warn and continue when ptmalloc fails to initialize
     - Enable hierarchical SSH-based startup with Checkpoint-Restart
     - Enable the use of Hydra launcher with Checkpoint-Restart

* Bug-Fixes (since 1.9GA):
     - Fix data validation issue with MPI_Bcast
         - Thanks to Claudio J. Margulis from University of Iowa for the report
     - Fix buffer alignment for large message shared memory transfers
     - Fix a bug in One-Sided shared memory backed windows
     - Fix a flow-control bug in UD transport
         - Thanks to Benjamin M. Auer from NASA for the report
     - Fix bugs with MPI-3 RMA in Nemesis IB interface
     - Fix issue with very large message (>2GB bytes) MPI_Bcast
         - Thanks to Lu Qiyue for the report
     - Handle case where $HOME is not set during search for MV2 user config file
         - Thanks to Adam Moody from LLNL for the patch
     - Fix a hang in connection setup with RDMA-CM

MVAPICH2-X 2.0a software package provides support for hybrid MPI+PGAS (UPC and OpenSHMEM) programming models with unified communication runtime for emerging exascale systems. This software package provides flexibility for users to write applications using the following programming models with a unified communication runtime: MPI, MPI+OpenMP, pure UPC, and pure OpenSHMEM programs as well as hybrid MPI(+OpenMP) + PGAS (UPC and
OpenSHMEM) programs.

Features and enhancements for MVAPICH2-X 2.0a (since MVAPICH2-X 1.9GA) are as follows:

* Features and Enhancements (since 1.9GA):
     - OpenSHMEM Features
         - Optimized OpenSHMEM Collectives (Improved performance for
           shmem_collect, shmem_barrier, shmem_reduce and shmem_broadcast)

     - MPI Features
         - Based on MVAPICH2 2.0a (OFA-IB-CH3 interface)

     - Unified Runtime Features
         - Based on MVAPICH2 2.0a (OFA-IB-CH3 interface). All the runtime
           features enabled by default in OFA-IB-CH3 interface of
           MVAPICH2 2.0a are available in MVAPICH2-X 2.0a

New features and Enhancements of OSU Micro-Benchmarks (OMB) 4.1 (since OMB
4.0.1 release) are listed here.

* New Features & Enhancements
     - New OpenSHMEM benchmarks
         * osu_oshm_barrier
         * osu_oshm_broadcast
         * osu_oshm_collect
         * osu_oshm_reduce
     - New MPI-3 RMA Atomics benchmarks
         * osu_cas_flush
         * osu_fop_flush

For downloading MVAPICH2 2.0a, MVAPICH2-X 2.0a, OMB 4.1, associated user guides, quick start guide, and accessing the SVN, please visit the following URL:

Friday, August 23, 2013

Open Compute Project

The Open Compute Project is the result of a group of Facebook engineers contributing back to the community what they have learned about building a highly energy-efficient Data Centre.

There are a few design specifications:
  1. Server Design Specification
  2. Storage Design Specification 
  3. Data Center Design Specification
  4. Virtual IO Design Specification
  5. Hardware Management Specification
  6. Certification Standards

Tuesday, August 20, 2013

Building OpenMPI Libraries for 64-bit integers

There is an excellent article on how to build OpenMPI Libraries for 64-bit integers. For more detailed information, do look at How to build MPI libraries for 64-bit integers

The information on this website is taken from the above site.

Step 1: Check the integer size. Do the following:
# ompi_info -a | grep 'Fort integer size' 
If the output is as below, you have to compile OpenMPI with 64-bit integers.
Fort integer size: 4

* Intel Compilers
Step 2a: To compile OpenMPI with the Intel Compilers and with 64-bit integers, do the following:
# ./configure --prefix=/usr/local/openmpi CXX=icpc CC=icc \
F77=ifort FC=ifort FFLAGS=-i8 FCFLAGS=-i8
# make -j 8
# make install

* GNU Compilers
Step 2b: To compile OpenMPI with the GNU Compilers and with 64-bit integers, do the following:
# ./configure --prefix=/usr/local/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran \
FFLAGS="-m64 -fdefault-integer-8"  \
FCFLAGS="-m64 -fdefault-integer-8" \
CFLAGS=-m64 CXXFLAGS=-m64
# make -j 8
# make install

Step 3: Update your PATH and LD_LIBRARY_PATH in your .bashrc
export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

Verify that the installation is correct
# ompi_info -a | grep 'Fort integer size' 
Fort integer size: 8
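Beyond ompi_info, you can also compile a tiny Fortran probe with the new wrappers as a cross-check (mpif90 here is assumed to come from the freshly built OpenMPI):

```shell
#!/bin/sh
# Write a one-line Fortran probe that prints the default integer width.
cat > check_int.f90 <<'EOF'
program check_int
  integer :: i
  print *, 'default integer bytes:', storage_size(i) / 8
end program check_int
EOF

# With an -i8 / -fdefault-integer-8 build this should print 8, not 4.
if command -v mpif90 >/dev/null 2>&1; then
    mpif90 check_int.f90 -o check_int && ./check_int
else
    echo "mpif90 not in PATH; source your OpenMPI environment first"
fi
```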

Monday, August 12, 2013

Registering sufficient memory for OpenIB when using Mellanox HCA

If you encounter errors like "error registering openib memory", similar to what is written below, you may want to take a look at the OpenMPI FAQ - I'm getting errors about "error registering openib memory"; what do I do?

WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module

  Local host:              node02
  Registerable memory:     32768 MiB
  Total memory:            65476 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
The explanation and solution can be found at How to increase MTT Size in Mellanox HCA.

In summary, the error occurs for applications that consume a large amount of memory: the application may fail when not enough memory can be registered with RDMA, so there is a need to increase the MTT size. However, increasing the MTT size has the downside of increasing the number of "cache misses", which increases latency.

For a more detailed write-up, see Registering sufficient memory for OpenIB when using Mellanox HCA.
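For reference, the registerable limit works out to (2^log_num_mtt) x (2^log_mtts_per_seg) x PAGE_SIZE, so a sketch for a 64 GB node (the parameter values below are illustrative, not a recommendation) would be:

```shell
#!/bin/sh
# /etc/modprobe.d/mlx4_core.conf would carry a line like:
#   options mlx4_core log_num_mtt=22 log_mtts_per_seg=3
# followed by a driver reload or a reboot.

# Check the arithmetic: 2^22 MTT entries x 2^3 per segment x 4096-byte pages
page_size=4096
max_reg_mem=$(( (1 << 22) * (1 << 3) * page_size ))
echo "registerable: $(( max_reg_mem / 1024 / 1024 )) MiB"   # 131072 MiB, i.e. 2 x 64 GB
```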

Thursday, August 8, 2013

Tracking Batch Jobs at Platform LSF

The content of this article is taken from the references listed at the end of this post.

1. Displaying All Job Status
# bjobs -u all

2. Report Reasons why a job is pending
# bjobs -p

3. Report Pending Reasons with host names for each conditions
# bjobs -lp

4. Detailed Report on a specific jobs
# bjobs -l 6653

5. Reasons why the job is suspended
# bjobs -s

6. Displaying the Output of a Running Job
# bpeek 12345

7. Killing Jobs
# bkill 12345

8. Stop the Job
# bstop 12345

9. Resume the Job
# bresume 12345
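Put together, a typical session might look like this (the job ID 12345 is illustrative; substitute the ID that bsub echoes back):

```shell
#!/bin/sh
jobid=12345   # illustrative; use the ID printed by bsub

if command -v bsub >/dev/null 2>&1; then
    bsub -J mytest -o out.%J "sleep 60"   # submit a job
    bjobs -p                              # anything stuck pending?
    bjobs -l "$jobid"                     # full detail for one job
    bpeek "$jobid"                        # output produced so far
    bkill "$jobid"                        # remove it when done
else
    echo "LSF commands not in PATH on this host"
fi
```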

Other References:

  1. Platform LSF – Monitoring jobs and tasks
  2. Platform LSF – Administration and Accounting Commands
  3. Platform LSF - View information about cluster
  4. Platform LSF - Submitting and Controlling jobs

Tuesday, August 6, 2013

Virtual Memory PAGESIZE on CentOS

There is a very good write-up on Linux Virtual Memory PAGESIZE from Nixcraft
(Linux Find Out Virtual Memory PAGESIZE).

To get the Linux Virtual Memory PAGESIZE, use the following command:
#  getconf PAGESIZE
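The value is handy in capacity arithmetic, for example working out how many pages back a given allocation:

```shell
#!/bin/sh
pagesize=$(getconf PAGESIZE)
echo "page size: $pagesize bytes"
# pages needed to back 1 GiB of virtual memory
echo "pages per GiB: $(( 1024 * 1024 * 1024 / pagesize ))"
```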

Sunday, August 4, 2013

Installing pdsh to issue commands to a group of nodes in parallel in CentOS

1. What is pdsh? Pdsh is a high-performance, parallel remote shell utility. It uses a sliding window of threads to execute remote commands, conserving socket resources while allowing some connections to timeout if needed. It was originally written as a replacement for IBM's DSH on clusters at LLNL. More information can be found at PDSH Web site

2. Setup EPEL yum repository on CentOS 6. For more information, see Repository of CentOS 6 and Scientific Linux 6  

3. Do a yum install
# yum install pdsh
To confirm installation
# which pdsh

4. Configure user environment for PDSH
# vim /etc/profile.d/pdsh.sh
Add the following:
# setup pdsh for cluster users
export PDSH_RCMD_TYPE='ssh'
export WCOLL='/etc/pdsh/machines'

5. Put the host names of the Compute Nodes, one per line
# vim /etc/pdsh/machines


6. Make sure the nodes have completed their SSH key exchange. For more information, see Auto SSH Login without Password

7. Repeat Step 1 to Step 3 on ALL the client nodes.

B. USING PDSH

Run the command: pdsh [options]... command

1. To target all the nodes listed in /etc/pdsh/machines. This assumes the RPM has already been copied to the nodes; do note that a parallel copy tool (pdcp) comes with the pdsh utilities.
# pdsh -a "rpm -Uvh /root/htop-1.0.2-1.el6.rf.x86_64.rpm"

2. To exclude specific nodes from the run, use the -x option:
# pdsh -x host1,host2 "rpm -Uvh /root/htop-1.0.2-1.el6.rf.x86_64.rpm"
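Since the parallel copy tool pdcp ships with pdsh, the copy-then-install sequence can be done entirely in parallel (the host names and RPM below are illustrative):

```shell
#!/bin/sh
pkg=htop-1.0.2-1.el6.rf.x86_64.rpm   # illustrative package name

if command -v pdcp >/dev/null 2>&1; then
    pdcp -a "/root/$pkg" /root/          # push the RPM to every node in WCOLL
    pdsh -a "rpm -Uvh /root/$pkg"        # install it everywhere
    pdsh -w node0[1-4] uptime            # -w targets only the named hosts
else
    echo "pdsh/pdcp not installed on this host"
fi
```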
Other References:

  1. Install and setup pdsh on IBM Platform Cluster Manager
  2. PDSH Project Site
  3. PDSH Download Site (Sourceforge)

Saturday, August 3, 2013

Using nvidia-smi to get information on GPU Cards

NVIDIA’s System Management Interface (nvidia-smi) is a useful tool to monitor and manage the GPU cards. A few use cases are listed here.

1. Listing of NVIDIA GPU Cards
# nvidia-smi -L

GPU 0: Tesla M2070 (S/N: 03212xxxxxxxx)
GPU 1: Tesla M2070 (S/N: 03212yyyyyyyy)

2. Display GPU information
# nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp : Sun Jul 28 23:49:20 2013

Driver Version : 295.41

Attached GPUs : 2

GPU 0000:19:00.0
    Product Name                : Tesla M2070
    Display Mode                : Disabled
    Persistence Mode            : Disabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 03212xxxxxxxx
    GPU UUID                    : GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
    VBIOS Version               : 70.00.3E.00.03
    Inforom Version
        OEM Object              : 1.0
        ECC Object              : 1.0
        Power Management Object : 1.0
    PCI
        Bus                     : 0x19
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0xxxxxxxxx
        Bus Id                  : 0000:19:00.0
        Sub System Id           : 0x083010DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 2
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P0
    Memory Usage
        Total                   : 6143 MB
        Used                    : 10 MB
        Free                    : 6132 MB
    Compute Mode                : Exclusive_Thread
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Disabled
        Pending                 : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
    Temperature
        Gpu                     : N/A
    Power Readings
        Power Management        : N/A
        Power Draw              : N/A
        Power Limit             : N/A
    Clocks
        Graphics                : 573 MHz
        SM                      : 1147 MHz
        Memory                  : 1566 MHz
    Max Clocks
        Graphics                : 573 MHz
        SM                      : 1147 MHz
        Memory                  : 1566 MHz
    Compute Processes           : None

# nvidia-smi -i 0 -q -d MEMORY,ECC

==============NVSMI LOG==============

Timestamp                       : Mon Jul 29 00:04:36 2013

Driver Version                  : 295.41

Attached GPUs                   : 2

GPU 0000:19:00.0
Memory Usage
Total                   : 6143 MB
Used                    : 10 MB
Free                    : 6132 MB
Ecc Mode
Current                 : Disabled
Pending                 : Disabled
ECC Errors
    Volatile
        Single Bit
            Device Memory   : N/A
            Register File   : N/A
            L1 Cache        : N/A
            L2 Cache        : N/A
            Total           : N/A
        Double Bit
            Device Memory   : N/A
            Register File   : N/A
            L1 Cache        : N/A
            L2 Cache        : N/A
            Total           : N/A
    Aggregate
        Single Bit
            Device Memory   : N/A
            Register File   : N/A
            L1 Cache        : N/A
            L2 Cache        : N/A
            Total           : N/A
        Double Bit
            Device Memory   : N/A
            Register File   : N/A
            L1 Cache        : N/A
            L2 Cache        : N/A
            Total           : N/A
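To repeat a query across every card, the device count from -L can drive a loop (a sketch; it assumes an installed NVIDIA driver):

```shell
#!/bin/sh
if command -v nvidia-smi >/dev/null 2>&1; then
    ngpu=$(nvidia-smi -L | wc -l)      # one line per detected GPU
    i=0
    while [ "$i" -lt "$ngpu" ]; do
        nvidia-smi -i "$i" -q -d MEMORY,ECC
        i=$(( i + 1 ))
    done
else
    ngpu=0
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```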

Friday, August 2, 2013

Compiling MVAPICH2-1.9 with Intel and CUDA

Step 1: Download MVAPICH2-1.9 from the MVAPICH download site. The current version at the point of writing is MVAPICH2 1.9.

Step 2: Compile MVAPICH2 with Intel and CUDA.

# tar -zxvf mvapich2-1.9.tgz
# cd mvapich2-1.9
# mkdir buildmpi
# cd buildmpi
# ../configure --prefix=/usr/local/mvapich2-1.9-intel-cuda CC=icc CXX=icpc F77=ifort FC=ifort \
--with-cuda=/opt/cuda/ --with-cuda-include=/opt/cuda/include --with-cuda-libpath=/opt/cuda/lib64
# make -j8
# make install
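To confirm the wrappers were built the way you intended, MVAPICH2 ships an mpiname utility, and mpicc -show prints the underlying compile line (the prefix below matches the one used in configure above):

```shell
#!/bin/sh
export PATH=/usr/local/mvapich2-1.9-intel-cuda/bin:$PATH

for t in mpiname mpicc; do
    if command -v "$t" >/dev/null 2>&1; then
        case "$t" in
            mpiname) mpiname -a ;;     # version and configure options
            mpicc)   mpicc -show ;;    # underlying compiler invocation
        esac
    else
        echo "$t not found under the install prefix"
    fi
done
```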

Thursday, August 1, 2013

Turning off and on ECC RAM for NVIDIA GP-GPU Cards

From NVIDIA Developer site.

Turn off ECC (C2050 and later). ECC can cost you up to 10% in performance and hurts parallel scaling. Before attempting this, you should verify that your GPUs are working correctly and are not giving ECC errors. You can turn ECC off on Fermi-based cards and later by running the following command for each GPU ID as root, followed by a reboot.

Extensive testing of AMBER on a wide range of hardware has established that ECC has little to no benefit for the reliability of AMBER simulations. This is part of the reason it is acceptable (see recommended hardware) to use GeForce gaming cards for AMBER simulations.

1. To Turn off the ECC RAM, just do a
# nvidia-smi -g 0 --ecc-config=0
(repeat with -g x for each GPU ID)

2. To turn the ECC RAM back on, just do
# nvidia-smi -g 0 --ecc-config=1
(repeat with -g x for each GPU ID)
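On a multi-GPU node the same toggle can be applied to every card in one pass (a sketch; it requires root, an installed driver, and a reboot afterwards):

```shell
#!/bin/sh
ecc=0   # 0 = disable ECC, 1 = re-enable it

if command -v nvidia-smi >/dev/null 2>&1; then
    ngpu=$(nvidia-smi -L | wc -l)      # one line per detected GPU
    i=0
    while [ "$i" -lt "$ngpu" ]; do
        nvidia-smi -g "$i" --ecc-config=$ecc
        i=$(( i + 1 ))
    done
    echo "reboot for the new ECC setting to take effect"
fi
```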