Tuesday, February 26, 2013

Finding the user who used excessive memory and crashed the server

Using Linux tools, we can quickly deduce which user crashed the system. For me, some of the hardest to trace are researchers who run jobs with large memory arrays. If the array is not sized correctly, it will consume more memory than the server can provide, both physical and swap, and the system will crash.

One of the fastest ways is to check which users were last logged on around the crash, and cross-check with /var/log/messages. I just issue the command:

# last

reboot   system boot  2.6.18-238.9.1.e Mon Feb 25 08:42         (1+01:12)
user1    pts/30       :3.0             Sun Feb 24 15:19 - crash  (17:22)
user1    pts/36       :3.0             Sun Feb 24 14:26 - crash  (18:15)

If you also take a look at the log file, you can see that the crash appeared after the last user login.

Feb 24 14:25:19 Head-Node gconfd (user1-29338): Resolved address "xml:readwrite:/home/user1/.gconf" to a writable configuration source at position 0
Feb 24 16:15:11 Head-Node mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=2739215:   Reason code 668 Failure Reason Lost membership in cluster nsd1-nas. Unmounting file systems.
Feb 24 16:15:11 Head-Node mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=2739215:
Feb 24 16:18:08 Head-Node kernel: TCP: time wait bucket table overflow
Feb 24 16:18:08 Head-Node kernel: mmfsd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Feb 24 16:18:08 Head-Node kernel:
Feb 24 16:18:08 Head-Node kernel: Call Trace:
Feb 24 16:18:08 Head-Node kernel:  [] out_of_memory+0x8e/0x2f3
Feb 24 16:18:08 Head-Node kernel:  [] __wake_up+0x38/0x4f
Feb 24 16:18:08 Head-Node kernel:  [] autoremove_wake_function+0x0/0x2e
Feb 24 16:18:08 Head-Node kernel:  [] __alloc_pages+0x27f/0x308
Feb 24 16:18:08 Head-Node kernel:  [] read_swap_cache_async+0x45/0xd8
Feb 24 16:18:08 Head-Node kernel:  [] swapin_readahead+0x60/0xd3
Feb 24 16:18:08 Head-Node kernel:  [] __handle_mm_fault+0xb62/0x1039
Feb 24 16:18:15 Head-Node kernel:  [] thread_return+0x62/0xfe
Feb 24 16:18:28 Head-Node kernel:  [] do_page_fault+0x4cb/0x874
Feb 24 16:18:53 Head-Node kernel:  [] hrtimer_cancel+0xc/0x16
Feb 24 16:19:16 Head-Node kernel:  [] do_nanosleep+0x47/0x70
Feb 24 16:19:29 Head-Node kernel:  [] hrtimer_nanosleep+0x58/0x118
Feb 24 16:19:33 Head-Node kernel:  [] error_exit+0x0/0x84
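The oom-killer entries above can be pulled out of the log quickly with grep. A minimal sketch, using a small sample file so it is self-contained; on a real server, point grep at /var/log/messages directly:

```shell
# Create a small sample in /var/log/messages format (for illustration only)
cat > /tmp/sample_messages <<'EOF'
Feb 24 16:18:08 Head-Node kernel: TCP: time wait bucket table overflow
Feb 24 16:18:08 Head-Node kernel: mmfsd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Feb 24 16:18:08 Head-Node kernel: Call Trace:
EOF

# Show only the oom-killer events; these mark when memory was exhausted
grep -i 'oom-killer' /tmp/sample_messages
```

Matching the oom-killer timestamps against the `last` output then points at the user.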
 

To manage memory more effectively, do look at
  1. Tweaking the Linux Kernel to manage memory and swap usage  

Thursday, February 21, 2013

SingCERT Alerts - SPAM Emails from Yahoo Accounts

The Singapore Computer Emergency Response Team (SingCERT) has received reports of spam emails from Yahoo! accounts containing links to websites selling “work from home” schemes and packages.

The spam emails may have the following characteristics:
  • Empty subject line
  • Email content contains just a website link
For more information, see Yahoo plugs hole that allowed hijacking of email accounts

Tuesday, February 19, 2013

Running Open MPI on oversubscribed nodes

Taken from the Open MPI FAQ, 21. Can I oversubscribe nodes (run more processes than processors)?
Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.
  • Degraded: When Open MPI thinks that it is in an oversubscribed mode (i.e., more processes are running than there are processors available), MPI processes will automatically run in degraded mode and frequently yield the processor to its peers, thereby allowing all processes to make progress.
  • Aggressive: When Open MPI thinks that it is in an exactly- or under-subscribed mode (i.e., the number of running processes is equal to or less than the number of available processors), MPI processes will automatically run in aggressive mode, meaning that they will never voluntarily give up the processor to other processes. With some network transports, this means that Open MPI will spin in tight loops attempting to make message passing progress, effectively causing other processes to not get any CPU cycles (and therefore never make any progress).
Example of degraded mode (running 4 processes on 1 physical core). Open MPI knows that there is only 1 slot, and 4 MPI processes are running on that single slot.
$ cat my-hostfile
localhost slots=1
$ mpirun -np 4 --hostfile my-hostfile a.out
Example of aggressive mode (running 4 processes on 4 or more physical cores). Open MPI knows that there are at least 4 slots for the 4 MPI processes.
$ cat my-hostfile
localhost slots=4
$ mpirun -np 4 --hostfile my-hostfile a.out
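The same FAQ also notes that the degraded (yielding) behaviour can be forced explicitly through the mpi_yield_when_idle MCA parameter, regardless of what Open MPI detects:

```shell
$ mpirun --mca mpi_yield_when_idle 1 -np 4 --hostfile my-hostfile a.out
```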

Saturday, February 16, 2013

Error when yum install ganglia on CentOS 5.8

I was doing a yum install of Ganglia on CentOS 5.8:

# yum install php rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd

I got the errors below.
ganglia-web-3.1.7-3.el5.rf.x86_64 from rpmforge has depsolving problems
  --> Missing Dependency: php-gd is needed by package ganglia-web-3.1.7-3.el5.rf.x86_64 (rpmforge)
rrdtool-1.4.7-1.el5.rf.x86_64 from rpmforge has depsolving problems
  --> Missing Dependency: libdbi.so.0()(64bit) is needed by package rrdtool-1.4.7-1.el5.rf.x86_64 (rpmforge)
ganglia-web-3.1.7-3.el5.rf.x86_64 from rpmforge has depsolving problems
  --> Missing Dependency: php is needed by package ganglia-web-3.1.7-3.el5.rf.x86_64 (rpmforge)
Error: Missing Dependency: php is needed by package ganglia-web-3.1.7-3.el5.rf.x86_64 (rpmforge)
Error: Missing Dependency: libdbi.so.0()(64bit) is needed by package rrdtool-1.4.7-1.el5.rf.x86_64 (rpmforge)
Error: Missing Dependency: php-gd is needed by package ganglia-web-3.1.7-3.el5.rf.x86_64 (rpmforge)
 You could try using --skip-broken to work around the problem
 You could try running: package-cleanup --problems
                        package-cleanup --dupes
                        rpm -Va --nofiles --nodigest

To avoid this error during the installation, see Installing and Configuring Ganglia on CentOS 5.8 for more information.

Step 1: install the package that provides libdbi.so.0 (64-bit)
  1. Make sure you have the RPM Repositories installed. For more information, see Useful Repositories for CentOS 5
  2. Install the libdbi-0.8.1-2.1.x86_64.rpm built for CentOS 5.9 on the CentOS 5.8 system. Apparently, there are no conflict or dependency issues. See RPM Resource libdbi.so.0()(64bit)
    # wget ftp://rpmfind.net/linux/centos/5.9/os/x86_64/CentOS/libdbi-0.8.1-2.1.x86_64.rpm

    # rpm -ivh libdbi-0.8.1-2.1.x86_64.rpm
  3. Install PHP 5.4. See Installing PHP 5.4 on CentOS 5
  4. Finally, run the installation again:
     # yum install php rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd
  5. For the rest of the steps, do look at Installing and Configuring Ganglia on CentOS 5.8
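Before re-running yum, a quick sanity check (an rpm query sketch) confirms that the 64-bit libdbi library is now provided:

```shell
# rpm -q --provides libdbi | grep libdbi.so.0
```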

Friday, February 15, 2013

Displaying Repository List with yum

If you wish to display a list of configured repositories which are enabled, you can use the command

# yum repolist

repo id        repo name                                                  status
epel           Extra Packages for Enterprise Linux 5 - x86_64              7,242
ius            IUS Community Packages for Enterprise Linux 5 - x86_64        229
rpmforge       Red Hat Enterprise 5 - RPMforge.net - dag                  11,158
xcat-2-core    xCAT 2 Core packages                                           14
xcat-dep       xCAT 2 depedencies                                             38
repolist: 18,681

# yum -v repolist

Repo-id      : epel
Repo-name    : Extra Packages for Enterprise Linux 5 - x86_64
Repo-revision: 1360778202
Repo-tags    : binary-x86_64
Repo-updated : Thu Feb 14 01:59:23 2013
Repo-pkgs    : 7,242
Repo-size    : 5.4 G
Repo-mirrors : http://mirrors.fedoraproject.org/mirrorlist?repo=epel-5&arch=x86_64
Repo-expire  : 3,600 second(s) (last: Sat Feb 16 01:03:31 2013)

Repo-id      : ius
Repo-name    : IUS Community Packages for Enterprise Linux 5 - x86_64
Repo-revision: 1360910159
Repo-updated : Fri Feb 15 14:36:21 2013
Repo-pkgs    : 229
Repo-size    : 441 M
Repo-mirrors : http://dmirr.iuscommunity.org/mirrorlist/?repo=ius-el5&arch=x86_64
Repo-expire  : 3,600 second(s) (last: Sat Feb 16 01:03:35 2013)

Repo-id      : rpmforge
Repo-name    : Red Hat Enterprise 5 - RPMforge.net - dag
Repo-updated : Fri Dec 21 10:44:36 2012
Repo-pkgs    : 11,158
Repo-size    : 5.8 G
Repo-baseurl : http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/
Repo-mirrors : http://apt.sw.be/redhat/el5/en/mirrors-rpmforge
Repo-expire  : 3,600 second(s) (last: Sat Feb 16 01:03:38 2013)

Repo-id      : xcat-2-core
Repo-name    : xCAT 2 Core packages
Repo-updated : Wed Nov 28 11:01:51 2012
Repo-pkgs    : 14
Repo-size    : 3.2 M
Repo-baseurl : https://sourceforge.net/projects/xcat/files/yum/2.7/xcat-core/
Repo-expire  : 3,600 second(s) (last: Sat Feb 16 01:03:44 2013)

Repo-id      : xcat-dep
Repo-name    : xCAT 2 depedencies
Repo-updated : Wed Feb  6 05:37:39 2013
Repo-pkgs    : 38
Repo-size    : 82 M
Repo-baseurl : https://sourceforge.net/projects/xcat/files/yum/xcat-dep/rh5/x86_64/
Repo-expire  : 3,600 second(s) (last: Sat Feb 16 01:03:51 2013)

repolist: 18,681 
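yum repolist also takes an argument controlling which repositories are shown; this is standard yum behaviour:

```shell
# yum repolist all        (every configured repository, enabled or not)
# yum repolist disabled   (only the disabled ones)
```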

Thursday, February 14, 2013

Installing PHP 5.4 on CentOS 5

First things first: install the repositories listed in Useful Repositories for CentOS 5, especially the IUS Community Repository.

Once you have installed the repositories, do the following:
# yum install php54 php54-common php54-devel

Wednesday, February 13, 2013

Show Start Estimate for MAUI

The reference is taken from the showstart reference in the MOAB Workload Manager documentation.

This command displays the estimated start time of a job. Since MAUI is the free, basic version of MOAB, from what I know, it does not have some of the features present in MOAB, the commercial version of the scheduler.

For MAUI, the command is

$ showstart JOBID

For example,

$ showstart 22380

job 22380 requires 32 procs for 10:00:00:00
Earliest start in       9:09:38:59 on Fri Feb 22 10:14:05
Earliest completion in 19:09:38:59 on Mon Mar  4 10:14:05
Best Partition: DEFAULT

Tuesday, February 12, 2013

Handling input files on PBS

The writeup on linuxcluster, "Handling input files on PBS", shows how you can handle input files on PBS, both for a single input file (serial run) and multiple input files (serial run).

Do read
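As a minimal sketch of the single-input-file (serial run) case, assuming a program a.out and an input file input.dat (both made-up names for illustration), a PBS submission script might look like:

```shell
#!/bin/bash
#PBS -N single_input_job
#PBS -l nodes=1:ppn=1
# Change to the directory the job was submitted from
cd $PBS_O_WORKDIR
# Redirect the single input file into the program and capture the output
./a.out < input.dat > output.log
```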

Monday, February 11, 2013

Display information about the Open MPI installation - ompi_info

ompi_info provides detailed information about the Open MPI installation. It is useful for checking the local configuration, seeing how Open MPI was installed, listing the installed Open MPI plugins, and querying what MCA parameters they support. For more information, see man ompi_info.

Sample usage from the man page

1. Show configuration Options
# ompi_info -c

           Configured by: root
           Configured on: Sun Mar 27 22:06:50 SGT 2011
          Configure host: HeadNode.mycluster.com
                Built by: root
                Built on: Sun Mar 27 22:17:13 SGT 2011
              Built host: HeadNode.mycluster.com
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /opt/intel/composerxe-2011.2.137/bin/intel64/icc
             C char size: 1
             C bool size: 1
            C short size: 2
              C int size: 4
             C long size: 8
            C float size: 4
           C double size: 8
          C pointer size: 8
            C char align: 1
            C bool align: 1
             C int align: 4
           C float align: 4
          C double align: 8
.....
.....
.....

2. Output is displayed in a nice-to-read format
$ ompi_info --pretty

.....     

C compiler absolute: /opt/intel/composerxe-2011.2.137/bin/intel64/icc
            C++ compiler: icpc
   C++ compiler absolute: /opt/intel/composerxe-2011.2.137/bin/intel64/icpc
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /opt/intel/composerxe-2011.2.137/bin/intel64/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /opt/intel/composerxe-2011.2.137/bin/intel64/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
.....
.....
.....

3. Display All MCA Parameters For OpenIB

# ompi_info --param btl openib

.....
.....
.....
                 MCA btl: parameter "btl_openib_max_btls" (current value: "-1", data source: default value)
                          Maximum number of device ports to use (-1 = use all available, otherwise must be >= 1)
                 MCA btl: parameter "btl_openib_free_list_num" (current value: "8", data source: default value)
                          Intial size of free lists (must be >= 1)
                 MCA btl: parameter "btl_openib_free_list_max" (current value: "-1", data source: default value)
                          Maximum size of free lists (-1 = infinite, otherwise must be >= 0)
                 MCA btl: parameter "btl_openib_free_list_inc" (current value: "32", data source: default value)
                          Increment size of free lists (must be >= 1)
.....
.....
.....

If you are using TCP, you can use a similar command:

# ompi_info --param btl tcp


For more information, see Using MCA Parameters With mpirun

Sunday, February 10, 2013

Useful Repositories for CentOS 5


RPMForge - This repository is a collaboration of Dag and other packagers
i386
# wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm
x86_64
# wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm



Extra Packages for Enterprise Linux Repository (EPEL)
Architecture-independent
# wget http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm



IUS Community Repository -
IUS is a third-party repo for RHEL that provides the "latest upstream versions of PHP, Python, MySQL".

# wget http://dl.iuscommunity.org/pub/ius/stable/Redhat/5/x86_64/ius-release-1.0-10.ius.el5.noarch.rpm
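After downloading, each release RPM is installed with rpm; for example, for the x86_64 packages above:

```shell
# rpm -Uvh rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm
# rpm -Uvh epel-release-5-4.noarch.rpm
# rpm -Uvh ius-release-1.0-10.ius.el5.noarch.rpm
```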

For a full listing, see

Wednesday, February 6, 2013

General run-time tuning for Open MPI 1.4 and later (Part 1)

Taken from 17. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)
When invoking mpirun with Open MPI 1.4 and above, you can pass the following parameters to improve performance:
  1. --bind-to-none: Do not bind processes (default)
  2. --bind-to-core: Bind each MPI process to a core
  3. --bind-to-socket: Bind each MPI process to a processor socket
  4. --report-bindings: Report how the launched processes are bound by Open MPI
If the hardware has multiple hardware threads, such as with Hyper-Threading, only the first thread of each core is used with the --bind-to-* options. According to the article, this is supposed to be fixed in v1.5.
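For example, to bind each of 4 MPI processes to its own core and print the resulting bindings (my_mpi_program is a placeholder name):

```shell
$ mpirun -np 4 --bind-to-core --report-bindings my_mpi_program
```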
The following options are to be used together with --bind-to-*:
  1. --byslot: Alias for --bycore
  2. --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. (Default)
  3. --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  4. --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.
Finally, you can use --cpus-per-proc, which binds ncpus OS processor IDs to each MPI process. Suppose there is a machine with 4 sockets of 4 cores each, hence 16 cores in total.
$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process
The command will bind each MPI process to ncpus=2 cores, so all cores on the machine will be used.

Sunday, February 3, 2013

Unable to restart pbs_mom on nodes

I was unable to restart pbs_mom on one of the compute nodes. A look at the log files in /var/spool/torque/mom_logs shows

 # less /var/spool/torque/mom_logs

pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, 
Unable to get my full hostname for grapefruit.local.spms.ntu.edu.sg error -1

Once you have this type of error, the Torque server will not be able to manage pbs_mom on the node that has this error.

To solve the issue, it is very simple: you have to resolve the discrepancy between the client hostname that the Torque server holds and what the Torque client reports. Check that /etc/hostname or /etc/resolv.conf has the necessary information. Look at Changing the hostname on CentOS on how to change the hostname.
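A quick way to see what the node thinks its own hostname is, before comparing it against what the Torque server expects:

```shell
# Short hostname as the node sees it
hostname
# Fully-qualified name; if this fails, name resolution is misconfigured
hostname -f || echo "FQDN lookup failed - check /etc/hosts and /etc/resolv.conf"
```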

Friday, February 1, 2013

Unable to SSH due to failure of authentication with GSSAPI

I had a very slow (or failing) ssh connection due to a failure in GSSAPI authentication. See my error below:

.....
debug1: Authentications that can continue: publickey,gssapi-with-mic,password
debug1: Next authentication method: gssapi-with-mic
debug1: Unspecified GSS failure.  Minor code may provide more information Unknown code krb5 195
debug1: Unspecified GSS failure.  Minor code may provide more information Unknown code krb5 195
debug1: Unspecified GSS failure.  Minor code may provide more information Unknown code krb5 195
debug1: Next authentication method: publickey
debug1: Trying private key: /home/user1/.ssh/identity
.....

It seems that the ssh connection first tries to authenticate with GSSAPI, and when that fails, it switches to publickey.

To change the ssh setting at the user level:
$ vim ~/.ssh/config

GSSAPIAuthentication no 

To change it at the global level for all users, edit the client configuration:
# vim /etc/ssh/ssh_config

GSSAPIAuthentication no 
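
The same setting can also be given for a single connection, without editing any file:

```shell
$ ssh -o GSSAPIAuthentication=no user1@remote-server
```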

More Information
  1. Slow SSH connections – hanging at GSSAPI auth