Saturday, December 31, 2011

How to disable SSLv2 and Weak Ciphers and enable SSLv3 on Linux

To comply with the Payment Card Industry Data Security Standard (PCI-DSS) v1.2, we are required to “use strong cryptography and security protocols such as SSL/TLS or IPSEC to safeguard sensitive cardholder data during transmission over open, public networks.”

Secure Sockets Layer (SSL) version 2 is considered weak cryptography in this respect, so SSLv2 should be disabled and SSLv3 enabled. Assuming you already have OpenSSL installed, you can use another remote server to test the https connection:

# openssl s_client -ssl2 -connect remote_server:443

If your server does not support SSLv2, you should receive the following error
CONNECTED(00000003)
22255:error:1407F0E5:SSL routines:SSL2_WRITE:ssl handshake failure:s2_pkt.c:428:

If your server still supports SSLv2 connections, the connection stays open and waits for input
CONNECTED(00000003)
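
When checking several servers, the two outcomes above can be told apart with a small shell function. This is only a sketch: sslv2_status is a hypothetical helper name, it looks for nothing more than the "handshake failure" and "CONNECTED" strings shown above, and it assumes your OpenSSL build still accepts the -ssl2 option (newer releases have removed SSLv2 support).

```shell
# Hypothetical helper: classify the text printed by
# "openssl s_client -ssl2 -connect host:443" (read from stdin).
sslv2_status() {
    local out
    out=$(cat)
    case "$out" in
        *"handshake failure"*) echo "rejected" ;;   # server refused SSLv2
        *CONNECTED*)           echo "accepted" ;;   # handshake went through
        *)                     echo "unknown"  ;;   # e.g. TCP connection failed
    esac
}

# Demo with the sample error text from this post
printf '%s\n' 'CONNECTED(00000003)' \
    '22255:error:1407F0E5:SSL routines:SSL2_WRITE:ssl handshake failure:s2_pkt.c:428:' \
    | sslv2_status
# prints: rejected
```

Against a live server the real output can be piped in, for example: openssl s_client -ssl2 -connect remote_server:443 < /dev/null 2>&1 | sslv2_status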

To use only SSLv3 and TLSv1, modify the SSLProtocol and SSLCipherSuite directives in httpd.conf or /etc/httpd/conf.d/ssl.conf. For example:
#SSLProtocol all -SSLv2
SSLProtocol -all +SSLv3 +TLSv1
In my /etc/httpd/conf.d/ssl.conf, the cipher suite is set to
SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:!LOW:!EXP:RC4+RSA:+HIGH:+MEDIUM

For more information, see
  1. How to Disable SSLv2 and Weak Ciphers (PCI Compliance) (http://almamunbd.blogspot.com)
  2. How to Disable SSLv2 and Weak Ciphers (PCI Compliance) (http://www.srcnix.com)


Friday, December 30, 2011

Important Apache (httpd) security Update

An important security update for httpd has been released. It is the solution for
  1. 'Devastating' Apache bug leaves servers exposed
  2. Apache released 2nd workaround for 'Devastating' Apache bug


Description of the bugs can be found at CVE-2011-3192

The byterange filter in the Apache HTTP Server 1.3.x, 2.0.x through 2.0.64, and 2.2.x through 2.2.19 allows remote attackers to cause a denial of service (memory and CPU consumption) via a Range header that expresses multiple overlapping ranges, as exploited in the wild in August 2011, a different vulnerability than CVE-2007-0086.

Solution:

# yum update httpd

Monday, December 19, 2011

Upgrading of Broadcom Drivers to resolve eth0 NIC SerDES Link is Down

If the post Encountering eth0 NIC SerDES Link is Down did not resolve your issue and you are still encountering "eth0 NIC SerDES Link" issues, upgrade the Broadcom drivers from your vendor's site, which should eliminate the issue. Since my vendor is IBM, I downloaded the Broadcom BNX2 drivers
Broadcom BNX2 driver version bnx2-2.0.23b for RHEL 5 - IBM System x and BladeCenter

If you are not sure which driver version you have, run
# ethtool -i eth0

Version 2.0.8 and above should resolve the issue
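
On a cluster with many nodes it helps to script this version check. The sketch below is an assumption-laden illustration, not part of ethtool or the Broadcom package: driver_version and version_ge are hypothetical helper names, and the sample text only imitates the "driver:" and "version:" lines of ethtool -i output.

```shell
# Print the value of the "version:" line from "ethtool -i" output on stdin
driver_version() {
    awk -F': ' '$1 == "version" { print $2 }'
}

# Succeed when version $1 is greater than or equal to version $2
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # "+ 0" keeps only the leading digits, so "23b" compares as 23
            p = x[i] + 0; q = y[i] + 0
            if (p > q) exit 0
            if (p < q) exit 1
        }
        exit 0
    }'
}

# Demo with sample text in the "ethtool -i" layout (assumed values)
ver=$(printf 'driver: bnx2\nversion: 2.0.23b\n' | driver_version)
if version_ge "$ver" 2.0.8; then
    echo "bnx2 $ver: new enough"
else
    echo "bnx2 $ver: upgrade recommended"
fi
# prints: bnx2 2.0.23b: new enough
```

On a real node the first step would read ethtool -i eth0 | driver_version instead of the sample text.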

If you are using IBM products and the above drivers from IBM, unpack the drivers, make sure you have the necessary prerequisites, and run the installer with the commands further below.

If you are using a free clone of Red Hat such as CentOS or Scientific Linux, you may want to temporarily modify /etc/redhat-release to simulate a real RHEL distribution, because vendor patches often require RHEL:

#CentOS release 5.4 (Final)
Red Hat Enterprise Linux AS release 5


# mkdir brcm
# cd brcm
# tar -zxvf brcm_dd_nic_netxtreme2-2.0.23b_1.62.15_rhel5_32-64.tgz
# ./install.pl --update 

 INSTALL_OPTIONS --yes --update


        Drivers will be installed/migrated to 2.6.18-164 version

----------------------------------------------------------------------
Checking kmod-brcm-netxtreme2-6.2.23-1.x86_64.rpm
WARNING: Non Whitelist symbol detected
----------------------------------------------------------------------
kmod-brcm-netxtreme2-6.2.23-1.x86_64.rpm installed successfully
SUCCESS

Saturday, December 17, 2011

sys_copy and scp -rpB error captured on pbs_mom logs

I encountered an interesting scp error in my pbs_mom log file:

.......pbs_mom: LOG_ERROR::sys_copy, 
command '/usr/bin/scp -rpB  2014.Head-Node.OU 
userid@headnode:/home/xxx' failed with status=1, 
giving up after 4 attempts

It seems that the error may be due to the default MaxStartups 10 setting in /etc/ssh/sshd_config, which is too low a value, so concurrent scp connections may be dropped.

According to the manual page:
MaxStartups - Specifies the maximum number of concurrent unauthenticated connections to the sshd daemon.  Additional connections will be dropped until authentication succeeds or the LoginGraceTime expires for a connection. The default is 10.


Try increasing MaxStartups to 100 in /etc/ssh/sshd_config
MaxStartups 100 
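
Before and after editing, it is worth confirming which MaxStartups line sshd will actually read, since a commented-out line has no effect. A minimal sketch, assuming the stock file location; show_maxstartups is a hypothetical helper name:

```shell
# Hypothetical helper: show the active (uncommented) MaxStartups line.
# $1 = path to sshd_config (defaults to /etc/ssh/sshd_config)
show_maxstartups() {
    conf="${1:-/etc/ssh/sshd_config}"
    grep -i '^[[:space:]]*MaxStartups' "$conf" 2>/dev/null \
        || echo "MaxStartups not set, sshd default applies"
}

# Demo on a throwaway file so the real configuration is not touched
tmp=$(mktemp)
printf '#MaxStartups 10\nMaxStartups 100\n' > "$tmp"
show_maxstartups "$tmp"
rm -f "$tmp"
# prints: MaxStartups 100
```

sshd only picks up the new value after it reloads its configuration, typically with service sshd restart on RHEL/CentOS 5.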

Friday, December 16, 2011

Blade hangs on boot and "FW/BIOS, firmware progress (ABR Status) FW/BIOS ROM corruption"

Two of my blades hung on boot and suffered this "FW/BIOS, firmware progress (ABR Status) FW/BIOS ROM corruption" error. For more information on the resolution, do look at Blade hangs on boot and "FW/BIOS, firmware progress (ABR Status) FW/BIOS ROM corruption" message in AMM - IBM BladeCenter HS22, HS22V

From the site,


Symptom
When booting BladeCenter HS22 or HS22V with Integrated Management Module (IMM) build yuoo84c installed, the blade may hang at the "UEFI Platform Initializing" screen. The hang will be accompanied by the following event in the chassis Advanced Management Module (AMM) log:
   

FW/BIOS, firmware progress (ABR Status) FW/BIOS ROM corruption



Solution
This behavior is corrected in IMM firmware release yuoo91e and newer.
The file is or will be available by selecting the appropriate machine type on the 'Product View' of IBM Support's Fix Central web page, at the following URL:
http://www.ibm.com/support/fixcentral/systemx/groupView?query.productGroup=ibm%2FBladeCenter




Workaround
This failure may be reduced by disabling Internet Protocol Version 6 (IPv6) support for the IMM. This can be done via the following steps:
1. Boot the blade to the F1 Unified Extensible Firmware Interface (UEFI) setup screen.
2. Select "System Settings" and press Enter
3. Select "Integrated Management Module" and press Enter
4. Select "Network Configuration" and press Enter
5. Change "IP6" setting to "Disable"

Occasionally, the failure can be recovered by restarting the IMM. If this is not successful, then it is necessary to reseat the blade in the chassis to recover. After a reseat, the blade will boot normally.

Thursday, December 15, 2011

Unable to edit fstab as it is a read only file during repair

I unwittingly changed the label of a partition in /etc/fstab and was presented with a bash prompt at bootup.
When I tried to revert to the correct label, vi could not save the edited file and instead showed the error message "Error writing fstab: Read-only file system"

To solve the issue, you have to remount the root file system:

mount -n -o remount / 
which worked fine for me.

Or, naming the file system type and device explicitly (adjust both to your system):
mount -n -o remount -t ext2 /dev/hda2 / 

Wednesday, December 14, 2011

Checking Torque Queue Attributes

To see the full attributes of a queue, use the command
qstat -f -Q queuename

The output will look like this:
Queue: dqueue
    queue_type = Execution
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    resources_default.neednodes = starfruit
    mtime = 1323678795
    resources_assigned.nodect = 0
    enabled = True
    started = True
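
When only one attribute is of interest, for example in a monitoring script, the "name = value" layout above is easy to filter. A minimal sketch; queue_attr is a hypothetical helper name and it assumes every attribute fits on a single line, as in the output above:

```shell
# Hypothetical helper: print the value of one attribute from
# "qstat -f -Q" output read on stdin. $1 = attribute name
queue_attr() {
    awk -v key="$1" -F' = ' '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip the indentation
        name == key { print $2 }'
}

# Demo with part of the sample output from this post
queue_attr resources_default.neednodes <<'EOF'
Queue: dqueue
    queue_type = Execution
    resources_default.neednodes = starfruit
    enabled = True
EOF
# prints: starfruit
```

With a live server this becomes qstat -f -Q dqueue | queue_attr enabled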

Tuesday, December 13, 2011

Using vim editor to find and replace effectively

Taken from this excellent article Vi and Vim Editor: 12 Powerful Find and Replace Examples

Here are a few examples that I love to use. You can see that this entry is a notepad for me.

Scenario 1: Replace all occurrences of a text with another text in the whole file
:%s/old-text/new-text/g
% - specifies all lines. Giving the range as ‘%’ means the substitution is done in the entire file.
g flag - specifies all occurrences in the line. With the ‘g’ flag, every occurrence in a line is substituted. Without it, only the first occurrence in each line is substituted.

Scenario 2: Replace a text with another text within a range of lines
:1,10s/old-text/new-text/gi
1,10 - Do the substitution from line 1 to line 10
i flag - Makes the search text case insensitive.


Scenario 3: Replace a text with another text in the first X lines from the cursor
Starting from the current position of the cursor, the command substitutes in the number of lines given by the count. For example, to do the substitution in 10 lines from the current line:
:s/old-text/new-text/g 10

Scenario 4: Substitute only the whole word and not partial match
If you wish to change the whole word "text" to "new-text"
Original Text: old to text
:s/\<text\>/new-text/
Translated Text: old to new-text
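
The whole-word markers \< and \> are not specific to vim. GNU sed understands them too, which makes it easy to see the difference between a partial and a whole-word match from the shell. A small sketch, assuming GNU sed (other sed implementations may not support \< and \>); the sample sentence is made up for illustration:

```shell
# Partial match: "text" inside "context" is replaced as well
echo "context and text" | sed 's/text/new-text/g'
# prints: connew-text and new-text

# Whole-word match: only the standalone word is replaced
echo "context and text" | sed 's/\<text\>/new-text/g'
# prints: context and new-text
```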


Sunday, December 11, 2011

How to associate compute nodes with a queue name with Torque

If you wish to use a queue that is locked to a selected group of nodes and open only to certain users, you may want to take a look at this contribution to a Rocks-Discuss thread
[Rocks-Discuss] [Torque roll] How to associate 10 compute nodeswith a queue name ?

From the write-up:

========
qmgr -c "create queue vision queue_type=execution"
qmgr -c "set queue vision resources_default.neednodes = vision"
qmgr -c "set queue vision acl_hosts=c2-0-20+c2-0-21+c2-0-22+c2-0-27+c2-0-28+c2-0-29"
qmgr -c "set queue vision acl_host_enable = false"
qmgr -c "set queue vision acl_users=user1"
qmgr -c "set queue vision acl_users+=user2"
qmgr -c "set queue vision acl_users+=user3"
qmgr -c "set queue vision acl_user_enable=true"
qmgr -c "set queue vision enabled = True"
qmgr -c "set queue vision started = True"

qmgr -c "set queue default resources_default.neednodes = general"

for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 23 24 25 26 30 31 32 33; 
do 
        /opt/torque/bin/qmgr -c "set node c2-0-${i} properties = general"; 
done

for i in 20 21 22 27 28 29 ; 
do 
        /opt/torque/bin/qmgr -c "set node c2-0-${i} properties = vision"; 
done
========
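
Since a typo in one of the node loops changes the live Torque configuration, it can be reassuring to print the qmgr commands first and review them. The following is a dry-run sketch only; node_property_cmds is a hypothetical helper name, and the c2-0- node prefix is taken from the write-up above.

```shell
# Hypothetical dry-run helper: print the qmgr commands instead of running them.
# $1 = property name, remaining arguments = node numbers
node_property_cmds() {
    prop="$1"
    shift
    for i in "$@"; do
        printf 'qmgr -c "set node c2-0-%s properties = %s"\n' "$i" "$prop"
    done
}

node_property_cmds vision 20 21 22 27 28 29
```

Once the printed commands look right, they can be piped to sh on the head node, provided qmgr is on the PATH (the write-up above uses /opt/torque/bin/qmgr).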
 
For more information, do also read up on
  1. 4.1 Queue Configuration (From Cluster Resources)
  2. Cluster Node-Locking with Torque and Maui (Wednesday, October 22, 2008)

Thursday, December 8, 2011

Configuration error when compiling octave with BLAS and LAPACK libraries

Do take a look at Compiling Octave from Source on CentOS 5. However, you may face an error such as
" configure: error: You are required to have BLAS and LAPACK libraries ".

This is due to a missing symbolic link for the lapack library. For more information on this error, you may want to take some hints from
Cannot find -llapack when doing /usr/bin/ld on CentOS 5

In other words, create a symbolic link for the lapack library in /usr/lib64
ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
 

Friday, December 2, 2011

Encountering eth0 NIC SerDES Link is Down

I noticed this error occasionally in the log files of my HS22 blade, and on one occasion the NFS mount, which relies on the ethernet connection, got disconnected and hung when the load was exceedingly high. The problem is very hard to reproduce as it is quite random.

My server uses the Broadcom bnx2 chipset and runs CentOS 5.4 with kernel version 2.6.18-164.el5

After a bit of searching, this particular Red Hat Bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=520888) reflects the problem and workaround very well. I encourage you to take a closer look. If you are not planning to upgrade your RHEL or CentOS to 5.6 ( http://rhn.redhat.com/errata/RHSA-2011-0017.html ) and above yet, you may want to consider the workaround as mentioned in the bugzilla



From Comment 14:

"Configuring IRQ SMP affinity has no effect on some devices that use message signalled interrupts (MSI) with no MSI per-vector masking capability. Examples of such devices include Broadcom NetXtreme Ethernet devices that use the bnx2 driver. 

If you need to configure IRQ affinity for such a device, disable MSI by creating a file in /etc/modprobe.d/ containing the following line: 

options bnx2 disable_msi=1 

Alternatively, you can disable MSI completely using the kernel boot parameter pci=nomsi. (BZ#432451)

" http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Technical_Notes/Known_Issues-kernel.html

To check whether you are still having issues, you can use the command
# dmesg |grep bnx2
I guess the best way is to update your Broadcom drivers. For the latest update on this "NIC SerDES Link is Down" issue, see my write-up on Upgrading of Broadcom Drivers to resolve eth0 NIC SerDES Link is Down

Thursday, December 1, 2011

Using stunnel to generate to create a self-signed certificate for SL 6 and CentOS 6

The stunnel program allows administrators to create self-signed certificates using the external OpenSSL libraries included with RHEL and its clones, to provide strong cryptography and protect connections. For more information on the installation and setup, see Using stunnel to generate to create a self-signed certificate for SL 6 and CentOS 6

Wednesday, November 30, 2011

pbs_mom LOG_ERROR sys_copy, command /usr/bin/scp -rpB

One of my parallel jobs failed, and this error appeared in the log file on my compute nodes.

pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB...............failed with status=1, 
giving up after 4 attempts

My SSH public/private key authentication is working without a hitch. Similarly, my /etc/hosts and firewall are as I expected. But I realised that my /etc/resolv.conf and /etc/sysconfig/network were incorrect. I got a hint of this possibility when I was reading this forum post http://www.mail-archive.com/mauiusers@supercluster.org/msg00998.html . After a quick amendment everything seems ok, at least for a while. I will write again if this solution turns out to be incorrect. :)

Tuesday, November 29, 2011

Companies Comparison for Storage in the Gartner Ranking

For information on companies selling storage boxes, do look at the article on
Ball-gazer casts magic runes to heal HP's credibility. Swallowing startups pushes up Gartner ranking.
NetApp and EMC are the leaders with NetApp ahead of EMC on the vision scale.


Installing Pylith using Pylith Installer


PyLith is a finite element code for the solution of dynamic and quasi-static tectonic deformation problems.

This entry will only focus on the compilation of Pylith from the installer. Most if not all of the information comes from INSTALLER files after you untar the software.

For more information, see Installing Pylith using Pylith Installer



Thursday, November 24, 2011

Cannot find -llapack when doing /usr/bin/ld on CentOS 5

I encountered an error when one of our researchers did a compilation of a Fortran Program which requires blas and lapack

$ g77 test.f  -L/usr/lib64/ -llapack -lblas
/usr/bin/ld: cannot find -llapack
collect2: ld returned 1 exit status

I was quite puzzled, as I had installed lapack and blas, yet the linker seemed unable to find lapack.

To check whether you have the libraries, you can use the command
$ ldconfig -p | grep lapack
 libscalapack.so.1 (libc6,x86-64) => /usr/lib64/libscalapack.so.1
 liblapack.so.3 (libc6,x86-64) => /usr/lib64/liblapack.so.3
So it is not an issue of missing lapack libraries. It is there. However, for -llapack the linker looks for a file named liblapack.so or liblapack.a, and only the versioned liblapack.so.3 is present. From the ld manual page:

           "On systems which support shared libraries, ld may also search for libraries with
           extensions other than ".a".  Specifically, on ELF and SunOS systems, ld will search a
           directory for a library with an extension of ".so" before searching for one with an
           extension of ".a".  By convention, a ".so" extension indicates a shared library.

           The linker will search an archive only once, at the location where it is  specified  on
           the  command  line.  If the archive defines a symbol which was undefined in some object
           which appeared before the archive on the command line,  the  linker  will  include  the
           appropriate  file(s)  from  the  archive.   However,  an  undefined symbol in an object
           appearing later on the command line will not cause the linker  to  search  the  archive
           again."


So creating a quick symbolic link solved the problem
$ ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
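
The same situation can be detected for any library before the link step fails. A minimal sketch that works on a scratch directory rather than /usr/lib64; missing_dev_link is a hypothetical helper name:

```shell
# Hypothetical helper: succeed when only a versioned lib<name>.so.N exists in
# directory $1, with no lib<name>.so or lib<name>.a for the linker to find.
missing_dev_link() {
    dir="$1"
    name="$2"
    [ ! -e "$dir/lib$name.so" ] && [ ! -e "$dir/lib$name.a" ] \
        && ls "$dir/lib$name.so."* >/dev/null 2>&1
}

# Demo in a scratch directory that mimics the situation above
d=$(mktemp -d)
touch "$d/liblapack.so.3"
missing_dev_link "$d" lapack && echo "before: link needed"
ln -s "$d/liblapack.so.3" "$d/liblapack.so"
missing_dev_link "$d" lapack || echo "after: linker can find liblapack.so"
rm -rf "$d"
```

On the real system the check would be missing_dev_link /usr/lib64 lapack.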



Wednesday, November 23, 2011

Unspecified GSS failure from SSH causes slow login

I tried to SSH into one of my servers and encountered the following error. Eventually, after waiting about 15-20 seconds, I was able to connect. This was far too long for a LAN-based machine.

$ ssh -v ip_of_remote_server

.....
debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195
.....


I was quite puzzled, as I was using the IP address of the server to ssh and had already set "UseDNS = no" in /etc/ssh/sshd_config (see Resolving Slow SSH Login). In addition, I was using SSH public/private key authentication (see Auto SSH Login without Password).


But the resolution for this issue was easier than I thought. I just needed to ensure that /etc/hosts contains both the server I ssh from and the server I ssh to, and the login became very quick.

If you are using DNS instead of /etc/hosts, do check your DNS settings in /etc/resolv.conf

For other issues on SSH, you may want to read about
  1. SSH Error : Permission denied (publickey,gssapi-with-mic,password)
  2. Resolving Slow SSH Login

Tuesday, November 22, 2011

List of Intel Xeon and AMD Microprocessors with pricing

The Wikipedia listing of Intel Xeon microprocessors with pricing in USD is very useful for price comparison and budgeting. See Wikipedia List of Intel Xeon Microprocessors

Similarly, the listing of AMD microprocessors from Wikipedia is very informative, but sadly it has no price listing
List of AMD Opteron microprocessors

Monday, November 21, 2011

Brief overview of Valgrind usage

This write-up covers some very basic commands, but I will try to list some other tutorials and reading to complement this lack of information. I'm assuming that you have compiled the program as written in Compiling Valgrind on CentOS 5.

One of the most commonly used commands in Valgrind is
# valgrind --tool=memcheck --leak-check=full ./my_program

Commonly-used Options
  1. --leak-check=<no|summary|yes|full> [default: summary]
     When enabled, search for memory leaks when the client program finishes. If set to summary, it says how many leaks occurred. If set to full or yes, it also gives details of each individual leak.
  2. --show-reachable=<yes|no> [default: no]
     When disabled, the memory leak detector only shows "definitely lost" and "possibly lost" blocks. When enabled, the leak detector also shows "reachable" and "indirectly lost" blocks. (In other words, it shows all blocks, except suppressed ones)

For more details on Valgrind options and how to use them, see
  1. Valgrind Manual - 4.3 Memcheck Command Options
  2. Using Valgrind to Find Memory Leaks and Invalid Memory Use
  3. Using Valgrind to debug memory leaks

Wednesday, November 16, 2011

Removing a node from the Ganglia Web Frontend

According to the Ganglia_Readme, there is no easy way to remove a single dead node from the list in the Ganglia web front-end. To have dead nodes flushed from the record, add the following to /etc/gmond.conf and restart the gmetad and gmond processes:

globals { 
host_dmax = 3600 
}

The hosts will be removed from host tables when they haven't been heard from in 3600 seconds. See "man gmond.conf" for details.

Sunday, November 13, 2011

Compiling Valgrind on CentOS 5

Valgrind tools automatically detect many memory management and threading bugs, and can profile your programs in detail. Valgrind runs on the following platforms: X86/Linux, AMD64/Linux, ARM/Linux, PPC32/Linux, PPC64/Linux, S390X/Linux, ARM/Android (2.3.x), X86/Darwin and AMD64/Darwin (Mac OS X 10.6 and 10.7).

According to Valgrind, a number of useful tools are supplied as standard.
  1. Memcheck is a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.
  2. Cachegrind is a cache and branch-prediction profiler. It helps you make your programs run faster.
  3. Callgrind is a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.
  4. Helgrind is a thread error detector. It helps you make your multi-threaded programs more correct.
  5. DRD is also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.
  6. Massif is a heap profiler. It helps you make your programs use less memory.
  7. DHAT is a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.
  8. SGcheck is an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can't, and vice versa.
  9. BBV is an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.
Compilation of Valgrind
Compilation is very straightforward:
# tar -xvjpf valgrind-3.7.0.tar.bz2
# cd valgrind-3.7.0
# ./configure --prefix=/usr/local/valgrind-3.7.0
# make; make install
Testing Valgrind
# /usr/local/valgrind-3.7.0/bin/valgrind ls -l
Either this works, or it bombs out with some complaint.

For more information, see Compiling Valgrind on CentOS 5

Saturday, November 12, 2011

Compiling adaptive Poisson-Boltzmann Solver (APBS) on CentOS 5


Adaptive Poisson-Boltzmann Solver (APBS) is a software package for modeling biomolecular solvation through solution of the Poisson-Boltzmann equation (PBE), one of the most popular continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media......

Installation is very simple. There are many binaries available and you can use them directly. Do note that the latest binaries (apbs-1.3) require glibc 2.7 or greater. If you are using CentOS 5, you may want to use the apbs-1.21 binaries or below.

For details, see Compiling adaptive Poisson-Boltzmann Solver (APBS) on CentOS 5 on Linux Cluster

Tuesday, November 8, 2011

Using strace as a troubleshooting tool

Taken from Using strace as a troubleshooting tool (linuxcluster.wordpress.com)

strace, when run in conjunction with a program, outputs all the calls made to the kernel by the program.

One quick way to find out what is going on in your program is to do
$ strace -c ./my_hello_world_program
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
74.80    0.002998        1499         2           wait4
21.91    0.000878           4       221           read
0.95    0.000038           0       237         2 mmap
0.77    0.000031          10         3         1 mkdir
0.67    0.000027           0       566       361 open
0.35    0.000014           0        81           mprotect
0.30    0.000012           0        62        37 stat
0.25    0.000010           0       225           close
0.00    0.000000           0        37         1 write
0.00    0.000000           0       132           fstat
0.00    0.000000           0         8           poll
0.00    0.000000           0         2           lseek
0.00    0.000000           0       120           munmap
0.00    0.000000           0        15           brk
0.00    0.000000           0        16           rt_sigaction
................

................

------ ----------- ----------- --------- --------- ----------------
100.00    0.004008                  1990       411 total

To trace every call, just run the program under strace. If there was an error, you can easily spot it in the output
$ strace ./my_hello_world_program
............

............

open("/tmp/openmpi-sessions-root@starfruit-h00.cluster.spms.ntu.edu.sg_0/25979/1/0",
O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory)
munmap(0x2b46e05ef000, 2111200)         = 0
munmap(0x2b46dffe5000, 2102312)         = 0
munmap(0x2b46dfdde000, 2123264)         = 0
munmap(0x2b46e103f000, 2106960)         = 0
munmap(0x2b46e1242000, 2104560)         = 0
munmap(0x2b46e269d000, 2114912)         = 0
munmap(0x2b46e41c9000, 2145008)         = 0
munmap(0x2b46e43d5000, 2162608)         = 0

If you wish the output of strace to a file instead, do use the argument -o
$ strace -o strace_output_file ./my_hello_world_program

To restrict tracing to a class of system calls such as file, process or network calls, use "-e trace=file", "-e trace=process" or "-e trace=network". You can also list individual system calls:
$ strace -e trace=open,close,read,write ./my_hello_world_program
$ strace -e trace=stat,chmod,unlink ./my_hello_world_program
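
The "errors" column of the strace -c summary is often the quickest pointer to a problem, and it can be pulled out with a short filter. A minimal sketch; syscall_errors is a hypothetical helper name, and it assumes the six-column layout shown earlier in this post:

```shell
# Hypothetical helper: list syscalls with a non-empty "errors" column
# from "strace -c" summary text read on stdin.
syscall_errors() {
    awk 'NF == 6 && $1 ~ /^[0-9.]+$/ {
        printf "%s: %d of %d calls failed\n", $6, $5, $4
    }'
}

# Demo with a few rows of the summary shown earlier in this post
syscall_errors <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
74.80    0.002998        1499         2           wait4
0.67    0.000027           0       566       361 open
0.30    0.000012           0        62        37 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.004008                  1990       411 total
EOF
# prints:
# open: 361 of 566 calls failed
# stat: 37 of 62 calls failed
```

For a real program, write the summary to a file first with strace -c -o summary.txt ./my_hello_world_program and then run syscall_errors < summary.txt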
Further Information:
  1. Solutions for tracing UNIX applications (IBM DeveloperWorks)
  2. strace - A very powerful troubleshooting tool for all Linux users (linuxhelp.blogspot.com)
  3. Ten commands every linux developer should know (Linux Journal)

Friday, November 4, 2011

A*Star Computational Resource Centre Software Listings

A*Star which is the main Government funded research organisation in Singapore has a highly effective A*STAR Computational Resource Centre (A*CRC) which provides high performance computational (HPC) resources to the entire A*STAR research community.

They have an interesting software listing which includes,   
  1. Biology and Bioinformatics
  2. Chemistry and Molecular Modeling
  3. Physics and Material Science 
  4. Mathematical, Statistical and Other Utilities
  5. Software Development
  6. System Software

Thursday, November 3, 2011

NetApp posts world-record SPEC SFS2008 NFS benchmark result

NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. For more information, see NetApp posts world-record SPEC SFS2008 NFS benchmark result

Wednesday, November 2, 2011

Basic Overview and use of NMON on CentOS 5



nmon for Linux (Nigel’s performance Monitor for Linux) is a wonderful Swiss Army Knife for performance information. You can display multiple screens in the same window and get information on CPU, Memory, NFS, Network, Disks, Resource, kernel etc


For more information, do look at Basic Overview and use of NMON on CentOS 5 from Linux Cluster

Tuesday, November 1, 2011

Installing ALPS 2.0 from source on CentOS 5

What is ALPS Project?

The ALPS project (Algorithms and Libraries for Physics Simulations) is an open source effort aiming at providing high-end simulation codes for strongly correlated quantum mechanical systems as well as C++ libraries for simplifying the development of such code. ALPS strives to increase software reuse in the physics community.

Good information on installing ALPS can be found on ALPS Wiki's Download and install ALPS for Ubuntu 9.10, Ubuntu 10.04, Ubuntu 10.10, Debian and MacOS

Installing ALPS with Boost

# wget http://alps.comp-phys.org/static/software/releases/alps-2.0.2-r5790-src-with-boost.tar.gz
You will need either gfortran or the Intel Fortran Compiler. If you are installing using gfortran
# yum install gcc-c++ gcc-gfortran
If you want to use the evaluation tools, you will need to install a newer version of Python than the provided 2.4. You can install from source or use an unofficial repository for binary RPMs. This is not required if you just want to run your compiled simulations (c++ applications), but make sure you still have python headers (specify -DALPS_BUILD_PYTHON=OFF when invoking cmake):
# yum install python-devel
BLAS/LAPACK is necessary. Make sure you have the EPEL repository ready. For more information, see Red Hat Enterprise Linux / CentOS Linux Enable EPEL (Extra Packages for Enterprise Linux) Repository
# yum install blas-devel lapack-devel
CMake 2.8.0 and HDF5 1.8 need to be installed. ALPS comes with wonderful scripts that help to compile CMake 2.8 and HDF5 1.8 on CentOS 5
$ $HOME/src/alps2/script/cmake.sh $HOME/opt $HOME/tmp
$ $HOME/src/alps2/script/hdf5.sh $HOME/opt $HOME/tmp

Build ALPS

Create a build directory (anywhere you have write access) and execute cmake giving the path to the alps and to the boost directory:
# cmake -D Boost_ROOT_DIR:PATH=/path/to/boost/directory /path/to/alps/directory
For example, if the alps precompiled directory is in /root/alps-2.0.2
# cmake -D Boost_ROOT_DIR:PATH=/root/alps-2.0.2/boost /root/alps-2.0.2/alps
To install in another directory, set the variable CMAKE_INSTALL_PREFIX
# cmake -DCMAKE_INSTALL_PREFIX=/path/to/install/directory /path/to/alps/directory
For example:
# cmake -DCMAKE_INSTALL_PREFIX=/usr/local/alps-2.0.2 /root/alps-2.0.2/alps

Build and test ALPS

$ make -j 8
$ make test
$ make install
* The HDF5 1.8 binaries and libraries are useful not only for compiling ALPS but also for other applications that require HDF5 1.8. You may want to consider moving them to the /usr/local/ directories

Monday, October 31, 2011

Set higher MTU for vSwitch and Virtual Distributed Switch

Refer to KB: iSCSI and Jumbo Frames configuration on ESX 3.x and ESX 4.x for more details

Note: Any frame with an MTU larger than 1500 bytes is a jumbo frame. ESX supports frames up to 9 KB (9000 bytes).

 To set the MTU size for the vSwitch, run the command:

# esxcfg-vswitch -m (MTU_Number) (Vswitch)

where MTU_Number = 9000, Vswitch = Name of the Vswitch

This command sets the MTU for all uplinks on that vSwitch. Set the MTU size to the largest MTU size among all the virtual network adapters connected to the vSwitch.



Refer to KB: Enabling Jumbo Frames for VMkernel ports in a virtual distributed switch


Run this command to change the MTU size for the individual port group:

# esxcfg-vmknic -m 9000 -v (port number) -s (dvs Switch name) 

For Example:
# esxcfg-vmknic -m 9000 -v 115 -s "NewLAN-DVS"


To enable Jumbo Frames on a VMkernel port from vCenter Server:

1) Click Home > Hosts and Clusters > Host > Configuration > Networking.
2) Navigate to the vSphere Distributed Switch tab.
3) Click the VMkernel port (eg: vmk1)
4) Click Manage Virtual Adapters.
5) Select the vmk interface and click Edit.
6) Under the NIC settings, change the MTU value to 9000.
7) Click OK.

Saturday, October 29, 2011

Error in Compiling GotoBLAS2 in Westmere Chipsets

GotoBLAS2 uses new algorithms and memory techniques for optimal performance of the BLAS routines.

When I tried compiling GotoBLAS2 on my Westmere chipset by following "02QuickInstall.txt", I got this error

../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
...........
...........
...........

gcc -O2 -Wall -m64 -DF_INTERFACE_INTEL -fPIC  -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=strmm_kernel_RN -DASMFNAME=strmm_kernel_RN_ -DNAME=strmm_kernel_RN_ -DCNAME=strmm_kernel_RN -DCHAR_NAME=\"strmm_kernel_RN_\" -DCHAR_CNAME=\"strmm_kernel_RN\" -I.. -UDOUBLE  -UCOMPLEX -c -DTRMMKERNEL -UDOUBLE -UCOMPLEX -ULEFT -UTRANSA ../kernel/x86_64/gemm_kernel_8x4_sse3.S -o strmm_kernel_RN.o
make[1]: *** [sgemm_oncopy.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/root/GotoBLAS2/kernel'


I was quite puzzled as to why the compilation did not work. I googled and found a wonderful answer in Trouble compiling GotoBLAS2 on newer CPU. Basically, you will need to

gmake clean
gmake TARGET=NEHALEM
Eventually you will get something like

 GotoBLAS build complete.

  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... INTEL  (command line : ifort)
  Library Name     ... libgoto2_nehalemp-r1.13.a (Multi threaded; Max num-threads is 8)



According to "Trouble compiling GotoBLAS2 on newer CPU", the problem appears to be that newer CPUs (Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBLAS2.

The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements depending on CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.

Friday, October 28, 2011

IBM DS4000 (FastT) Storage Manager Client on Windows

Boy, this blog post, "How to Install IBM DS4000 (FastT) Storage Manager Client on Windows", is really helpful for finding information and the software for the IBM DS4000 (FastT) Storage Manager Client. It is even easier than searching the IBM website.

......
1) Download the software.
For Windows XP, 2000, 2003, or 2008, on a 32-bit platform, click here.
For Windows Vista, 2003, or 2008, on a 64-bit platform, click here.
For Windows Vista 32-bit, click here.

If these links don’t work for you, try navigating IBM’s site:
www.ibm.com > support & downloads > fixes, updates, and drivers
Category > (under SYSTEMS) system storage
Product Family > disk systems, Product > DS4800 or whichever DS4000
Select Operating System and Click GO
Click Downloads on the Support and Downloads box.
.............

For more information do go to How to Install IBM DS4000 (FastT) Storage Manager Client on Windows

Wednesday, October 26, 2011

Comparison of File System

Wikipedia did a wonderful job in providing a comparison of various file systems: Comparison of file systems. For those who are assessing various file system offerings, do read it.

Another, shorter but less comprehensive, is the File System Primer from Novell.

Another is a comparison between the Lustre File System and Panasas ActiveScale: Performance comparison of the Cluster File Systems at the Intel CRT-DC.

Monday, October 24, 2011

Looking for solution.... CPU soft lockup detected for CentOS 5 and IBM BladeCentre

One of my IBM BladeCentre nodes hangs, and the log shows the message
"BUG: soft lockup - CPU#1 stuck for 10s! [rpciod/1:3646]"

I have been looking around and found a reference, though not a solution:
RHEL 5.X CPU soft lockup detected in PAGE_LOCK_ANON_VMA - IBM BladeCenter and System x 

Symptom
While running the Red Hat Enterprise Linux 5.x (RHEL5) family of products, the kernel may report the following error:
BUG: soft lockup - CPU#XX stuck for 10s!

Where XX can be the number of any processor in the system.
The associated stack backtrace points to page_lock_anon_vma as the code running at the time of the soft lockup detection.

Affected configurations
The system is configured with at least one of the following:
  • Red Hat Enterprise Linux 5, any update
This tip is not system specific.
This tip is not option specific.
Note: This does not imply that the network operating system will work under all combinations of hardware and software. Please see the compatibility page for more information: http://www.ibm.com/servers/eserver/serverproven/compat/us/

Solution
This behavior will be corrected in subsequent families of Red Hat products. For more information, contact Red Hat at the following URL:
http://www.redhat.com/about/contact/

Sunday, October 23, 2011

What is nearline SAS Hard Disk?

"Nearline" refers to hard disks with a lower rotational speed, which usually means high-capacity SATA HDDs. On the other hand, "SAS is an enterprise-class drive which supposedly has a more robust mechanical specification and a controller/firmware optimized for high-volume I/O, manageability, and better error detection and correction".

So nearline SAS really means standard consumer-class SATA drives with a SAS interface.

Large capacity with an enterprise interface... Not too bad.

For more information on Nearline, see http://en.wikipedia.org/wiki/Nearline_storage

Saturday, October 22, 2011

Myricom DBL 2.0 Achieves Lowest UDP and TCP Latency for High Frequency Trading

For the full article, see Myricom DBL 2.0 Achieves Lowest UDP and TCP Latency for High Frequency Trading

Myricom DBL 2.0 software has benchmarked application-to-application UDP latency of under 3.5 microseconds and transparent sockets TCP latency of 4.0 microseconds. For HFT applications, DBL enables unmatched networking performance for UDP multicast and TCP order execution, all over industry-standard 10-Gigabit Ethernet
 .......
DBL reduces latency by microseconds for existing applications running on standard TCP/UDP Ethernet networks. With the DBL solution, end-users can achieve extreme performance without rewriting their applications or resorting to specialty networks such as Infiniband. DBL provides transparent acceleration in both Linux and Windows environments.
......
In addition to extremely low UDP and TCP communication latency, DBL 2.0 delivers repeatable low latency, rather than unpredictable and variable latency performance found with competing solutions. Repeatable low latency performance is critical, as packet delay or loss in mission-critical trading and order environments can be devastating to the traders' bottom line.

Wednesday, October 19, 2011

What is SCSI RDMA Protocol (SRP)?

What is SCSI RDMA Protocol (SRP)?

"The SCSI RDMA Protocol (SRP) is an emerging industry standard protocol for utilizing block storage devices over an InfiniBand™ fabric. The use of RDMA makes higher throughput and lower latency possible than what is possible through e.g. the TCP/IP communication protocol. RDMA is only possible with network adapters that support RDMA in hardware."

With this, you can use InfiniBand as an alternative interconnect instead of relying on Fibre Channel. The advantages of InfiniBand are obvious: it has very high throughput and low latency, which is important for heavy read-write workloads.

Using dd to test and analyse read and write performance

According to Wikipedia, dd is a common UNIX program whose primary purpose is the low-level copying of raw data. There are many uses for dd, but for this blog we will use dd to test and analyse the read and write performance of a file system.

# dd if=/dev/zero of=/home/myaccount/outfile bs=4M count=4096 

4096+0 records in
4096+0 records out
17179869184 bytes (17 GB) copied, 433.088 seconds, 39.7 MB/s

if = input file
of = output file
bs = block size
count = number of blocks of size bs to copy (so the file size is bs x count)
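
As a quick sanity check on the figures dd prints, the arithmetic can be sketched in Python. This is only an illustration using the numbers from the run above; the function name dd_summary is my own and not part of dd.

```python
def dd_summary(bs_bytes, count, seconds):
    """Return (total bytes written, throughput in MB/s) for a dd run.

    dd reports MB as 10**6 bytes, so the same convention is used here.
    """
    total = bs_bytes * count
    return total, total / seconds / 1_000_000


# Figures from the run above: bs=4M, count=4096, 433.088 seconds.
total, mb_per_s = dd_summary(4 * 1024 * 1024, 4096, 433.088)
print(total, "bytes,", round(mb_per_s, 1), "MB/s")
```

The total works out to bs x count, which is why count should be read as a number of blocks and not as a size.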

Tuesday, October 18, 2011

Malformed database image issue with yum

Today I was doing a yum install after updating my LVM and I hit a "malformed database image" error. This error can be easily rectified. Just do a

# yum clean dbcache

Then do a
# yum check

Monday, October 17, 2011

Extend LVM on Vmware Linux Guest

One of my mirrors ran out of space today. I came across an excellent article on How to extend LVM on Vmware Guest running Linux by Edward's Blog. I tried his tutorial and it worked without a hitch.


Saturday, October 15, 2011

Which File System Blocksize is suitable for my system?

Taken from IBM Developer Network "File System Blocksize"

Although the article refers to the General Parallel File System (GPFS), there are many good pointers that System Administrators can take note of.

Here are some excerpts from the article........ 

This is one question that many system administrators ask before they start preparing a system: how do you choose a blocksize for your file system? IBM Developer Network (File System Blocksize) recommends the following block sizes for various types of application.


IO Type                 Application Examples                                                   Blocksize
Large Sequential IO     Scientific Computing, Digital Media                                    1MB to 4MB
Relational Database     DB2, Oracle                                                            512KB
Small I/O Sequential    General File Service, File based Analytics, Email, Web Applications    256KB
Special*                Special                                                                16KB-64KB

What if I do not know my application IO profile?
Often you do not have good information on the nature of the IO profile or the applications are so diverse it is difficult to optimize for one or the other. There are generally two approaches to designing for this type of situation: separation or compromise.

Separation
In this model you create two file systems, one with a large file system blocksize for sequential applications and one with a smaller block size for small file applications. You can gain benefits from having file systems of two different block sizes even on a single type of storage. Or you can use different types of storage for each file system to further optimize to the workload. In either case the idea is that you provide two file systems to your end users, for scratch space on a compute cluster for example. Then the end users can run tests themselves by pointing the application to one file system or another and determining by direct testing which is best for their workload. In this situation you may have one file system optimized for sequential IO with a 1MB blocksize and one for more random workloads at 256KB block size.

Compromise
In this situation you either do not have sufficient information on workloads (i.e. end users won't think about IO performance) or enough storage for multiple file systems. In this case it is generally recommended to go with a blocksize of 256KB or 512KB depending on the general workloads and storage model used. With a 256KB block size you will still get good sequential performance (though not necessarily peak marketing numbers) and you will get good performance and space utilization with small files (256KB has minimum allocation of 8KB to a file). This is a good configuration for multi-purpose research workloads where the application developers are focusing on their algorithms more than IO optimization.
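
To make the recommendations easier to apply, here is a minimal Python sketch of the table as a lookup, with the 256KB compromise as the fallback. The names are my own, and the 1/32 ratio for the minimum allocation is an assumption inferred only from the "256KB has minimum allocation of 8KB" remark above; check it against your own file system.

```python
# Recommendations copied from the table above (sizes in KB).
RECOMMENDED_KB = {
    "large sequential": (1024, 4096),
    "relational database": (512, 512),
    "small sequential": (256, 256),
    "special": (16, 64),
}

# Assumption: the smallest allocation is 1/32 of the blocksize, inferred
# only from the "256KB has minimum allocation of 8KB" remark in the text.
SUBBLOCKS_PER_BLOCK = 32


def recommended_range_kb(io_type):
    """Return the (min, max) blocksize in KB, or the 256KB compromise."""
    return RECOMMENDED_KB.get(io_type.lower(), (256, 256))


def min_allocation_kb(blocksize_kb):
    """Smallest space a file occupies under the 1/32 assumption."""
    return blocksize_kb / SUBBLOCKS_PER_BLOCK


print(recommended_range_kb("Relational Database"))
print(recommended_range_kb("unknown mix"))
print(min_allocation_kb(256))
```

An unknown IO profile falls back to 256KB here, which is the lower of the two compromise values the article mentions.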

Friday, October 14, 2011

How to check FileSystem Block Size on Linux

In case you wish to find out what block size your file system is using, you can use the following commands to check

# tune2fs -l /dev/sda1 | grep -i 'block size'
Block size:1024

# blockdev --getbsz /dev/sda1
1024
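
If you prefer to check from a script, a rough alternative is Python's os.statvfs. Note that this is a different technique from the two commands above: it asks about the mounted file system that holds a path, not about a block device, so the value may not match tune2fs on every file system. The helper name is my own.

```python
import os


def fs_block_size(path="/"):
    """Return the block size reported for the file system holding `path`."""
    st = os.statvfs(path)
    # f_frsize is the fundamental block size; fall back to f_bsize if it is 0.
    return st.f_frsize or st.f_bsize


print(fs_block_size("/"))
```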

Thursday, October 13, 2011

Tuning rsize and wsize on NFS for a 10GbE network

Taken from Myricom Site "Do you have recommendations for tuning NFS on a 10GbE network"

  1. Use a recent Linux kernel, 2.6.19 or later. CentOS 6 will be a good candidate to implement.
  2. In /etc/fstab, you can set rsize=1048576,wsize=1048576
  3. You can use the above buffers on NFSv3
  4. Do note that for Linux kernel 2.6.18 and below, the maximum rsize and wsize is 32KB.
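
As an illustration of point 2, the small Python sketch below builds an /etc/fstab line with those buffer sizes. The server name, export and mount point are hypothetical placeholders, and the helper itself is my own; only the rsize and wsize values come from the list above.

```python
def nfs_fstab_line(server, export, mountpoint, rsize=1048576, wsize=1048576):
    """Build one /etc/fstab line for an NFSv3 mount with explicit buffer sizes."""
    options = f"vers=3,rsize={rsize},wsize={wsize}"
    # Fields: device, mount point, type, options, dump, fsck pass.
    return f"{server}:{export} {mountpoint} nfs {options} 0 0"


# Hypothetical server and paths, for illustration only.
print(nfs_fstab_line("nfs-server", "/export/data", "/mnt/data"))
```

1048576 is simply 1024 x 1024, i.e. 1MB buffers for both reads and writes.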

Monday, October 10, 2011

Using mpstat to display SMP CPU statistics

mpstat is a command-line utility that reports CPU-related statistics. On CentOS, mpstat is part of the sysstat package (http://sebastien.godard.pagesperso-orange.fr/), so install that:
# yum install sysstat
1. mpstat is very straightforward. Use the command below. On my 32-core machine,
# mpstat -P ALL
11:10:11 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:10:13 PM  all   40.75    0.00    0.03    0.00    0.00    0.00    0.00   59.22   1027.50
11:10:13 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1000.50
11:10:13 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     16.50
11:10:13 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   12  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00     10.50
11:10:13 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   14   99.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   19  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   20  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   23    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   24  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   25  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   26    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:10:13 PM   27  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
11:10:13 PM   28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
where
CPU - Processor number. The keyword all indicates that statistics are calculated as averages among all processors.
%user - Show the percentage of CPU utilization that occurred while executing at the user level (application).
%nice - Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
%sys - Show the percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing interrupts or softirqs.
%iowait - Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%irq - Show the percentage of time spent by the CPU or CPUs to service interrupts.
%soft - Show the percentage of time spent by the CPU or CPUs to service softirqs. A softirq (software interrupt) is one of up to 32 enumerated software interrupts which can run on multiple CPUs at once.
%steal - Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%idle - Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
intr/s - Show the total number of interrupts received per second by the CPU or CPUs.


2. Getting an average from mpstat. To get an average, you have to pass the interval and count arguments. In the example, the interval is 2 seconds and the count is 5.
# mpstat -P ALL 2 5
At the end of the statistics report, you will see an average
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
Average:     all   40.76    0.00    0.03    0.00    0.00    0.00    0.00   59.21   1047.50
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1000.60
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       4   99.90    0.00    0.10    0.00    0.00    0.00    0.00    0.00      0.00
Average:       5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:       7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:       8    0.00    0.00    0.10    0.00    0.00    0.00    0.00   99.90     17.30
Average:       9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      12   99.90    0.00    0.00    0.00    0.00    0.10    0.00    0.00     29.70
Average:      13    0.00    0.00    0.10    0.00    0.00    0.00    0.00   99.90      0.00
Average:      14   99.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00      0.00
Average:      15  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      17    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      18    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      19  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      20  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      21    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      22    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      23    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      24  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      25   99.90    0.00    0.10    0.00    0.00    0.00    0.00    0.00      0.00
Average:      26    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      27  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
Average:      30  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
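
With this many cores it is easier to let a script pick out the busy ones. Below is a minimal Python sketch that reads the "Average:" lines and lists the CPUs whose %idle is below a threshold. The sample text is a few lines copied from the report above; the function name and the 50% threshold are my own choices.

```python
SAMPLE = """\
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
Average:     all   40.76    0.00    0.03    0.00    0.00    0.00    0.00   59.21   1047.50
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1000.60
Average:       2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
Average:       4   99.90    0.00    0.10    0.00    0.00    0.00    0.00    0.00      0.00
Average:       8    0.00    0.00    0.10    0.00    0.00    0.00    0.00   99.90     17.30
"""


def busy_cpus(report, idle_below=50.0):
    """Return CPU numbers whose average %idle is below `idle_below`."""
    idle_col = None
    busy = []
    for line in report.splitlines():
        fields = line.split()
        if not fields or fields[0] != "Average:":
            continue
        if fields[1] == "CPU":
            # Header line: remember which column holds %idle.
            idle_col = fields.index("%idle")
            continue
        if fields[1] == "all" or idle_col is None:
            continue
        if float(fields[idle_col]) < idle_below:
            busy.append(int(fields[1]))
    return busy


print(busy_cpus(SAMPLE))
```

The %idle column is located from the header line instead of being hard-coded, since the set of columns differs between sysstat versions.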

Sunday, October 9, 2011

Outline of File Hierarchy Systems in RHEL, CentOS and SL

The locations of files and directories in RHEL and its clones are based on the Filesystem Hierarchy Standard (FHS) guidelines. For more information on the FHS, do read the Filesystem Hierarchy Standard.

  1. /bin/ (Essential Commands for admins and users)
  2. /usr/bin/ (Common commands for admins and users)
  3. /sbin/ (Essential commands for admins)
  4. /usr/sbin (Common commands for admins)
  5. /tmp/ (Temporary files for all users)
  6. /usr/local/ (Location for locally-installed software independent of operating system updates)
  7. /usr/share/man (Manual Pages)
  8. /usr/src (source code)
  9. /var/ (variable files such as spool and log files)
  10. /var/log/ (Log files)
  11. /etc/ (Configuration files)
  12. /proc/ (Kernel virtual file system)
  13. /dev/ (Device files)
Much of this information is derived from "Red Hat Enterprise Linux Administration Unleashed"

Saturday, October 8, 2011

Avoiding DNS Lookup for Apache 2

DNS lookups for client machines slow down Apache, so you may wish to avoid them. This is quick to do: set the HostNameLookups directive to Off in /etc/httpd/conf/httpd.conf

HostNameLookups Off

Thursday, October 6, 2011

A good read - Dissecting shared libraries

This article "Dissecting shared libraries" from IBM DeveloperWorks is a good read if you wish to have a deeper understanding on shared libraries.

Monday, October 3, 2011

Troubleshooting Blade Management Module connectivity issues

This article is a sub-set of the full document from IBM "Troubleshooting Management Module connectivity issues"



Solution

The Management Module (MM) and the Advanced Management Module (AMM) are the central points of management for the IBM BladeCenter chassis. As such, when the MM is not responsive, the ability to perform normal management on the chassis is significantly compromised. This document covers four different symptoms related to MM connectivity failures: (1) cannot login to the web or telnet interface because of USERID and/or PASSWORD failures, (2) cannot get any network response from the MM, (3) the MM responds to network pings, but either the web interface or telnet interface does not respond, and (4) MM failover does not work.

Throughout this document, "MM" will be used to mean either the MM or AMM. The term AMM will only be used to point out any differences between the two.

When troubleshooting MM connectivity problems, there are a few common procedures that are used in several situations.



Reset the IPaddress of the MM (this procedure does not work on the AMM)

When the MM is restored to its default TCPIP configuration, the Ethernet port on the MM will attempt to get a DHCP address. Disconnect the Ethernet cable if this is not wanted. With the Ethernet cable disconnected, the MM will search for a DHCP server for five minutes, then timeout and take the address 192.168.70.125/255.255.255.0.

Before resetting the MM to its default configuration, have a laptop local to the chassis that can connect to the MM with a cross-over cable (the AMM supports either cable type). Make sure that the laptop is configured with the IPaddress 192.168.70.100/255.255.255.0 so it will not conflict with any address on the chassis. To reset the TCPIP address on the MM, insert a paper-clip into the hole on the back of the MM labeled "IPreset" until it depresses the button inside. Hold it there for just under three seconds, then remove the paper clip. That resets the MM's Ethernet interface to its default configuration.
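
Before walking over to the chassis, the laptop address can be checked against the MM default with Python's ipaddress module. This is just a sketch: the two addresses are the ones quoted above, and the helper name is my own.

```python
import ipaddress

# Default address the MM falls back to, as quoted in the text above.
MM_DEFAULT = ipaddress.ip_interface("192.168.70.125/255.255.255.0")


def laptop_ok(laptop):
    """True if the laptop shares the MM's subnet without reusing its address."""
    iface = ipaddress.ip_interface(laptop)
    return iface.network == MM_DEFAULT.network and iface.ip != MM_DEFAULT.ip


print(laptop_ok("192.168.70.100/255.255.255.0"))
print(laptop_ok("192.168.71.100/255.255.255.0"))
```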



Reset the IPaddress of the AMM using the serial cable

The AMM has a port for ethernet and serial connectivity. The serial port is at the top of the AMM, just above the video connection. To connect to the serial port, insert one end of a straight-through ethernet cable in the AMM serial port. Attach the other end of the cable to the serial dongle whose pinouts are described in the AMM Installation Guide ("Serial connection," near the end of Chapter 3).

The default serial settings for the AMM are 57k, 8 data bits, No parity, 1 stop bit, flow control off. Once connected to the serial console, login as usual. Create a basic config for the external interface with the following commands (system:mm x is either system:mm 1 for the AMM in slot 1 or system:mm 2 for the AMM in slot 2).

use static ip: ifconfig -eth0 -c static -T system:mm x

IPaddress: ifconfig -eth0 -i ip-address -T system:mm x

subnet mask: ifconfig -eth0 -s subnet mask -T system:mm x

gateway: ifconfig -eth0 -g IPaddress of gateway -T system:mm x

They can be combined into one long command as follows:

ifconfig -eth0 -i ip_address -s subnet mask -g IPof gateway -c static -T system:mm x
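
To avoid typos when typing the long form on a serial console, the command string can be assembled and checked first. The sketch below only builds the text of the combined command shown above; the helper name is my own and the gateway in the example is a hypothetical value.

```python
import ipaddress


def amm_ifconfig(ip, netmask, gateway, slot):
    """Build the combined AMM ifconfig command line for the given slot."""
    if slot not in (1, 2):
        raise ValueError("slot must be 1 or 2")
    for value in (ip, netmask, gateway):
        # Raises ValueError if the value is not a valid dotted quad.
        ipaddress.IPv4Address(value)
    return (f"ifconfig -eth0 -i {ip} -s {netmask} -g {gateway} "
            f"-c static -T system:mm {slot}")


# The gateway below is a made-up example address.
print(amm_ifconfig("192.168.70.125", "255.255.255.0", "192.168.70.1", 1))
```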






Reset the MM to its default configuration

One should remember that resetting the MM to defaults turns off the external ports for all four I/O modules, which will cut off all network and fibre connectivity. Therefore, this operation should only be done when the chassis is in a maintenance window and can be off-line for a short period of time. Also, when the MM is restored to its default configuration, it will attempt to get a DHCP address. Disconnect the Ethernet cable if a DHCP address is not wanted. The MM will search for a DHCP server for five minutes, then timeout and take the address 192.168.70.125/255.255.255.0. Before resetting the MM to its default configuration, have a laptop local to the chassis that can connect to the MM with a cross-over cable (the AMM supports either cable type). Make sure that the laptop is configured with the IPaddress 192.168.70.100/255.255.255.0 so it will not conflict with any default address on the chassis.

If the MM is accepting web logins, the default configuration can be restored in the web GUI at:

Select (MM) MM Control, click Restore Defaults, and then click Restore Defaults

Select (AMM) MM Control, click Configuration Mgmt, then click Restore Defaults or click Restore Defaults Preserve Logs

If neither login service is working, the default configuration can be restored by accessing the back of the MM. On the back of the MM, there is a pin hole that is large enough for a paper clip. It is labeled "IPreset." In addition to resetting the IPaddress, pushing a paper clip in for the right amount of time resets the entire MM configuration back to its defaults. To reset the Management Module to the default configuration, including the default login name "USERID" and password "PASSW0RD," push a paper clip into the pin hole until it hits the button inside and hold it. The amount of time required to hold the pin in varies as follows:

MM with 82D firmware or earlier = push in for 5 seconds, then release the pin for 5 seconds, then push it in for another 10 seconds. The timing is quite precise, so make sure a watch with a second hand is available. When the reset starts, the fans will ramp up to full speed, which is clearly audible.

AMM or MM with 82F firmware or later = push in the pin and hold it for 10 seconds. When the reset starts, the fans will ramp up to full speed, which is clearly audible.








Remove and reinsert the MM

Troubleshooting the MM sometimes requires physically removing it from the slot and re-inserting it. Before removing it, note whether the green Ethernet LED or amber LED are lit. In normal operations with an Ethernet cable connected, the Ethernet LED will be on, and the amber LED will be off. The amber LED will come on briefly when the MM is powered on or reset. It is also a good idea to examine the female connectors when the MM is removed to confirm they have not been damaged. When both MMs are removed, the fans ramp up to full speed.

This is clearly audible. When re-inserting the MM, listen to hear if the fans return to the previous noise level. If they do, that indicates that the MM has completed its POST process. If they do not, that indicates that there is some other problem with the chassis that the MM is trying to address. For a visual indication that the MM is working correctly, look at the MM directly.

After the MM is re-inserted and an Ethernet cable connected, confirm the status of the green Ethernet LED and the amber LED. If the amber LED stays on, that indicates a fault in the MM.





Symptom 1: Cannot login due to bad userid or password

If a user makes five unsuccessful login attempts, the MM will stop accepting logins for a period of time. Two minutes is the default lockout time, though this is configurable in the MM interface at MM Control then click Login Profiles.

If login fails through both the web and telnet interface, resetting the MM to the default login of "USERID" and "PASSW0RD" can be accomplished by following the procedure "Reset the MM to its default configuration." The default login ID and password are case sensitive, and in "PASSW0RD" a zero is used for the letter "O."

If USERID/PASSWORD login problems still exist after resetting the MM to defaults, contact IBM support. If the MM does not have network connectivity after resetting defaults, follow the steps below for the appropriate symptom.







Symptom 2: MM does not respond to any network connection
If the MM does not respond to any remote network connection, troubleshooting will need to be done at the chassis. The first step is to find a laptop that can login to other MMs and connect it to the MM with a cross-over cable (either a cross-over or straight through can be used for the AMM). Verify that the IPconfiguration on the laptop puts it in the same subnet as the MM, and verify that the laptop is not running a local firewall. Try to connect to the MM via a web browser, telnet, and ping. Depending on the results you get, take the following steps:

If the laptop has complete access to the MM when connected locally, then the previous connectivity problems are most likely due to network problems on the customer's LAN, or the other workstation the customer used to access this MM.

If the laptop can ping the MM, but cannot connect via web browser or telnet, go to symptom

If the laptop cannot ping the MM, take the following steps to try and restore connectivity.

Clear the arp cache on the laptop.

If the chassis has a redundant MM, fail over to it and attempt to connect to it.

If the chassis only has one MM, move it to the other slot, following the procedure "Remove and reinsert the MM."

Follow the procedure "Reset the IPaddress of the MM" or "Reset the IPaddress for the AMM using the serial cable."

Follow the procedure "Reset the MM to its default configuration"

If these all fail, contact IBM support for assistance.



Symptom 3: Cannot connect to the MM using the web browser/telnet/ssh, but can ping the MM
The MM runs a few network servers that enable users to login and manage the chassis. If basic connectivity via 'ping' is functioning, but one or more of the login services is not working (for example, web server, telnet server), the problem is due to a configuration error or firmware defects. It is never a hardware failure. When the MM will respond to a ping, but any one of the login services does not respond, take the following steps:

Ensure that a supported web browser is being used.

If possible, verify whether the MM is running the network servers on their default network ports. If all logins fail, check with the administrator for the BladeCenter. If it is possible to login to the web interface, select MM Control and click Port Assignments. There is no way to get that information in the telnet interface.

Verify whether this workstation can connect to other MMs. If it cannot, the problem is most likely due to a firewall running on the client workstation or the network. Shutdown any firewalls on the client machine and try again. If the client still has problems connecting to multiple MMs, consult the network administrator for the LAN.

Restart the MM. If the MM responds to network logins after it has been restarted, this is most likely an MM firmware defect. Download the changelog for the current MM or AMM code and see if any similar issues have been resolved. If not, contact IBM Support for additional assistance.

At this point, troubleshooting must continue with a laptop or other workstation local to the MM. Find a laptop which can connect to other MMs, and connect it directly to the MM with a crossover cable (both cross-over and straight-through work for the AMM). Verify that the Ethernet link is up and the laptop is configured so it is on the same subnet as the MM. If the laptop can ping the MM, attempt to login to the MM with a supported web browser. If that works, contact the network administrator for assistance troubleshooting the network.

If the MM still does not allow logins over the WEB interface at this point, restore the MM to its defaults with the procedure "Reset the MM to its default configuration." If this does not restore connectivity, contact IBM Support.



Symptom 4: Failover of MM to redundant MM does not work

When there are two MMs in a chassis, one MM is active and the second MM is on standby. When a user initiates a failover from the primary to redundant, the primary sends a message to redundant to become the primary, then reboots itself. On rare occasions, this does not work. When it does not, take the following steps to resolve it:

Examine the MM Event log and the MM BIST log to see if any errors have been detected.

Physically remove the primary MM and see if the redundant MM boots successfully. If it does not, move it to the other slot and see if it can boot in that slot

Reset the MM to defaults using the procedure "Reset the MM to its default configuration."

Repeat the failover process with both MMs. If it still does not work, contact IBM Support.

Wednesday, September 28, 2011

Getting `GLIBCXX_3.4.9 not found' when starting the license manager for MATLAB

When starting the latest version of the MATLAB license server, you may encounter the error `GLIBCXX_3.4.9 not found'.

For more information, see the MathWorks article "Why do I get an error 'GLIBCXX_3.4.9 not found'"

According to the MathWorks website:



Solution:

This issue is caused by a missing or outdated libstdc++.so.6 as required by the keycheck application (R2011a) or the MLM vendor daemon (R2011b). Both the R2011a keycheck and R2011b MLM vendor daemon require libstdc++.so.6.0.10. Refer to your operating system documentation for information on how to update or install a missing library.

If the necessary version of the library is not available for your Linux distribution it can be copied and installed from the MATLAB installation files following the instructions below:

NOTE: $MATLAB refers to the MATLAB installation location (ex: /usr/local/MATLAB/R2011b)
NOTE: $ARCH refers to the machine architecture (ex: glnx86 for Linux 32-bit or glnxa64 for Linux 64-bit)

If MATLAB is installed in addition to the FlexNet license manager, skip directly to step 3.

1. Create a subdirectory within the MATLAB installation folder as shown below:

[root@localhost ~] mkdir -p $MATLAB/sys/os/$ARCH


2. Copy the libstdc++.so.6.0.10 library from the MATLAB installation files (either an installation DVD or the extracted downloaded installer archive) into the newly created directory:

[root@localhost ~] cp /media/MATLAB_R2011b/bin/$ARCH/libstdc++.so.6.0.10 $MATLAB/sys/os/$ARCH


3. Run 'ldconfig' to create symbolic links to the new library and update the dynamic linker cache:

[root@localhost ~] ldconfig $MATLAB/sys/os/$ARCH
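
To confirm whether a given libstdc++.so.6 actually provides GLIBCXX_3.4.9, you can list the version strings embedded in the library. This is a minimal sketch and not part of the MathWorks instructions: the function names are hypothetical, the library path in the usage comment is only an example, and it assumes a grep that supports the -a and -o options (GNU grep does).

```shell
# List the GLIBCXX version strings embedded in a libstdc++ shared object.
list_glibcxx_versions() {
    grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Return 0 if the library advertises exactly the requested version string.
has_glibcxx_version() {
    list_glibcxx_versions "$1" | grep -q -x -F "$2"
}

# Usage (example path; it differs between distributions):
#   has_glibcxx_version /usr/lib64/libstdc++.so.6 GLIBCXX_3.4.9 \
#       && echo "GLIBCXX_3.4.9 is available"
```

Running the check before and after the steps above shows whether the copied library is the one that supplies the missing version.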

World's largest Cloud Storage System designed by SDSC



The San Diego Supercomputer Center (SDSC) has designed the world's largest cloud storage system specifically targeted at academic and research use. Dubbed the SDSC Cloud, it starts with a raw capacity of 5.5 PB and is scalable to "Hundreds of Petabytes". Rates start at US$3.46 a month for 100 GB of storage.

For more information, see
  1. San Diego Supercomputer Center launches world's largest academic cloud storage system 
  2. SDSC Cloud Storage Services

Monday, September 26, 2011

Understanding VXLAN Virtual-Physical-Cloud L2/L3 Networks by ARISTA

An interesting article from Arista:

Understanding VXLAN Virtual-Physical-Cloud L2/L3 Networks by ARISTA (pdf)

Video interview - VMware CTO Steve Herrod and Arista Founder, CDO, Chairman Andy Bechtolsheim

  1. Introduction to the CTO Video Series with Steve Herrod
  2. Part I: The State of Cloud Computing: Applications as a Service
  3. Part II: The Semantics of Cloud
  4. Part III: Moore's Law and its Impact on Software Infrastructure and Network Capacity
  5. Part IV: Storage: From Mechanics to Silicon and Network Scalability
  6. Part V: Andy's Current Software Focus at Arista
  7. Part VI: Power Efficiency and Chip Design
  8. Part VII: Power and the Datacenter: The Impact of Software Improvements
  9. Part VIII: Private vs Public Cloud: Transparency and Economics
  10. Part IX: Security and the Cloud
  11. Part X: Arista and the Importance of Low Latency
  12. Part XI: Today's Financial Trading Model
  13. Part XII: What Technologies are Interesting to Andy?
  14. Part XIII: Who Does Andy Admire: Einstein vs The Hardy Boys
  15. Part XIV: The Importance of Science Education
  16. Part XV: Wrap-up

Server has no node list when executing pbsnodes -s

If you see the error "Server has no node list" when you execute "pbsnodes -s 192.168.1.1" (substitute your TORQUE server's DNS name or IP address), the cause is a missing "nodes" file, which should be located in /var/spool/torque/server_priv/

The nodes file should look something like this:
## This is the TORQUE server "nodes" file.
##
## To add a node, enter its hostname, optional processor count (np=),
## and optional feature names.
##
## Example:
##    host01 np=8 featureA featureB
##    host02 np=8 featureA featureB
##
## for more information, please visit:
##
## http://www.clusterresources.com/torquedocs/nodeconfig.shtml

compute-c00     np=12
compute-c01     np=12
compute-c02     np=12
compute-c03     np=12
compute-c04     np=12

Once the nodes file is in place, restart the pbs_server and pbs_sched services
# service pbs_server restart
# service pbs_sched restart
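
Before restarting, it can help to confirm that the nodes file exists and contains node entries. The following is a minimal sketch under stated assumptions: the function name check_nodes_file is hypothetical, and the default path is the location mentioned above, which may differ on your installation.

```shell
# Report whether the TORQUE nodes file exists and how many node entries
# (non-blank, non-comment lines) it contains.
check_nodes_file() {
    nodes_file=${1:-/var/spool/torque/server_priv/nodes}
    if [ ! -f "$nodes_file" ]; then
        echo "missing: $nodes_file"
        return 1
    fi
    count=$(grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$nodes_file")
    echo "found $count node entries in $nodes_file"
}

# Usage:
#   check_nodes_file                 # checks the default location
#   check_nodes_file /path/to/nodes  # checks another location
```

After the services have restarted, "pbsnodes -a" should list the nodes defined in the file.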