Tuesday, December 30, 2014

Forcibly kill or purge the Job in the Torque Scheduler

When a job is stuck and cannot be removed by a normal qdel, you can use the command qdel -p jobid. Do note that this command should only be used when there is no other way to kill off the job in the usual fashion, especially if the compute node is unresponsive.

# qdel -p jobID
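
For example, assuming the stuck job has the hypothetical ID 1234, a sensible sequence is to check the job state, attempt a normal delete first, and only purge it as a last resort:

# qstat 1234
# qdel 1234
# qdel -p 1234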

References:
  1. [torqueusers] qdel will not delete

Thursday, December 25, 2014

Checking for Torque Server Version Number

To check the Torque version number, issue the command
# qstat --version
Version: 4.2.7
Commit: xxxxxxxxxxxxxxxxxxxxxx
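
If the pbs_server daemon runs on the same machine, the server's version can usually also be read from its read-only pbs_version attribute via qmgr (a small check, assuming you have qmgr access; the output shown is illustrative):

# qmgr -c 'list server' | grep pbs_version
  pbs_version = 4.2.7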

cannot change directory to /home/user1. Permission denied on NFS mount

If you encounter the "cannot change directory to /home/user1. Permission denied" error on an NFS mount when you do a su --login user1, do check the permissions of the base directory. If the owner and group of /home are root.root, remember to chmod it to 755.

# ls -ld /home
drwx------ 7 root root 8192 Dec 22 15:13 /home

Change the permissions with
# chmod 755 /home
# ls -ld /home
drwxr-xr-x 7 root root 8192 Dec 22 15:13 /home
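
After fixing the permissions, the login should go through (assuming user1's own home directory permissions are otherwise correct):

# su --login user1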

Tuesday, December 23, 2014

Displaying SPICE on the VM network for RHEV 3.4

For more information, do take a look at my blog Displaying SPICE on the VM network for RHEV 3.4

The key issue is that after selecting the network to house the "Display Network", do remember to restart all the VMs so the change takes effect.

Monday, December 22, 2014

Using log collector in RHEV 3.3 and above to collect full log

The Log Collector Utility for RHEV 3 is located in /usr/bin/rhevm-log-collector and is provided by the rhevm-log-collector package installed on the RHEV Manager system; in the examples below it is invoked under its engine-log-collector name.

1. To collect all the information, use the command
# engine-log-collector
INFO: Gathering oVirt Engine information...
INFO: Gathering PostgreSQL the oVirt Engine database and log files from localhost...
Please provide the REST API password for the admin@internal oVirt Engine user (CTRL+D to skip):
About to collect information from 1 hypervisors. Continue? (Y/n): y
INFO: Gathering information from selected hypervisors...
INFO: collecting information from 192.168.50.56
INFO: finished collecting information from 192.168.50.56
Creating compressed archive...

2. To collect information from selected hosts with addresses ending in .11 and .15
# engine-log-collector --hosts=*.11,*.15

3. To collect information from the RHEV-M only
# engine-log-collector --no-hypervisors
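
To preview which hypervisors the tool would contact without collecting anything, the list action can be used (assuming the same version of the tool; it may still prompt for the admin@internal password):

# engine-log-collector list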

References:
  1. https://access.redhat.com/solutions/61546

Friday, December 19, 2014

Intel NIC driver causing multicast flooding (intermittent wired network disconnection)

Symptom:
The symptoms can range from random disconnections to slowness across the entire school/building wired network. Eventually, the cause of this problem was found to be Intel-chipset NICs (Intel I2xx/825xx series) sending out erratic and massive multicast traffic, flooding the network and driving up CPU load on the switches. The links in the references below describe the same problem in other environments.


Resolution:
The recommended step to resolve this problem is to upgrade the Intel NIC driver to version 19.0 or above.
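
On a Linux client, you can confirm which driver and version an interface is currently using with ethtool before and after the upgrade (a quick check, assuming the interface is called eth0; the I217/I218 chips typically use the e1000e driver):

# ethtool -i eth0
driver: e1000e
version: ...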


References:
  1. IPv6 multicast flood during sleep from i217-LM
  2. ICMPv6 'Multicast Listener Report' messages are flooding the local network 
  3. ICMPv6 'Multicast Listener Report' messages flooding the local network

Monday, December 15, 2014

NetApp SteelStore Cloud Integrated Storage Appliance

Taken from NetApp SteelStore

Quick Overview

Use the NetApp® SteelStore™ cloud integrated storage appliance to leverage public and private cloud as part of your backup and archive strategy.

Features (from the website) include:
  • Integrates with all leading backup solutions and all major public and private cloud providers.
  • Offers complete, end-to-end security for data at rest and in flight using FIPS 140-2 certified encryption.
  • Uses efficient, variable-length inline deduplication and compression, reducing storage costs up to 90%.
  • Delivers fast, intelligent local backup and recovery based on local storage.
  • Vaults older versions to the cloud, allowing for rapid restores with offsite protection.
  • Supports policy-based data lifecycle management.
  • Scales to an effective capacity of 28PB per appliance.

Red Hat Atomic and Containers

Red Hat and Containers

Articles
  1. Small footprint, big impact: Red Hat Enterprise Linux 7 Atomic Host Beta now available
  2. Splitting the Atom: Recapping the First Atomic Application Forum
  3. Containers – There’s No Going It Alone
Atomic Video
  1.  Red Hat Enterprise Linux 7 Atomic Host & Containers
Blog About Atomic Performance
  1.  Performance Testing Red Hat Enterprise Linux 7 Atomic Host Beta on Amazon EC2

Thursday, December 11, 2014

Encountering "Write Failed: Broken Pipe" during SSH connection on CentOS

I encountered the error "Write Failed: Broken Pipe" during an SSH connection. From what I know, it is caused by the Linux server severing connections that have been idle for too long.

To solve the issue, you can do the following

1. On the Linux server side, you can configure
# vim /etc/ssh/sshd_config

.....
ClientAliveInterval 60
.....
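
After editing sshd_config, reload or restart the sshd service so the new ClientAliveInterval takes effect (on CentOS 6, assuming the stock init script):

# service sshd reload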


2. On the client side
# vim ~/.ssh/config 

.....
ServerAliveInterval 60
.....
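
If the keepalive should only apply to particular servers, the same option can be scoped under a Host stanza instead of being set globally (myserver.example.com is just a placeholder):

Host myserver.example.com
    ServerAliveInterval 60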

Tuesday, December 9, 2014

LRZ HPC Cluster Storage Systems

I thought I would pen down how the LRZ HPC cluster storage systems use NetApp and the GPFS file system to support their computing needs.

Storage Systems
SuperMUC has a powerful I/O-Subsystem which helps to process large amounts of data generated by simulations.

Home file systems
Permanent storage for data and programs is provided by a 16-node NAS cluster from Netapp. This primary cluster has a capacity of 2 Petabytes and has demonstrated an aggregated throughput of more than 10 GB/s using NFSv3. Netapp's Ontap 8 "Cluster-mode" provides a single namespace for several hundred project volumes on the system. Users can access multiple snapshots of data in their home directories.

Data is regularly replicated to a separate 4-node Netapp cluster with another 2 PB of storage for recovery purposes. Replication uses Snapmirror-technology and runs with up to 2 GB/s in this setup.

Storage hardware consists of >3400 SATA disks of 2 TB each, protected by double-parity RAID and integrated checksums.


Work and Scratch areas
For highest-performance checkpoint I/O, IBM's General Parallel File System (GPFS) with 10 PB of capacity and an aggregated throughput of 200 GB/s is available. Disk storage subsystems were built by DDN.


References:
  1.  SuperMUC Petascale System

Monday, December 8, 2014

Changing user and group ownership

I usually use the following command to change the ownership of a file or directory:

# chown username.usergroups myfile

But sometimes the user name itself contains a fullstop. In that case, use the colon as the separator so that chown can tell the user and group apart:
# chown user.name:usergroups myfile
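
The colon syntax also works together with the recursive flag when a whole directory tree needs its ownership changed (user.name, usergroups and the path are placeholders):

# chown -R user.name:usergroups /home/user.name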

Thursday, December 4, 2014

Yum giving "Cannot retrieve metalink for repository: epel" Error for CentOS 6

When I was doing a yum install or a yum update on CentOS 6.4, I received the error "Cannot retrieve metalink for repository: epel".

The error is caused by the mirrorlist entries in epel.repo pointing to https instead of http. If you amend them to http, the EPEL repo will work.

In /etc/yum.repos.d/epel.repo, change the mirrorlist lines to

[epel]
.....
.....
mirrorlist=http://mirrors.fedoraproject.org/metalink?repo=epel-6&arch=$basearch

[epel-debuginfo]
.....
.....
mirrorlist=http://mirrors.fedoraproject.org/metalink?repo=epel-debug-6&arch=$basearch
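
If you prefer, the same edit can be scripted with a one-line sed across both sections (a sketch, assuming the stock epel.repo layout shown above), followed by clearing the yum cache:

# sed -i 's|^mirrorlist=https|mirrorlist=http|' /etc/yum.repos.d/epel.repo
# yum clean all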

References:
  1. CentOS 6.3 Instance Giving "Cannot retrieve metalink for repository: epel" Error