Tuesday, February 26, 2013

Finding the user who used excessive memory and crashed the server

Using Linux tools, we can quickly work out which users crashed the system. For me, one of the harder cases to trace is researchers who run jobs with large in-memory arrays. If the array size is not calculated correctly, the job can consume more memory than the server can provide, physical and swap combined, and the system will crash.

One of the fastest ways is to check which users were logged in when the system crashed, and then cross-check against /var/log/messages. I simply issue the command:

# last

reboot   system boot  2.6.18-238.9.1.e Mon Feb 25 08:42         (1+01:12)
user1    pts/30       :3.0             Sun Feb 24 15:19 - crash  (17:22)
user1    pts/36       :3.0             Sun Feb 24 14:26 - crash  (18:15)
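
On a busy server the output of last can be long, so it may help to filter it down to the sessions that ended in a crash. The following is a minimal sketch, not part of the original procedure: the function name crashed_users is hypothetical, and it assumes the user name is the first column and that crashed sessions are marked with "- crash" as in the output above.

```shell
# Count sessions that ended in "crash" per user.
# Reads `last`-style output on stdin so it also works on saved output.
# crashed_users is a hypothetical helper name, not a standard tool.
crashed_users() {
    awk '$1 != "reboot" && / - crash / { count[$1]++ }
         END { for (u in count) print count[u], u }' | sort -rn
}

# Typical use on a live system:
# last | crashed_users
```

Each output line is a session count followed by a user name, with the most frequent user first. A user appearing here was only logged in at the time of the crash; the log file is still needed to confirm the cause.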

If you also take a look at the log file, you can see that the crash occurred after the last user login.

Feb 24 14:25:19 Head-Node gconfd (user1-29338): Resolved address "xml:readwrite:/home/user1/.gconf" to a writable configuration source at position 0
Feb 24 16:15:11 Head-Node mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=2739215:   Reason code 668 Failure Reason Lost membership in cluster nsd1-nas. Unmounting file systems.
Feb 24 16:15:11 Head-Node mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=2739215:
Feb 24 16:18:08 Head-Node kernel: TCP: time wait bucket table overflow
Feb 24 16:18:08 Head-Node kernel: mmfsd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Feb 24 16:18:08 Head-Node kernel:
Feb 24 16:18:08 Head-Node kernel: Call Trace:
Feb 24 16:18:08 Head-Node kernel:  [] out_of_memory+0x8e/0x2f3
Feb 24 16:18:08 Head-Node kernel:  [] __wake_up+0x38/0x4f
Feb 24 16:18:08 Head-Node kernel:  [] autoremove_wake_function+0x0/0x2e
Feb 24 16:18:08 Head-Node kernel:  [] __alloc_pages+0x27f/0x308
Feb 24 16:18:08 Head-Node kernel:  [] read_swap_cache_async+0x45/0xd8
Feb 24 16:18:08 Head-Node kernel:  [] swapin_readahead+0x60/0xd3
Feb 24 16:18:08 Head-Node kernel:  [] __handle_mm_fault+0xb62/0x1039
Feb 24 16:18:15 Head-Node kernel:  [] thread_return+0x62/0xfe
Feb 24 16:18:28 Head-Node kernel:  [] do_page_fault+0x4cb/0x874
Feb 24 16:18:53 Head-Node kernel:  [] hrtimer_cancel+0xc/0x16
Feb 24 16:19:16 Head-Node kernel:  [] do_nanosleep+0x47/0x70
Feb 24 16:19:29 Head-Node kernel:  [] hrtimer_nanosleep+0x58/0x118
Feb 24 16:19:33 Head-Node kernel:  [] error_exit+0x0/0x84
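
To jump straight to the out-of-memory entries instead of scrolling through the whole log, a grep along these lines can be used. This is a sketch under assumptions: the function name oom_events is hypothetical, and the "Out of memory" pattern is an assumed wording that can vary between kernel versions; only "invoked oom-killer" is taken from the log above.

```shell
# Print log lines that mention the OOM killer.
# With no arguments it reads stdin; otherwise it reads the named files.
# oom_events is a hypothetical helper name, not a standard tool.
oom_events() {
    grep -iE 'invoked oom-killer|out of memory' "$@"
}

# Typical use (reading /var/log/messages usually requires root):
# oom_events /var/log/messages
```

Note that the process named before "invoked oom-killer" (mmfsd in the log above) is the one whose allocation triggered the OOM killer, which is not necessarily the process that used up the memory. Comparing the timestamps with the last output is what ties the event back to a user.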
 

To manage memory more effectively, take a look at:
  1. Tweaking the Linux Kernel to manage memory and swap usage  
