milearning: May 2016

Monday 23 May 2016

NFS common errors and troubleshooting - Linux/Unix

These are the NFS errors that Linux/Unix system administrators run into most often. I have collected them in one place, together with the checks that usually resolve them. Hope this helps.

Environment: Linux/Unix

Error: "Server Not Responding"

Check that both the NFS server and the client are online and that the RPC services on them are functional.

Use ping and traceroute to check that they can reach each other. If they cannot, check the IP configuration with ip addr and the NIC link state with ethtool.

Heavy server or network load can make RPC replies time out, which produces this message. Try increasing the timeo mount option.
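A minimal sketch of the timeout adjustment, assuming a Linux client. The helper name `nfs_mount_cmd`, the server name and the paths are hypothetical; `timeo` and `retrans` are standard NFS mount options. The function only prints the command so it can be reviewed before it is run as root.

```shell
#!/bin/sh
# Print (not run) a mount command with a longer RPC timeout.
# timeo is expressed in tenths of a second; retrans is the number of retries.
# nfs_mount_cmd is a hypothetical helper name used only for this sketch.
nfs_mount_cmd() {
    server=$1 export_path=$2 mount_point=$3
    timeo=${4:-600} retrans=${5:-5}
    printf 'mount -t nfs -o timeo=%s,retrans=%s %s:%s %s\n' \
        "$timeo" "$retrans" "$server" "$export_path" "$mount_point"
}
```

For example, `nfs_mount_cmd nfsserver /export /nfsmount` prints a command that waits 60 seconds per attempt and retries 5 times.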

Error: "rpc mount export: RPC: Timed out "

The NFS server or the client was unable to resolve the other's host name. Check that both forward and reverse name resolution work.

Check your DNS servers or /etc/hosts.
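The forward and reverse check can be sketched with getent, which consults /etc/hosts and DNS in the order set in /etc/nsswitch.conf. The helper name `check_dns` is hypothetical, and the sketch only looks at the first answer returned.

```shell
#!/bin/sh
# Check that a host name resolves to an address and that the address
# resolves back to a name. check_dns is a hypothetical helper name.
check_dns() {
    name=$1
    # Forward lookup: first field of the first answer is the address.
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "forward lookup failed: $name"
        return 1
    fi
    # Reverse lookup: second field of the first answer is the canonical name.
    back=$(getent hosts "$addr" | awk 'NR == 1 { print $2 }')
    if [ -z "$back" ]; then
        echo "reverse lookup failed: $addr"
        return 1
    fi
    echo "ok: $name -> $addr -> $back"
}
```

Run it on the client against the server name and on the server against the client name; a mismatch between the name given and the name returned is worth investigating.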

Error: "Access Denied" or "Permission Denied"

Check the export permissions for the NFS file systems.

#showmount -e nfsserver ==> run on the client; lists what the server exports

#exportfs -a ==> run on the server; re-exports everything in /etc/exports

Check that /etc/exports has no syntax errors (e.g. stray spaces, wrong permissions, typos).
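One stray space is especially easy to miss: in "/export client (rw)" the options are no longer attached to "client" and apply to all hosts instead. A small sketch that flags such lines, assuming the usual /etc/exports layout; `lint_exports` is a hypothetical helper name.

```shell
#!/bin/sh
# Flag /etc/exports lines where whitespace separates a client from its
# option list, e.g. "/export client (rw)". lint_exports is a hypothetical
# helper; it skips comment lines and prints each suspicious line.
lint_exports() {
    file=${1:-/etc/exports}
    awk '!/^[ \t]*#/ && /[ \t]\(/ {
        printf "line %d: space before \"(\": %s\n", NR, $0
    }' "$file"
}
```

No output means no line matched this one pattern; it is not a full syntax check.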

Error: "RPC: Port mapper failure - RPC: Unable to receive"

NFS requires the portmapper service (rpcbind on newer distributions) to be running on both the client and the server, and the NFS service to be running on the server.

#rpcinfo -p

#/etc/init.d/portmap status

If it is not running, start the portmap service.
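The rpcinfo output can be checked mechanically. The sketch below reads `rpcinfo -p` output on stdin and reports which of portmapper, nfs and mountd are not registered; the helper name `missing_rpc_services` is hypothetical, and the list of three services is an assumption that fits a typical NFSv3 server.

```shell
#!/bin/sh
# Read `rpcinfo -p` output on stdin and report which of the services an
# NFSv3 mount typically needs are not registered.
# missing_rpc_services is a hypothetical helper name.
missing_rpc_services() {
    awk '
        NR > 1 { seen[$5] = 1 }   # column 5 is the service name
        END {
            n = split("portmapper nfs mountd", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print "missing: " want[i]
        }'
}
# Typical use on a live system: rpcinfo -p nfsserver | missing_rpc_services
```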

Error: "NFS Stale File Handle"

An application accesses an NFS file the same way it accesses a local file: the open system call returns a file descriptor, or handle, which the program then uses in its I/O calls to identify the file.

When an NFS share is unshared, or the NFS server changes the file handle, any NFS client that attempts further I/O on the share receives the 'NFS Stale File Handle' error.

On the client:

Run umount -f /nfsmount and then remount the share.

If the unmount fails, kill the processes that are using /nfsmount and try again.

If neither option works, reboot the client to clear the stale NFS handle.
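The recovery steps can be written down as an ordered plan. The sketch below only prints the commands, so nothing is unmounted or killed by accident. The helper name `stale_mount_plan` is hypothetical, the lazy unmount (umount -l) is an addition not mentioned above, and the final mount assumes the share has an /etc/fstab entry.

```shell
#!/bin/sh
# Print, in order, the commands usually tried to clear a stale NFS mount.
# Nothing is executed; review the list and run the steps by hand as root.
# stale_mount_plan is a hypothetical helper name.
stale_mount_plan() {
    mp=$1
    # 1. list the processes holding the mount
    # 2. kill those processes
    # 3. force the unmount
    # 4. fall back to a lazy unmount if the forced one fails
    # 5. remount (assumes an /etc/fstab entry for the mount point)
    printf '%s\n' \
        "fuser -vm $mp" \
        "fuser -km $mp" \
        "umount -f $mp" \
        "umount -l $mp" \
        "mount $mp"
}
```

Calling `stale_mount_plan /nfsmount` gives a checklist to work through before resorting to a reboot.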

Error: "No route to host"

This can be reported when the client attempts to mount the NFS file system, even though the client can ping the server successfully.

It can happen when RPC messages are filtered by the server firewall, the client firewall or a network switch. Verify the firewall rules.

Stop iptables temporarily and check whether port 2049 is reachable.
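Before stopping the firewall altogether, it can help to test the ports directly from the client. This is a rough sketch that assumes bash and the coreutils timeout command are installed; `port_state` is a hypothetical helper name, and "unreachable" covers refused, filtered and timed-out connections alike.

```shell
#!/bin/sh
# Report whether a TCP port accepts connections, using bash's /dev/tcp
# pseudo-device with a 3 second limit. port_state is a hypothetical helper.
port_state() {
    host=$1 port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo open
    else
        echo unreachable
    fi
}
# Typical use: port_state nfsserver 2049; port_state nfsserver 111
```

If 2049 or 111 is unreachable while ping works, list the rules on both ends with iptables -L -n and look for a matching REJECT or DROP entry.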

Hope this helps anyone who works with NFS regularly. These are the errors I have come across most often in my experience.

Thanks for sharing !

Sunday 15 May 2016

CentOS/RHEL 7 kernel dump & debug

Applies : CentOS / RHEL / OEL 7

Arch : x86_64

When kdump is enabled, a small amount of memory is reserved at boot for a second kernel. If the system crashes, kexec boots into this second kernel, whose only purpose is to capture the core dump image (vmcore) of the crashed kernel. Analyzing the core dump helps significantly in determining the exact cause of the system failure.

Configuring kdump :

The kdump service comes with the kexec-tools package, which needs to be installed:

#yum install kexec-tools

Set the amount of memory reserved for kdump with the crashkernel=<size> parameter in GRUB_CMDLINE_LINUX:

# cat /etc/default/grub

GRUB_TIMEOUT=5

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"

GRUB_DEFAULT=saved

GRUB_DISABLE_SUBMENU=true

GRUB_TERMINAL_OUTPUT="console"

GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet"

GRUB_DISABLE_RECOVERY="true"

Regenerate the GRUB configuration and reboot for the kernel parameter to take effect:

# grub2-mkconfig -o /boot/grub2/grub.cfg

Generating grub configuration file ...

Found linux image: /boot/vmlinuz-3.10.0-123.el7.x86_64

Found initrd image: /boot/initramfs-3.10.0-123.el7.x86_64.img

Warning: Please don't use old title `CentOS Linux, with Linux 3.10.0-123.el7.x86_64' for GRUB_DEFAULT, use `Advanced options for CentOS Linux>CentOS Linux, with Linux 3.10.0-123.el7.x86_64' (for versions before 2.00) or `gnulinux-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254>gnulinux-3.10.0-123.el7.x86_64-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254' (for 2.00 or later)

Found linux image: /boot/vmlinuz-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1

Found initrd image: /boot/initramfs-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1.img

done
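After the reboot it is worth confirming that the parameter reached the running kernel. A small sketch that extracts the crashkernel value from a kernel command line; `crashkernel_value` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the value of crashkernel= from a kernel command line read on stdin,
# or nothing if the parameter is absent.
# crashkernel_value is a hypothetical helper name.
crashkernel_value() {
    tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}
# After the reboot: crashkernel_value < /proc/cmdline
```

An empty result means the kernel was booted without the reservation, so kdump will not be able to load a capture kernel.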

Modify the kdump settings in /etc/kdump.conf

By default the vmcore is stored under /var/crash. To dump to a different partition or disk, or to an NFS share, define the target here.

ext3 /dev/sdd1

nfs nfs.yourdomain.com:/export/dump

Compress the vmcore file to reduce its size:

core_collector makedumpfile -c

The default directive sets the action taken when kdump fails to save the dump to the configured target. To reboot the system in that case:

default reboot
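Once the file is edited, listing only the active directives makes it easy to see what kdump will actually use. A minimal sketch; `kdump_directives` is a hypothetical helper name, and the path argument defaults to /etc/kdump.conf.

```shell
#!/bin/sh
# Print the active directives of a kdump configuration file, skipping
# comment lines and blank lines.
# kdump_directives is a hypothetical helper name.
kdump_directives() {
    file=${1:-/etc/kdump.conf}
    grep -Ev '^[[:space:]]*(#|$)' "$file"
}
```

Only one dump target should normally appear in the output; if several targets are listed, review them before restarting the service.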

Start your kdump:

# cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-3.10.0-123.el7.x86_64 root=UUID=1a06e03f-ad9b-44bf-a972-3a821fca1254 ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet

# grep -v '#' /etc/sysconfig/kdump | sed '/^$/d'

KDUMP_KERNELVER=""

KDUMP_COMMANDLINE=""

KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug"

KEXEC_ARGS=""

KDUMP_BOOTDIR="/boot"

KDUMP_IMG="vmlinuz"

KDUMP_IMG_EXT=""

# systemctl enable kdump.service

# systemctl start kdump.service

# systemctl is-active kdump

active
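An active service is a good sign, but the kernel also reports directly whether a capture kernel is loaded, as 0 or 1 in /sys/kernel/kexec_crash_loaded. A small sketch around that file; `capture_kernel_loaded` is a hypothetical helper name, and the optional path argument exists only so the function can be tried against a test file.

```shell
#!/bin/sh
# Report whether a capture kernel is loaded, based on the 0/1 flag the
# kernel exposes in /sys/kernel/kexec_crash_loaded.
# capture_kernel_loaded is a hypothetical helper name.
capture_kernel_loaded() {
    flag=${1:-/sys/kernel/kexec_crash_loaded}
    if [ "$(cat "$flag" 2>/dev/null)" = "1" ]; then
        echo "capture kernel loaded"
    else
        echo "capture kernel NOT loaded"
        return 1
    fi
}
```

If it reports that nothing is loaded, triggering a test crash will simply hang or reboot the machine without producing a vmcore.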

Test your configuration

# echo 1 > /proc/sys/kernel/sysrq

# echo c > /proc/sysrq-trigger

After the crash the system reboots and the vmcore is saved. To analyse it, install the crash utility and the debug kernel packages.

#yum install crash

I downloaded the debuginfo packages from https://oss.oracle.com/ol7/debuginfo/. Check your kernel version (uname -r) and download the matching debuginfo version.

#rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-123.el7.x86_64.rpm \

kernel-debuginfo-3.10.0-123.el7.x86_64.rpm \

kernel-debug-debuginfo-3.10.0-123.el7.x86_64.rpm

# ls -lh /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore

-rw-------. 1 root root 168M May 15 04:51 /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore
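Each crash gets its own timestamped directory, so after a few test runs it helps to pick the newest dump automatically. A minimal sketch, assuming the default layout of one vmcore per subdirectory; `latest_vmcore` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the most recently modified vmcore under a crash directory, or
# nothing if no dump exists. latest_vmcore is a hypothetical helper name;
# /var/crash is used when no directory is given.
latest_vmcore() {
    dir=${1:-/var/crash}
    ls -t "$dir"/*/vmcore 2>/dev/null | head -n 1
}
```

The result can be passed straight to crash, for example as "$(latest_vmcore)".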

# crash /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

WARNING: kernel version inconsistency between vmlinux and dumpfile

KERNEL: /usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux

DUMPFILE: /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore

CPUS: 1

DATE: Sun May 15 04:50:38 2016

UPTIME: 00:10:24

LOAD AVERAGE: 0.02, 0.07, 0.05

TASKS: 104

NODENAME: slnxcen01

RELEASE: 3.10.0-123.el7.x86_64

VERSION: #1 SMP Mon Jun 30 12:09:22 UTC 2014

MACHINE: x86_64 (2294 Mhz)

MEMORY: 1.4 GB

PANIC: "Oops: 0002 [#1] SMP " (check log for details)

PID: 2266

COMMAND: "bash"

TASK: ffff880055650b60 [THREAD_INFO: ffff880053fb2000]

CPU: 0

STATE: TASK_RUNNING (PANIC)

crash>

crash> bt

PID: 2266 TASK: ffff880055650b60 CPU: 0 COMMAND: "bash"

#0 [ffff880053fb3a98] machine_kexec at ffffffff81041181

#1 [ffff880053fb3af0] crash_kexec at ffffffff810cf0e2

#2 [ffff880053fb3bc0] oops_end at ffffffff815ea548

crash> files

PID: 2266 TASK: ffff880055650b60 CPU: 0 COMMAND: "bash"

ROOT: / CWD: /root

FD FILE DENTRY INODE TYPE PATH

0 ffff880053c47a00 ffff8800563383c0 ffff880055bad2f0 CHR /dev/tty1

1 ffff8800542a9100 ffff88004dd4ff00 ffff88004dc0b750 REG /proc/sysrq-trigger
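The same commands can be replayed without typing them at the crash> prompt, since crash reads commands from a file given with its -i option. A small sketch that writes such a file; `write_crash_cmds` and the file names are hypothetical, and log and ps are two further crash commands added here for illustration.

```shell
#!/bin/sh
# Write a crash command file so the same triage can be replayed on any
# dump. bt, log, ps and files are crash commands; the final exit ends the
# session. write_crash_cmds is a hypothetical helper name.
write_crash_cmds() {
    out=$1
    printf '%s\n' 'bt' 'log' 'ps' 'files' 'exit' > "$out"
}
# Typical use (VMLINUX and VMCORE stand for the paths used in the session above):
#   write_crash_cmds /tmp/triage.cmds
#   crash -i /tmp/triage.cmds VMLINUX VMCORE > /tmp/triage-report.txt
```

Keeping the report next to the vmcore makes it easier to compare crashes later.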

That will conclude the article.

References :

1. http://sunlnx.blogspot.in/2013/07/kernel-crash-reportcrash-dump-analysis.html

2. http://people.redhat.com/anderson/crash_whitepaper/
