Monday 9 December 2013

Server Hardware’s Fault ! - diagnosis & troubleshooting

I will cover some of the common hardware failure you might run into, along with the steps to troubleshoot and confirm them. 

Environment: CentOS_6.3 (32-bit)

Few of the more common pieces of hardware that fail are 
1. Hard drive(HDD).
2. RAM. 
3. Network card failures.
4. Server's temperature.

1. HDD

Hard drive manufacturers all have their own hard drive testing tools, modern hard drives should also support SMART(Self-Monitoring, Analysis and Reporting Technology; often written as SMART), which monitors the overall health of the hard drive and can alert you when the drive will fail soon. 

Install package smarttools, pass the -H option to smartctl to check the health of a drive:


# smartctl -H /dev/sda
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)

Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK 

The test has been passed without any errors. If it would be there then it would have displayed the WARNING/ERROR 

You can also pull much more information about a hard drive using smartctl with the -a option. That option will pull out all of the SMART information 
about the drive

#smartctl -a /dev/sda

2.RAM 

Tool for detecting memory would be Memtest86  which needs to be installed additionally which is added automatically to your GRUB configurations.


No matter how you invoke it at boot time, once you start Memtest86, it will immediately launch and start scanning your RAM.

If there are found to be few errors, try to change a new RAM and again re-run the test.

3. Network card failures

When a network card or some other network component to which your host is connected starts to fail, you can see it in packet errors on your system. The 
ifconfig command you may have used for network troubleshooting before can also tell you about TX (transmit) or RX (receive) errors for a card.

# ifconfig eth0 | egrep "RX|TX"
          RX packets:2144865318 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2339638100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

If you start to see a lot of errors here, then it's worth troubleshooting your physical net-work components. It's possible a network card, cable, or switch port is going bad.

4. Server Temperature

A poorly cooled server can cause premature hard drive failure and premature failure in the rest of the server components as well.  

Linux provides tools that allow you to probe CPU and motherboard temperatures, and, in some cases, the temperatures of PCI devices and even fan speeds. All of this support is provided by the lm-sensors package.

Once the lm-sensors package is installed, run the sensors-detect as root :

#sensors-detect

This interactive script will probe the hardware on your system so it knows
how to query for temperature. 
Once the sensors-detect script is completed, you can pull data about your server by running the command sensors.

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +46.0°C  (high = +84.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:       +42.0°C  (high = +84.0°C, crit = +100.0°C)

coretemp-isa-0002
Adapter: ISA adapter
Core 2:       +42.0°C  (high = +84.0°C, crit = +100.0°C)

coretemp-isa-0003
Adapter: ISA adapter
Core 3:       +42.0°C  (high = +84.0°C, crit = +100.0°C)

i5k_amb-isa-0000
Adapter: ISA adapter
Ch. 0 DIMM 0:  +58.0°C  (low  = +118.0°C, high = +124.0°C)
Ch. 1 DIMM 0:  +64.0°C  (low  = +118.0°C, high = +124.0°C)
Ch. 2 DIMM 0:  +72.0°C  (low  = +118.0°C, high = +124.0°C)
Ch. 3 DIMM 0:  +64.0°C  (low  = +118.0°C, high = +124.0°C)

You could sue sensors -f for the temperatures in Fahrenheit.

Examine the airflow around the server and make sure the vents in and out of the server aren't clogged with dust.If you have the bad habit of not rack mounting your servers but instead installing a shelf and stacking servers one on top of the other, that will also contribute to poor airflow and overheating. 

I would like to conclude the article by saying that these are the few of the common failures for the hardware faults.