EDAC Overview

EDAC (Error Detection and Correction) is a set of Linux kernel modules for handling hardware-related errors.

Its major focus has been ECC memory error handling, however it also detects and reports PCI bus parity errors.

EDAC通过sysfs方式获得硬件的信息。检测到问题时,在syslog中可以看到这样的日志:

kernel: EDAC MC1: CE row 0, channel 0, label "": Corrected error (Socket=1 channel=0 dimm=0)

有一个叫做edac-util的工具,可以用来查看更详细的状态报表。

$ edac-util -r
mc1: csrow0: ch0: 43722040 Corrected Errors

上面的输出表示有较多的可修复错误(Corrected Errors - CE)被发现。

edac-util的安装,在RHEL或CentOS上,可以通过yum安装:

yum install edac-util -y

Links

[http://www.kernel.org/doc/Documentation/edac.txt EDAC - Error Detection And Correction]

[http://bluesmoke.sourceforge.net EDAC Project]