[PATCH EDAC 00/13] Add a driver to report Firmware first errors (via GHES)

From: Mauro Carvalho Chehab
Date: Fri Feb 15 2013 - 08:11:58 EST


There are currently three error reporting mechanisms inside the Linux
kernel: EDAC, mcelog and GHES.

Unfortunately, these mechanisms cannot all work at the same time,
because the BIOS accessing the error registers may interfere with the
OS reading them.

So, all three mechanisms need to be integrated in order to avoid
such problems.

This patch series adds a new EDAC driver that uses "Firmware first"
APEI/GHES as its error reporting mechanism. It automatically disables
the hardware-driven EDAC drivers when GHES is enabled, so that the OS
and the BIOS do not both read from the very same error mechanisms.
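
The exclusion described above can be pictured as a first-come ownership
lock in the EDAC core. What follows is a minimal user-space Python model
of that idea, not the kernel code from this series: the class EdacCore,
its register() method and the driver name "sb_edac" used as the
hardware-driven example are assumptions chosen for illustration only.

```python
class EdacCore:
    """Toy model of an EDAC core that accepts a single owning driver."""

    def __init__(self):
        self.owner = None        # name of the driver that owns error reporting
        self.controllers = []    # (driver name, controller index) pairs

    def register(self, driver):
        """Register one memory controller; return True on success.

        The first driver to register becomes the owner. The owner may
        register again (one call per memory controller), but any other
        driver is refused.
        """
        if self.owner is not None and self.owner != driver:
            return False
        self.owner = driver
        self.controllers.append((driver, len(self.controllers)))
        return True


core = EdacCore()
print(core.register("ghes_edac"))  # firmware-first driver claims the core
print(core.register("ghes_edac"))  # second controller, same owner: accepted
print(core.register("sb_edac"))    # a different driver: refused
```

In this model the second ghes_edac registration succeeds because the
owner is unchanged, which mirrors the MC0 and MC1 registrations shown in
the test log, while a driver with a different name is turned away.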

It was tested on a "Lizard Head Pass" Intel machine, equipped with
BIOS SE5C600.86B.99.99.x059.091020121352 (09/10/2012).

Test results:

The driver binds properly to the EDAC core. This BIOS
announces and sets "Firmware first" mode:

[ 4.537704] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[ 4.547644] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[ 4.556807] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[ 4.566260] ghes_edac: If you find incorrect reports, please ask your vendor to fix its BIOS.
[ 4.575811] ghes_edac: This system has 48 DIMM sockets.
[ 4.581687] EDAC DEBUG: ghes_edac_dmidecode: DIMM0: DDR3 size = 8192 MB(ECC)
[ 4.581691] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581695] EDAC DEBUG: ghes_edac_dmidecode: DIMM3: DDR3 size = 8192 MB(ECC)
[ 4.581698] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581702] EDAC DEBUG: ghes_edac_dmidecode: DIMM6: DDR3 size = 8192 MB(ECC)
[ 4.581705] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581708] EDAC DEBUG: ghes_edac_dmidecode: DIMM9: DDR3 size = 8192 MB(ECC)
[ 4.581711] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581715] EDAC DEBUG: ghes_edac_dmidecode: DIMM12: DDR3 size = 8192 MB(ECC)
[ 4.581718] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581722] EDAC DEBUG: ghes_edac_dmidecode: DIMM15: DDR3 size = 8192 MB(ECC)
[ 4.581724] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581728] EDAC DEBUG: ghes_edac_dmidecode: DIMM18: DDR3 size = 8192 MB(ECC)
[ 4.581730] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581734] EDAC DEBUG: ghes_edac_dmidecode: DIMM21: DDR3 size = 8192 MB(ECC)
[ 4.581737] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581741] EDAC DEBUG: ghes_edac_dmidecode: DIMM24: DDR3 size = 8192 MB(ECC)
[ 4.581752] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581756] EDAC DEBUG: ghes_edac_dmidecode: DIMM27: DDR3 size = 8192 MB(ECC)
[ 4.581759] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581763] EDAC DEBUG: ghes_edac_dmidecode: DIMM30: DDR3 size = 8192 MB(ECC)
[ 4.581766] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581769] EDAC DEBUG: ghes_edac_dmidecode: DIMM33: DDR3 size = 8192 MB(ECC)
[ 4.581772] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581776] EDAC DEBUG: ghes_edac_dmidecode: DIMM36: DDR3 size = 8192 MB(ECC)
[ 4.581778] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581782] EDAC DEBUG: ghes_edac_dmidecode: DIMM39: DDR3 size = 8192 MB(ECC)
[ 4.581784] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581788] EDAC DEBUG: ghes_edac_dmidecode: DIMM42: DDR3 size = 8192 MB(ECC)
[ 4.581791] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.581795] EDAC DEBUG: ghes_edac_dmidecode: DIMM45: DDR3 size = 8192 MB(ECC)
[ 4.581797] EDAC DEBUG: ghes_edac_dmidecode: type 24, detail 0x80, width 72(total 64)
[ 4.582724] EDAC MC0: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[ 4.591145] EDAC MC1: Giving out device to 'ghes_edac.c' 'ghes_edac': DEV ghes
[ 4.599524] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

However, with this BIOS, "Firmware first" error reporting is not
working. The errors are only seen via the mcelog error mechanism:

# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
MISC 20404c4c86 ADDR 320000
TIME 1360931174 Fri Feb 15 07:26:14 2013
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
STATUS 8c00004000010090 MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.

So, I was unable to test the GHES->EDAC error report method.
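
As a cross-check of the mcelog record above, the STATUS value can be
decoded by hand. The sketch below is a small Python helper, not part of
this series. It assumes the architectural IA32_MCi_STATUS layout as I
understand it from the Intel SDM (flag bits 63..57, corrected error
count in bits 52:38 when CMCI is supported, MCA error code in bits 15:0)
and the memory controller error code format 0000 0000 1MMM CCCC. The
names decode_status, FLAGS and MEM_REQUESTS are made up for this example.

```python
STATUS = 0x8C00004000010090  # STATUS value from the mcelog record above

# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM).
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory controller error code (bits 6:4).
MEM_REQUESTS = {0: "generic", 1: "read", 2: "write",
                3: "address/command", 4: "scrub"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into the fields used here."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF                       # MCA error code, bits 15:0
    result = {
        "flags": flags,
        "mca_code": code,
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
    }
    if code & 0xFF80 == 0x0080:                  # 0000 0000 1MMM CCCC
        result["request"] = MEM_REQUESTS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


info = decode_status(STATUS)
print(info["flags"])      # VAL, MISCV and ADDRV set; UC and PCC clear
print(info["request"], info["channel"], info["corrected_count"])
# prints: read 0 1
```

With UC clear this is a corrected error, MISCV and ADDRV match the
"register valid" lines, and the error code decodes to a memory read
error on channel 0, in agreement with what mcelog printed.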

Mauro Carvalho Chehab (13):
edac: lock module owner to avoid error report conflicts
ghes: move structures/enum to a header file
ghes: add the needed hooks for EDAC error report
edac: add a new memory layer type
ghes_edac: Register at EDAC core the BIOS report
ghes_edac: Allow registering more than once
edac: add support for raw error reports
ghes_edac: add support for reporting errors via EDAC
ghes_edac: do a better job of filling EDAC DIMM info
edac: better report error conditions in debug mode
edac: initialize the core earlier
ghes_edac.c: Don't credit the same memory dimm twice
ghes_edac: Improve driver's printk messages

drivers/acpi/apei/ghes.c | 64 +++------
drivers/edac/Kconfig | 23 ++++
drivers/edac/Makefile | 1 +
drivers/edac/edac_core.h | 17 +++
drivers/edac/edac_mc.c | 136 ++++++++++++++-----
drivers/edac/edac_mc_sysfs.c | 7 +-
drivers/edac/edac_module.c | 2 +-
drivers/edac/ghes_edac.c | 313 +++++++++++++++++++++++++++++++++++++++++++
include/acpi/ghes.h | 72 ++++++++++
include/linux/edac.h | 5 +
10 files changed, 560 insertions(+), 80 deletions(-)
create mode 100644 drivers/edac/ghes_edac.c
create mode 100644 include/acpi/ghes.h

--
1.8.1.2
