Re: [PATCH v4] EDAC/ghes: Setup DIMM label from DMI and use it in error reports

From: Borislav Petkov
Date: Tue Jun 02 2020 - 11:48:46 EST


On Thu, May 28, 2020 at 12:13:06PM +0200, Robert Richter wrote:
> The ghes driver reports errors with 'unknown label' even if the actual
> DIMM label is known, e.g.:
>
> EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
> module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
> page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
> node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
> location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
> DRAM memory)
>
> Fix this by using struct dimm_info's label string in error reports:
>
> EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
> rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
> page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
> node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
> location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
> DRAM memory)
>
> The labels are initialized by reading the bank and device strings from
> DMI. Now, the label information can also be read from sysfs. E.g. a
> ThunderX2 system will show the following:
>
> /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
> /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
> /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
> /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
> /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
> /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
> /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
> /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
> /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
> /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
> /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
> /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
> /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
> /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
> /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
> /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0
>
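For readers following along, the label construction described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the patch: set_label_from_dmi() and LABEL_LEN are hypothetical names, and the "<bank> <device>" format is inferred from the sysfs output quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical size; the kernel's own label buffer length may differ. */
#define LABEL_LEN 32

/*
 * Overwrite label with "<bank> <device>" only when both DMI strings are
 * present and non-empty; otherwise leave the existing label untouched.
 * Returns true if the label was updated.
 */
static bool set_label_from_dmi(char label[LABEL_LEN],
                               const char *bank, const char *device)
{
    if (!bank || !*bank || !device || !*device)
        return false;

    /* snprintf() truncates if needed and always NUL-terminates. */
    snprintf(label, LABEL_LEN, "%s %s", bank, device);
    return true;
}
```

With bank "N0" and device "DIMM_A0" this yields "N0 DIMM_A0". With either string missing, the caller's previous label is kept.
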
> Since dimm_labels can be rewritten, that label will be used in a later
> error report:
>
> # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
> # # some error injection here
> # dmesg | grep foobar
> [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
> module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
> page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
> node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
> location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
> memory)
>
> Signed-off-by: Robert Richter <rrichter@xxxxxxxxxxx>
> ---
> v4:
>
> * dimm->label: Only update dimm->label if bank/device is found in
> the SMBIOS table, this keeps current behavior for machines that do
> not provide this information.
>
> * e->location: Keep the current behavior of how e->location is written.
>
> * e->label: Use dimm->label if a DIMM was found by its handle and
> "unknown memory" otherwise. This aligns with the edac_mc
> implementation.
>
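The e->label rule in the changelog above amounts to a lookup with a fallback. A minimal userspace sketch, assuming a flat array of DIMM entries keyed by their SMBIOS handle; struct dimm_entry and label_for_handle() are hypothetical names for illustration and are not taken from ghes_edac.c:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct dimm_info fields used here. */
struct dimm_entry {
    uint16_t handle;    /* SMBIOS memory device handle */
    const char *label;  /* e.g. "N0 DIMM_A0", or a user-written label */
};

/*
 * Return the label of the DIMM matching the handle from the error
 * record, or "unknown memory" when no DIMM matches.
 */
static const char *label_for_handle(const struct dimm_entry *dimms,
                                    size_t n, uint16_t handle)
{
    for (size_t i = 0; i < n; i++) {
        if (dimms[i].handle == handle)
            return dimms[i].label;
    }
    return "unknown memory";
}
```

Because the lookup returns whatever label is currently stored, a label rewritten through sysfs would show up in later reports, matching the foobar example above.
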
> Signed-off-by: Robert Richter <rrichter@xxxxxxxxxxx>
> ---
> drivers/edac/ghes_edac.c | 37 ++++++++++++++++++++++++++-----------
> 1 file changed, 26 insertions(+), 11 deletions(-)

Yap, looks good. I'll queue it after the merge window.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette