Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
From: M K, Muralidhara
Date: Fri Oct 27 2023 - 01:11:09 EST
On 10/26/2023 7:10 PM, Borislav Petkov wrote:
On Thu, Oct 26, 2023 at 09:05:51AM -0400, Yazen Ghannam wrote:
Post-processing is one of the features that Avadhut implemented.
https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536419803f236e6118
Hi Yazen, thanks for pointing to this commit. Yes, I do remember.
Yes, now try to decode the error with rasdaemon this way, by supplying
the fields.
Then explain step-by-step what you've done in the commit message and in
a documentation file in Documentation/ras/ so that people can find it
and can actually do the decoding themselves.
It needs to be absolutely easy to decode those errors. Not tell people:
"go look for the error description in the PPR".
Yes, rasdaemon has an offline decoding option.
For example:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca
2023-10-26 23:51:34 -0500, Unified Memory Controller (bank=0), mca: DRAM
ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx:
generic, level: L3/generic', mci: Error_overflow CECC, Locn:
memory_channel=0,csrow=0, Error Msg: Corrected error, no action required.
Note the error string "mca: DRAM ECC error. Ext Err Code: 0" in the output.
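For anyone who wants to cross-check the fields in that output by hand, here is a minimal sketch that pulls the flag bits and error codes out of the raw values. It is not how rasdaemon implements the decoding. The bit positions are my reading of the kernel's MCI_STATUS_* definitions and the SMCA MCA_IPID layout, so treat them as assumptions and verify against the PPR for your family/model. The function names (decode_status, decode_ipid) are made up for illustration.

```python
# Assumed bit positions for MCA_STATUS on SMCA systems (verify against the PPR).
STATUS_FLAGS = {
    "Val": 63, "Overflow": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}


def decode_status(status):
    """Split a raw MCA_STATUS value into set flags and error codes."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Assumption: extended error code lives in bits 21:16 on SMCA.
        "ext_err_code": (status >> 16) & 0x3F,
        # Architectural error code in bits 15:0.
        "err_code": status & 0xFFFF,
    }


def decode_ipid(ipid):
    """Split a raw MCA_IPID value (assumed layout) into its fields."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,   # bits 63:48
        "hwid": (ipid >> 32) & 0xFFF,        # bits 43:32
        "instance_id": ipid & 0xFFFFFFFF,    # bits 31:0
    }


if __name__ == "__main__":
    # Values taken from the rasdaemon command line in this mail.
    st = decode_status(0xDC2040000000011B)
    ip = decode_ipid(0x0000609600092F00)
    print("flags:", ",".join(st["flags"]))
    print("ext_err_code: %d, err_code: %#06x" % (st["ext_err_code"], st["err_code"]))
    print("hwid: %#x, mca_type: %#x" % (ip["hwid"], ip["mca_type"]))
```

With the values above this reports Overflow and CECC set and an extended error code of 0, which lines up with "Error_overflow CECC" and "Ext Err Code: 0" in the rasdaemon output.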
We can also pass a particular family/model to decode against, e.g. for MI300A:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 \
    --smca --family 0x19 --model 0x90 --bank 19
2023-10-26 23:52:09 -0500, Unified Memory Controller (bank=19), mca:
DRAM On Die ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic
read, tx: generic, level: L3/generic', mci: Error_overflow CECC, Locn:
memory_die_id=1, Error Msg: Corrected error, no action required.
Here the error string is "mca: DRAM On Die ECC error. Ext Err Code: 0".
Thanks for the input. I will add these steps to the commit message and to
the Documentation as well.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette