Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions

From: M K, Muralidhara
Date: Fri Oct 27 2023 - 01:11:09 EST




On 10/26/2023 7:10 PM, Borislav Petkov wrote:

On Thu, Oct 26, 2023 at 09:05:51AM -0400, Yazen Ghannam wrote:
Post-processing is one of the features that Avadhut implemented.

https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536419803f236e6118


Hi Yazen, thanks for pointing to this commit. Yes, I do remember.


Yes, now try to decode the error with rasdaemon this way, by supplying
the fields.

Then explain step-by-step what you've done in the commit message and in
a documentation file in Documentation/ras/ so that people can find it
and can actually do the decoding themselves.

It needs to be absolutely easy to decode those errors. Not tell people:
"go look for the error description in the PPR".

Yes, we have an offline decoding option in rasdaemon.

For example:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca
2023-10-26 23:51:34 -0500, Unified Memory Controller (bank=0), mca: DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', mci: Error_overflow CECC, Locn: memory_channel=0,csrow=0, Error Msg: Corrected error, no action required.

Observed the error string as "mca: DRAM ECC error. Ext Err Code: 0"
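
To cross-check the decoded string by hand, here is a minimal sketch (not rasdaemon code) that pulls the relevant fields out of the --status value used above. The bit positions are assumptions based on my reading of the MCA_STATUS layout (Val=63, Overflow=62, UC=61, CECC=46, extended error code in bits 21:16, error code in bits 15:0), and the function name is only for illustration:

```python
# Hypothetical helper, not part of rasdaemon: extracts a few MCA_STATUS fields.
# Bit positions are assumptions taken from the MCA_STATUS layout as I understand it.
STATUS = 0xdc2040000000011b  # value passed to --status in the example above


def decode_status(status):
    """Return selected MCA_STATUS fields as a dict of integers."""
    err_code = status & 0xffff
    return {
        "valid": (status >> 63) & 1,
        "overflow": (status >> 62) & 1,
        "uncorrected": (status >> 61) & 1,
        "cecc": (status >> 46) & 1,
        "ext_err_code": (status >> 16) & 0x3f,
        "err_code": err_code,
        # Memory error format 0000 0001 RRRR TTLL in the low 16 bits
        "mem_tx": (err_code >> 4) & 0xf,   # RRRR, 1 = generic read
        "tx_type": (err_code >> 2) & 0x3,  # TT, 2 = generic
        "level": err_code & 0x3,           # LL, 3 = generic
    }


if __name__ == "__main__":
    for name, value in decode_status(STATUS).items():
        print(f"{name}: {value:#x}")
```

The overflow and CECC bits are set, UC is clear and the extended error code is 0, which lines up with "Error_overflow CECC" and "Ext Err Code: 0" in the rasdaemon output.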


Also, we can pass a particular family/model to decode, e.g. for MI300A:

$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca --family 0x19 --model 0x90 --bank 19
2023-10-26 23:52:09 -0500, Unified Memory Controller (bank=19), mca: DRAM On Die ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', mci: Error_overflow CECC, Locn: memory_die_id=1, Error Msg: Corrected error, no action required.

Observed the error string as "mca: DRAM On Die ECC error. Ext Err Code: 0"
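
Both runs report "Unified Memory Controller" because the bank type comes from the --ipid value. A small sketch of that lookup, assuming the HWID sits in bits 43:32 and the McaType in bits 63:48 of MCA_IPID, and that HWID 0x96 with McaType 0 denotes the UMC (my understanding of the kernel's SMCA tables; the helper name is hypothetical):

```python
# Hypothetical helper, not part of rasdaemon: splits MCA_IPID into HWID and McaType.
# Field positions and the UMC HWID value are assumptions, see the text above.
IPID = 0x0000609600092f00  # value passed to --ipid in the examples above

UMC_HWID = 0x96     # assumed HWID for the Unified Memory Controller
UMC_MCATYPE = 0x0   # assumed McaType for the Unified Memory Controller


def decode_ipid(ipid):
    """Return (hwid, mcatype) taken from the upper 32 bits of MCA_IPID."""
    high = (ipid >> 32) & 0xffffffff
    hwid = high & 0xfff
    mcatype = (high >> 16) & 0xffff
    return hwid, mcatype


if __name__ == "__main__":
    hwid, mcatype = decode_ipid(IPID)
    print(f"hwid={hwid:#x} mcatype={mcatype:#x}")
    print("UMC bank" if (hwid, mcatype) == (UMC_HWID, UMC_MCATYPE) else "other bank type")
```

The family/model arguments then only select which description table is used for that bank type, which is why the string changes to "DRAM On Die ECC error" for MI300A.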

Thanks for the input. I will add the steps to the commit message and to the Documentation as well.


Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette