Re: [External] [PATCH] x86/mce: count the number of occurrences of each MCE severity

From: Rui Qi
Date: Mon Jun 24 2024 - 00:49:21 EST


From: Rui Qi <qirui.001@xxxxxxxxxxxxx>

Hi Tony,

> You seem to have problems with the e-mail infrastructure. I got a few extra copies
> of this in HTML format. This one is in plain text, but the From: header says "$(name)"
>

Sorry, some problem with my thunderbird mail agent. I will use git send-mail instead from now on.
>
>>> So you either covered a case in the severities table, or you didn't. Does it
>>> help to know that you covered a case multiple times?
>>>
>>
>> In the fault injection test in the laboratory, we inject errors multiple
>> times and need a counter to tell us how many times each case has
>> occurred and compare it with the expected number to determine the test
>> results
>
> In my testing on Intel/x86 I don't always see a 1:1 mapping between my
> test, and the severities rule. This is because of a h/w race between the
> memory controller reporting the error when it sees an uncorrectable ECC
> issue, and the core trying to consume the poisoned data. If the memory
> controller signal wins the race, Linux takes the page offline and there isn't
> a poison consumption error, just a page fault.
>
>> In the production environment, the counter can reflect the actual number
>> of times each MCE error type occurs, which can help us detect the MCE
>> error distribution of large-scale Data center infrastructure
>
> That could be useful.
>
Thank you for your expertise!

>>>> Due to the limitation of char type, the maximum supported statistics are
>>>> currently 255 times
>>>>
>>
>> How about changing char to u64, which is enough for real-world
>> situations and won't waste a lot of memory?
>
> u64 seems like serious overkill. A change from "unsigned char" to "unsigned int"
> would keep track of 4 billion errors. That seems like plenty :-)
>> -Tony
Yes, unsinged int is far enough.

BTW, if you dont mind, I will send a V2 patch based on our discussion.