Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

From: Dave Hansen
Date: Thu Oct 12 2023 - 11:52:26 EST


On 10/12/23 04:46, Sironi, Filippo wrote:
> There's correlation across the errors that we're seeing; indeed,
> we're looking at the same row being responsible for multiple CPUs
> tripping and running into #MC. I still don't like the full lack of
> visibility; it's not uncommon in a large fleet to take a server out
> of production, replace a DIMM, and shortly after take it out of
> production again to replace another DIMM just because some of the
> errors weren't properly logged.

So you had two nearly simultaneous DIMM failures. The first failed,
filled up the buffer and then the second failed, but there was no room.
The second failed *SO* soon after the first that there was no
opportunity to empty the buffer between.

Right?

How do you know that storing 8 pages of records will catch this case as
opposed to storing 2?

>> Is there any way that the size of the pool can be more automatically
>> determined? Is the likelihood of a bunch of errors proportional to the
>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>
>> Could the pool be emptied more aggressively so that it does not fill up?

You didn't really address the additional questions I posed there.

I'll add one more: how many of the messages are duplicates or
*effectively* duplicates? Or is it hard to determine, at the time the
entries are made, that they are duplicates?
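
To make "effectively duplicates" concrete, here is a minimal userspace
sketch, not kernel code. It assumes two records count as the same error
when they share a bank and fall in the same aligned address region.
struct err_rec, struct err_log, log_error(), MAX_RECS and the mask are
all hypothetical names chosen for illustration; whether bank plus masked
address is the right key for real hardware is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECS 64  /* hypothetical log capacity */

/* Hypothetical record: a small subset of what a real error record carries. */
struct err_rec {
	uint8_t  bank;
	uint64_t status;
	uint64_t addr;
	unsigned count;   /* how many times this error was seen */
};

struct err_log {
	struct err_rec recs[MAX_RECS];
	size_t n;
};

/* Assumption: same bank and same masked address means "effectively" the same. */
static int same_error(const struct err_rec *r, uint8_t bank,
		      uint64_t addr, uint64_t mask)
{
	return r->bank == bank && (r->addr & mask) == (addr & mask);
}

/*
 * Returns 1 if a new record was stored, 0 if the error was folded into
 * an existing record, -1 if the log is full and the error was dropped.
 */
static int log_error(struct err_log *log, uint8_t bank, uint64_t status,
		     uint64_t addr, uint64_t mask)
{
	assert(log != NULL);

	for (size_t i = 0; i < log->n; i++) {
		if (same_error(&log->recs[i], bank, addr, mask)) {
			log->recs[i].count++;
			return 0;
		}
	}

	if (log->n == MAX_RECS)
		return -1;

	log->recs[log->n++] = (struct err_rec){
		.bank = bank, .status = status, .addr = addr, .count = 1,
	};
	return 1;
}
```

With a scheme like this, a burst of errors from one bad row costs one
slot plus a counter rather than one slot per CPU that tripped. The
linear scan is only acceptable here because the sketch is tiny.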

It _should_ also be fairly easy to enlarge the buffer on demand, say, if
it got half full. What's the time scale over which the buffer filled
up? Did a single #MC fill it up?
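
As a rough illustration of the "enlarge when half full" idea, here is a
minimal userspace sketch that only does the bookkeeping. The names
(struct rec_pool, pool_reserve(), pool_maybe_grow()), the 128-byte
record size and the 4096-byte page size are assumptions made for the
example, not the kernel's. In the kernel, the growth step would
presumably have to be deferred to a context where allocating memory is
allowed, since that cannot be done from #MC context.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES   4096u  /* assumed page size */
#define RECORD_BYTES 128u   /* hypothetical record size */
#define RECS_PER_PAGE (PAGE_BYTES / RECORD_BYTES)

struct rec_pool {
	size_t pages;      /* pages currently backing the pool */
	size_t max_pages;  /* upper bound on growth */
	size_t capacity;   /* total record slots */
	size_t used;       /* slots currently reserved */
};

static void pool_init(struct rec_pool *p, size_t pages, size_t max_pages)
{
	assert(p != NULL && pages > 0 && pages <= max_pages);
	p->pages = pages;
	p->max_pages = max_pages;
	p->capacity = pages * RECS_PER_PAGE;
	p->used = 0;
}

/* Add one page once the pool is at least half full. Returns 1 if it grew. */
static int pool_maybe_grow(struct rec_pool *p)
{
	if (p->used * 2 < p->capacity || p->pages >= p->max_pages)
		return 0;
	p->pages++;
	p->capacity += RECS_PER_PAGE;
	return 1;
}

/* Reserve one record slot. Returns 0 on success, -1 if the pool is full. */
static int pool_reserve(struct rec_pool *p)
{
	if (p->used >= p->capacity)
		return -1;
	p->used++;
	pool_maybe_grow(p);  /* in a kernel this would be deferred work */
	return 0;
}
```

Starting at 2 pages with a cap of 8, the pool in this sketch grows by
one page each time usage reaches half of capacity, so a machine that
never sees a burst never pays for the extra pages. It does not help if
a single burst fills the pool faster than the deferred growth can run,
which is why the time scale matters.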

I really think we need to understand what the problem is and have _some_
confidence that the proposed solution will fix that, even if we're just
talking about a new config option.