Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"

From: Borislav Petkov
Date: Thu Jun 30 2022 - 03:11:40 EST


On Tue, Jun 28, 2022 at 04:51:49PM +0000, Luck, Tony wrote:
> It fails to use the capabilities of h/w and Linux to avoid a fatal
> error in the future. Corrected errors are (sometimes) a predictor of
> marginal/aging memory. Copying data out of a failing page while there
> are just corrected errors can avoid losing that whole page later.

Hm, for some reason you're trying to persuade me that 2 correctable
errors per page mean that that location is going to turn uncorrectable
and thus all pages which get two CEs per 24h should immediately be
offlined.

It might, and it is commonly accepted that CEs in a DIMM are likely to
lead to UEs in the future, but not necessarily. That DIMM could trigger
those CEs for years and, if the ECC function in the memory controller is
good enough, it could handle those CEs and keep on going like nothing's
happened.

I.e., I'm not buying this unconditional 2 CEs/24h without any sensible
proof. That "study" simply says that someone has done some evaluation
and here's our short-term solution and you should accept it - no
questions asked.

Hell, that study is even advocating the opposite:

"not all the faults (or the pages with the CE rate satisfying a certain
condition) are equally prone to future UEs. The CE rate in the past
period is not a good predictive indicator of future UEs."

So what you're doing is punishing DIMMs which can "wobble" this way with
a couple of CEs for years without causing any issues otherwise.
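
To make the policy being argued about concrete, here is a toy sketch of
per-page CE counting with a threshold and a periodic decay. It is only
an illustration under simplifying assumptions, not the actual CEC code:
the names (ce_tracker, ce_tracker_report, MAX_TRACKED) are hypothetical,
the decay simply forgets everything once per period, and "offlining" is
reduced to a boolean return value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64	/* hypothetical table size, for illustration only */

struct ce_entry {
	uint64_t pfn;		/* page frame number that reported a CE */
	unsigned int count;	/* CEs seen in the current decay period */
};

struct ce_tracker {
	struct ce_entry e[MAX_TRACKED];
	size_t used;
	unsigned int threshold;	/* offline once count reaches this value */
};

static void ce_tracker_init(struct ce_tracker *t, unsigned int threshold)
{
	t->used = 0;
	t->threshold = threshold;
}

/* Assumed to run once per decay period (e.g. every 24h): forget old CEs. */
static void ce_tracker_decay(struct ce_tracker *t)
{
	t->used = 0;
}

/* Record one CE for @pfn; return true if the page should be offlined now. */
static bool ce_tracker_report(struct ce_tracker *t, uint64_t pfn)
{
	size_t i;

	for (i = 0; i < t->used; i++) {
		if (t->e[i].pfn == pfn)
			break;
	}

	if (i == t->used) {
		if (t->used == MAX_TRACKED)
			return false;	/* table full: this toy model drops the report */
		t->e[i].pfn = pfn;
		t->e[i].count = 0;
		t->used++;
	}

	t->e[i].count++;
	if (t->e[i].count >= t->threshold) {
		/* Stop tracking the page: move the last entry into its slot. */
		t->e[i] = t->e[t->used - 1];
		t->used--;
		return true;
	}

	return false;
}
```

With a threshold of 2, a page is flagged on its second CE within one
period. A page that throws a single CE per period is never flagged, and
with a higher threshold the same two CEs are simply absorbed.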

> Explain further. Apart from the "ostrich" case I'm not sure what they
> are.

Actually, you should explain: why this drastic measure of only two
correctable errors, all of a sudden?

The most common failure in DIMMs is single-device failure and modern ECC
schemes can handle those just fine. So what's up?

> It's far more a property of DIMMs than of the CPU. Unless AMD are
> using some DECTED or better level of ECC for memory.

Well, it does the usual ECC recovery from any number of bit flips in a
single DRAM device:

https://www.amd.com/system/files/documents/advanced-memory-device-correction.pdf

And the papers quoted there basically say that the majority of failures
are to single DRAM devices which the ECC scheme can handle just fine.

And multi-device failures are a very small percentage of all the
failures.

Which makes me wonder even more why your change is needed at all.

I'd understand if this were some very paranoid HPC system doing very
important computations which can't afford to suffer UEs, so it'll go and
proactively offline pages at the very first sign of trouble. But the
data says that the ECC scheme can handle single-device failure just fine
and those devices fail only very seldom and after a loooong time.

So, if anything, your change should be Intel-only.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette