Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"

From: Borislav Petkov
Date: Tue Jun 28 2022 - 12:00:33 EST


On Mon, Jun 27, 2022 at 05:27:57PM +0000, Luck, Tony wrote:
> Existing default is 1023 ... which is not a good choice for anyone (except
> perhaps ostriches that want to bury their heads in the sand and ignore marginal
> DIMMs for as long as possible).

Why isn't that a good choice?

I'm sure there are error rates where this fits just fine.

> So changing the threshold to "2" would be an improvement in at least
> being right for one vendor, instead of wrong for all.

So I'm pretty sure that is not needed on AMD at all.

> Linux already had a hook in the GHES code to take an error record from
> the platform and offline a page. So this "smart" code could be done
> by BIOS or BMC just providing the resulting list of pages that should
> be taken offline to Linux.

So my worry is some firmware agent interfering with our recovery
strategy. And reportedly, there are people who don't like the firmware
recovery at all and prefer that it all be done in the OS.

Which then makes it a problem of how to synchronize with the firmware
about who does what in RAS. And we don't have any API here...

Anyway, this is just a worry I have from watching where all this is
headed.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette