RE: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"

From: Luck, Tony
Date: Mon Jun 27 2022 - 13:28:11 EST


>> 1) Change threshold to "2".
>
> Kinda unconditional that... we haven't talked to other vendors even.

Existing default is 1023 ... which is not a good choice for anyone (except
perhaps ostriches that want to bury their heads in the sand and ignore marginal
DIMMs for as long as possible).

So changing the threshold to "2" would be an improvement in at least being right for
one vendor, instead of wrong for all.

If someone comes up with a different value for another CPU or DIMM vendor
combination ... would we have the RAS_CEC driver check boot_cpu_data.x86_vendor
and SMBIOS to set a different default?
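
To make that question concrete, a per-vendor default could be a small
override table with the proposed "2" as the fallback. This is only a userspace
sketch, not RAS_CEC code: the vendor enum, the table, the function name and
the override value of 8 are all made up for illustration. The only number
taken from this thread is the proposed default of 2. A real version would key
off boot_cpu_data.x86_vendor and possibly SMBIOS data.

```c
#include <stddef.h>

/* Hypothetical vendor IDs standing in for boot_cpu_data.x86_vendor. */
enum cpu_vendor { VENDOR_A, VENDOR_B, VENDOR_UNKNOWN };

/* Proposed default from this thread. */
#define CEC_DEFAULT_THRESHOLD 2u

struct cec_override {
	enum cpu_vendor vendor;
	unsigned int threshold;
};

/*
 * Vendors that asked for something other than the default.
 * The entry below is a placeholder, not a real recommendation.
 */
static const struct cec_override cec_overrides[] = {
	{ VENDOR_B, 8u },
};

/* Return the override for this vendor if one exists, else the default. */
unsigned int cec_default_threshold(enum cpu_vendor vendor)
{
	size_t i;

	for (i = 0; i < sizeof(cec_overrides) / sizeof(cec_overrides[0]); i++) {
		if (cec_overrides[i].vendor == vendor)
			return cec_overrides[i].threshold;
	}
	return CEC_DEFAULT_THRESHOLD;
}
```

With a table like this, a new vendor value becomes a one-line addition and
everyone else keeps the common default.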

>> 2) Do very smart platform dependent things
>
> If you mean AI, that probably won't happen in the kernel.

Agreed. You don't even need the "probably". This isn't kernel material.

Linux already has a hook in the GHES code to take an error record from
the platform and offline a page. So this "smart" code could be done
by BIOS or BMC just providing the resulting list of pages that should
be taken offline to Linux.
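
As a rough illustration of what "a list of pages" means on the receiving
side: the platform would report physical addresses, and the consumer reduces
them to one page frame number per affected page. The sketch below is
userspace C, assumes 4 KiB pages, and uses a made-up function name. It does
not reproduce the actual GHES code path or do the offlining itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Convert n physical addresses into unique page frame numbers.
 * pfns must have room for at least n entries.
 * Returns the number of unique PFNs written, in first-seen order.
 */
size_t addrs_to_unique_pfns(const uint64_t *addrs, size_t n, uint64_t *pfns)
{
	size_t count = 0;
	size_t i, j;

	for (i = 0; i < n; i++) {
		uint64_t pfn = addrs[i] >> SKETCH_PAGE_SHIFT;

		/* Linear scan is fine for the short lists expected here. */
		for (j = 0; j < count; j++) {
			if (pfns[j] == pfn)
				break;
		}
		if (j == count)
			pfns[count++] = pfn;
	}
	return count;
}
```

Several reported addresses inside the same page collapse to a single entry,
so each page would only be offlined once.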

-Tony