Re: [PATCH] x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks

From: Borislav Petkov
Date: Wed Oct 26 2022 - 16:12:29 EST


On Wed, Oct 26, 2022 at 07:44:17PM +0000, Yazen Ghannam wrote:
> 1) Apply the patch I submitted as a simple fix/workaround for the presented
> symptom. I tried to keep it small and well described to be a stable backport.
> Obviously I wrote it without knowing the shared kobject behavior isn't ideal.

We'll see.

> 2) Address the shared kobject thing.
> Here are some options:
> a. Only set up the thresholding kobject on a single CPU per "AMD Node".
> Technically MCA Bank 4 is "shared" on legacy systems. But AFAICT from
> looking at old BKDG docs, in practice only the "Node Base Core" can access
> the registers. This behavior is controlled by a bit in NB which BIOS is
> supposed to set. Maybe some BIOSes don't do this, but I think that's a
> "broken BIOS on legacy system" issue if so.

I guess we can do that. And I even think we have some code which finds
out which CPU the NBC is...

/me greps a bit:

ah, there it is: get_nbc_for_node() in arch/x86/kernel/cpu/mce/inject.c.
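
A rough userspace sketch of what option a. would boil down to, assuming CPUs
are numbered contiguously within a node so that the NBC is simply the first
CPU of its node. The helper names and the cores_per_node parameter below are
made up for illustration and are not kernel interfaces; the real lookup is
whatever get_nbc_for_node() does.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: first CPU of a node.
 * ASSUMPTION: CPUs are numbered contiguously per node.
 */
static unsigned int node_base_core(unsigned int node_id,
				   unsigned int cores_per_node)
{
	assert(cores_per_node > 0);
	return node_id * cores_per_node;
}

/* Hypothetical helper: node a CPU belongs to, under the same assumption. */
static unsigned int cpu_to_node_id(unsigned int cpu,
				   unsigned int cores_per_node)
{
	assert(cores_per_node > 0);
	return cpu / cores_per_node;
}

/*
 * Only the node base core would set up the bank 4 thresholding kobject;
 * every other CPU on the node would skip it, so nothing is shared.
 */
static bool owns_bank4_kobject(unsigned int cpu, unsigned int cores_per_node)
{
	unsigned int node = cpu_to_node_id(cpu, cores_per_node);

	return cpu == node_base_core(node, cores_per_node);
}
```

With 8 cores per node, CPUs 0 and 8 would own the kobject and CPUs 1-7 and
9-15 would not, so the hotplug path has no shared refcount to get wrong.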


> b. Disable the MCA Thresholding interface for Families before 0x17.

Can't. It is user-visible and you don't know for sure whether someone is
using it or not.

Believe me, I have been wanting to disable this thing forever. I've
never heard of anyone using it and all the energy we put in it was for
nothing. :-\

We could try to deprecate it, though, make it default=n in Kconfig and
see who complains. And after a couple of releases, kill it.

> This is an undocumented interface,

Of course it is documented - it is in the old BKDGs.

> and I don't know if anyone is using it on older systems.

Yap.

> The issue we're discussing here started because of a splat during
> suspend/resume/CPU hotplug. In disable_err_thresholding(), we disable
> MCA Thresholding for bank 4 on Family 15h, so there's some precedent.
>
> c. Do nothing at the moment. I *really* want to clean up the MCA
> Thresholding interface, and the shared kobject thing may get resolved
> in that.

Clean it up how exactly?

Put it behind a Kconfig item, disable it and remove it after a while?

:-)

If so, I wouldn't mind. No one's using this. At least I haven't heard of
a single bug report or of a use case. Only when CPU hotplug explodes and
that thing is involved, only then.

Might as well remove it. And then remove it in the hardware too. RAS
folks would love to get rid of some of that crap which takes up verif
resources for no good reason.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette