Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

From: Borislav Petkov
Date: Thu Jan 07 2021 - 02:08:29 EST


On Wed, Jan 06, 2021 at 11:13:53AM -0800, Paul E. McKenney wrote:
> Not yet, it isn't! Well, except in -rcu. ;-)

Of course it is - saying "This commit" in this commit's commit message
is very much a tautology. :-)

> You are suggesting dropping mce_missing_cpus and just doing this?
>
> if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))

Yes.

And pls don't call it "holdout CPUs" and change the order so that it is
more user-friendly (yap, you don't need __func__ either):

[ 78.946153] mce: Not all CPUs (24-47,120-143) entered the broadcast exception handler.
[ 78.946153] Kernel panic - not syncing: Timeout: MCA synchronization.

or so.
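
To make the idea concrete, here is a rough userspace sketch, not the
actual kernel code. struct mask, mask_andnot(), mask_to_list() and
report_missing() are made-up stand-ins for cpumask_t, cpumask_andnot()
and the %*pbl printk format, and the 256-CPU limit is an arbitrary
assumption. It computes "online but not checked in" in place, the way the
single-mask variant above would, and formats the result as a range list:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 256			/* assumption: illustrative limit only */
#define WORDS   (NR_CPUS / 64)

/* Hypothetical stand-in for the kernel's cpumask_t. */
struct mask { uint64_t bits[WORDS]; };

static void mask_set_range(struct mask *m, int first, int last)
{
	for (int cpu = first; cpu <= last; cpu++)
		m->bits[cpu / 64] |= UINT64_C(1) << (cpu % 64);
}

static bool mask_test(const struct mask *m, int cpu)
{
	return (m->bits[cpu / 64] >> (cpu % 64)) & 1;
}

/*
 * dst = a & ~b. Returns false if dst ends up empty, which is the
 * return convention of cpumask_andnot(). dst may alias b.
 */
static bool mask_andnot(struct mask *dst, const struct mask *a,
			const struct mask *b)
{
	uint64_t any = 0;

	for (int i = 0; i < WORDS; i++) {
		dst->bits[i] = a->bits[i] & ~b->bits[i];
		any |= dst->bits[i];
	}
	return any != 0;
}

/* Format the set bits as a range list such as "24-47,120-143". */
static void mask_to_list(const struct mask *m, char *buf, size_t len)
{
	size_t off = 0;

	buf[0] = '\0';
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int start, n;

		if (!mask_test(m, cpu))
			continue;
		start = cpu;
		while (cpu + 1 < NR_CPUS && mask_test(m, cpu + 1))
			cpu++;
		if (start == cpu)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", start, cpu);
		if (n < 0 || (size_t)n >= len - off)
			break;		/* output truncated, stop here */
		off += (size_t)n;
	}
}

/*
 * Overwrites *present with the CPUs that are online but did not check
 * in. Returns false (and leaves msg alone) if none are missing.
 */
static bool report_missing(const struct mask *online, struct mask *present,
			   char *msg, size_t len)
{
	char list[128];

	if (!mask_andnot(present, online, present))
		return false;	/* every online CPU checked in */
	mask_to_list(present, list, sizeof(list));
	snprintf(msg, len,
		 "mce: Not all CPUs (%s) entered the broadcast exception handler.",
		 list);
	return true;
}
```

With CPUs 0-143 online and only 0-23 and 48-119 checked in,
report_missing() produces the first line of the example output above.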

And it's fine if it appears twice, as long as it is the same info - the
MCA code is one complex mess so you can probably guess why I'd like to
have new stuff added to it be as simple as possible.

> I was worried (perhaps unnecessarily) about the possibility of CPUs
> checking in during the printout operation, which would set rather than
> clear the bit. But perhaps the possible false positives that Tony points
> out make this race not worth worrying about.
>
> Thoughts?

Yah, apparently it is not going to be as precise a report as you wanted,
but at least it'll tell you which *sockets* you can rule out, if not
cores.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette