Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs
From: Paul E. McKenney
Date: Wed Jan 06 2021 - 14:18:06 EST
On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate it with other hardware-oriented error messages. This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
>
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
> but some got hung up at various points beyond the entry point. My only theory
> was that they were trying to snoop caches from the dead core (or access some
> other resource held by the dead core) and so they hung too.
>
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.
Understood. The system is probably not in the best shape if this code
is ever executed, after all. ;-)
So how about something like this?
pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
Easy enough if so!
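To make the idea concrete, here is a minimal userspace sketch of the bookkeeping, not the kernel code itself. It assumes a plain uint32_t in place of the cpumask_t, a hypothetical NR_CPUS of 16, and two hypothetical helpers, holdout_cpus() and format_cpu_list(). The latter mimics the range-list output that %*pbl produces for a bitmap.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define NR_CPUS 16	/* hypothetical CPU count for this sketch */

/* Holdouts: CPUs that are online but never marked themselves as entered. */
static uint32_t holdout_cpus(uint32_t online, uint32_t entered)
{
	return online & ~entered;
}

/* Format the set bits of mask as a range list such as "2-3,7". */
static size_t format_cpu_list(uint32_t mask, char *buf, size_t len)
{
	size_t off = 0;
	int cpu = 0;

	if (len)
		buf[0] = '\0';
	while (cpu < NR_CPUS) {
		int start, n;

		if (!(mask & (1u << cpu))) {
			cpu++;
			continue;
		}
		/* Extend the run while the next CPU is also set. */
		start = cpu;
		while (cpu + 1 < NR_CPUS && (mask & (1u << (cpu + 1))))
			cpu++;
		if (start == cpu)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", start, cpu);
		if (n < 0 || (size_t)n >= len - off)
			break;	/* output truncated, stop here */
		off += (size_t)n;
		cpu++;
	}
	return off;
}
```

For example, with online = 0xff and entered = 0x73, holdout_cpus() returns 0x8c and format_cpu_list() yields "2-3,7". Per Tony's observation, some of the CPUs in such a list may be collateral damage rather than the original culprits.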
> If I get time in the next day or two, I'll run my old test against your code to
> see what happens.
Thank you very much in advance!
For my own testing, is this still the right thing to use?
https://github.com/andikleen/mce-inject
Thanx, Paul