Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

From: Naoya Horiguchi
Date: Fri Apr 10 2015 - 00:16:47 EST


On Fri, Apr 10, 2015 at 12:49:33AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > >
> > > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > >
> > > 1) Fatal
> > > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> >
> > But this should not be a problem if kdump path keeps cpu_online_mask
> > uptodate. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> >
> > IMHO, of course.
>
> Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
> yet,) raising tolerance should work without timeout message.
> So I think you are right.

... wait, wouldn't changing cpu_online_mask confuse admins who later
analyze the kdump, especially when the panic was caused by a
CPU-related problem?

Similarly, changing the tolerant value loses the original value,
although this is unlikely to be a problem. But if we do change it, would
it be better to set an upper bit for the kdump override and leave the
lowest 2 bits holding the original value?
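
A rough sketch of that encoding, written as plain userspace C rather than
as a patch against mce.c. The macro and helper names below are
hypothetical, and treating the override as "level 3" is an assumption
about what raising tolerance would mean here; only the idea of keeping
the original value in the lowest 2 bits comes from the discussion above.

```c
#include <assert.h>

/*
 * Hypothetical layout (not taken from the patch):
 *   bits 0-1: the tolerant level the admin configured (0..3)
 *   bit 2   : set only while kexec/kdump is in progress
 */
#define TOLERANT_LEVEL_MASK	0x3u
#define TOLERANT_KDUMP_FLAG	(1u << 2)

/* Mark kdump as active without touching the configured level. */
static unsigned int tolerant_enter_kdump(unsigned int tolerant)
{
	return tolerant | TOLERANT_KDUMP_FLAG;
}

/* What the admin originally set; still readable from the dump. */
static unsigned int tolerant_original(unsigned int tolerant)
{
	return tolerant & TOLERANT_LEVEL_MASK;
}

/* The level the handler acts on (assumed: 3 while the flag is set). */
static unsigned int tolerant_effective(unsigned int tolerant)
{
	if (tolerant & TOLERANT_KDUMP_FLAG)
		return 3;
	return tolerant & TOLERANT_LEVEL_MASK;
}

/* Usage: the override applies, and the original level survives. */
static void tolerant_demo(void)
{
	unsigned int t = tolerant_enter_kdump(1);

	assert(tolerant_effective(t) == 3);
	assert(tolerant_original(t) == 1);
}
```

With this layout, someone reading the variable out of the vmcore can mask
with 0x3 to see what was configured before the crash, and test bit 2 to
see whether the kdump override was in effect.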