Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
From: Naoya Horiguchi
Date: Thu Apr 09 2015 - 20:56:29 EST
On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > account for that, no?
> > >
> > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > as they're idle and not executing code.
> > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > that we might see while trying to run kdump - and remember that all machine checks are currently
> > broadcast, so all cpus, whether online or offline, will see them:
> > 1) Fatal
> > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > to see what happened because we won't have any synchronization failed messages from the offline
> > cpus.
> But this should not be a problem if the kdump path keeps cpu_online_mask
> up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> so. Those should clear cpu_online_mask and then mce_start() will work
> fine on the crashing CPU.
> IMHO, of course.

Sorry, I misread you. If the shootdown path clears cpu_online_mask (which it
does not do yet), raising tolerance should work without the timeout message.
So I think you are right.
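
To make the accounting concrete, here is a rough userspace model of the idea
(not kernel code). It assumes the rendezvous waits for as many CPUs as the
online mask counts, and that each shot-down CPU clears its own bit. All
model_* names are made up for illustration, and the mask is limited to 64 CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, CPU numbers 0..63 only. */
typedef uint64_t model_cpumask_t;

/* Count set bits, standing in for num_online_cpus() in this model. */
static inline unsigned int model_num_online_cpus(model_cpumask_t online)
{
	unsigned int n = 0;

	for (; online; online &= online - 1)	/* clear lowest set bit */
		n++;
	return n;
}

/* What a shot-down CPU would do under the proposal: drop its own bit. */
static inline model_cpumask_t model_shootdown_cpu(model_cpumask_t online,
						  unsigned int cpu)
{
	return online & ~((model_cpumask_t)1 << cpu);
}

/*
 * Assumed rendezvous rule: it completes only once at least as many CPUs
 * have checked in as the mask says are online. Otherwise it times out.
 */
static inline bool model_rendezvous_completes(model_cpumask_t online,
					      unsigned int checked_in)
{
	return checked_in >= model_num_online_cpus(online);
}
```

With a stale mask of four CPUs and only the crashing CPU checking in, the
model never completes the rendezvous. After CPUs 1-3 clear their bits, a
single check-in is enough.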
> > 2) Execution path recoverable (SRAR in SDM parlance).
> > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > 3) Non-execution path recoverable (SRAO in SDM)
> > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > of execution of the current context.
> Those could be simply made to go to dmesg during kdump, i.e. decouple
> any MCE consumers. And we do that now anyway, i.e. on a box without mcelog or
> some other RAS daemon running.
> So we could reuse the normal handler - we just need to do some tweaking
> first... AFAICT, of course. I believe in that endeavor, the devil will
> be in the detail.
OK, I'll try this approach with updating cpu_online_mask.
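
For reference, the policy discussed above for the three classes could be
summarized roughly as below. This is only a sketch of the decision table as I
understand it from this thread, not the real severity grading code; the enum
and function names are hypothetical.

```c
/* Hypothetical classification of a machine check seen while kdump runs. */
enum kdump_mce_class {
	MCE_CLASS_FATAL,	/* 1) fatal */
	MCE_CLASS_SRAR,		/* 2) execution path recoverable (action required) */
	MCE_CLASS_SRAO,		/* 3) non-execution path recoverable (action optional) */
};

enum kdump_mce_action {
	KDUMP_MCE_PANIC,		/* crash, the dump is lost */
	KDUMP_MCE_LOG_AND_CONTINUE,	/* print to dmesg, keep dumping */
};

/*
 * Decision table from the discussion: kdump runs entirely in ring 0, so
 * SRAR cannot be recovered and is treated like a fatal error. SRAO does
 * not affect the current context, so it is only logged.
 */
static inline enum kdump_mce_action kdump_mce_policy(enum kdump_mce_class cls)
{
	switch (cls) {
	case MCE_CLASS_SRAO:
		return KDUMP_MCE_LOG_AND_CONTINUE;
	case MCE_CLASS_SRAR:
	case MCE_CLASS_FATAL:
	default:
		return KDUMP_MCE_PANIC;
	}
}
```

Unknown classes fall through to panic here, which seems the conservative
choice for a sketch.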