Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
From: Baoquan He
Date: Tue Apr 28 2015 - 04:42:40 EST
On 04/10/15 at 12:49am, Naoya Horiguchi wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > >
> > > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus, whether online or offline, will see them:
> > >
> > > 1) Fatal
> > > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> >
> > But this should not be a problem if the kdump path keeps cpu_online_mask
> > up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> >
> > IMHO, of course.
>
> Sorry, I misread you. With clearing cpu_online_mask in shootdown (not done
> yet), raising tolerance should work without a timeout message.
> So I think you are right.
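
To make the cpu_online_mask point above concrete, here is a rough
user-space sketch (not kernel code). It assumes, as stated earlier in
the thread, that mce_start() sizes its rendezvous from num_online_cpus().
All model_* names are hypothetical stand-ins, and treating the crashing
CPU as the only one that calls in is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_online_mask: bit N set means CPU N is online. */
typedef uint64_t model_cpumask_t;

/* Count set bits, playing the role of num_online_cpus() in this model. */
static int model_num_online_cpus(model_cpumask_t mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Model of the proposed shootdown step: keep only the crashing CPU online. */
static model_cpumask_t model_shootdown(model_cpumask_t mask, int crashing_cpu)
{
	assert(crashing_cpu >= 0 && crashing_cpu < 64);
	return mask & ((model_cpumask_t)1 << crashing_cpu);
}

/*
 * The rendezvous finishes without a timeout only when the number of CPUs
 * that called in equals the number of CPUs the mask claims are online.
 */
static bool model_rendezvous_completes(model_cpumask_t online, int callin)
{
	return callin == model_num_online_cpus(online);
}
```

With a stale mask of 0xF and one CPU calling in, the model reports that
the rendezvous would not complete. After model_shootdown(0xF, 2) the
same single call-in is enough.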
Hi Naoya,
Thanks for the great effort you have put into this issue.
I am trying to catch up, and have read the mails in this thread.
Please also add me to the CC list when you post a new version. I would like to
review it.
Thanks
Baoquan
>
> > > 2) Execution path recoverable (SRAR in SDM parlance).
> > > Also going to be fatal (kdump runs entirely in ring 0, and we can't recover from errors in ring 0). Cleaner
> > > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > >
> > > 3) Non-execution path recoverable (SRAO in SDM)
> > > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > > of execution of the current context.
> >
> > Those could simply be made to go to dmesg during kdump, i.e. decouple
> > any MCE consumers. And we do that now anyway, i.e. on a box without mcelog or
> > some other ras daemon running.
> >
> > So we could reuse the normal handler - we just need to do some tweaking
> > first... AFAICT, of course. I believe in that endeavor, the devil will
> > be in the detail.
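
As a summary of the three classes discussed above, the policy during
kdump could be sketched roughly as below. This is only an illustration
of the discussion, not the patch itself: the enum and function names are
hypothetical, and the real severity grading in the kernel is more
involved.

```c
/* Hypothetical labels for the three classes of machine check in this thread. */
enum model_mce_class {
	MODEL_MCE_FATAL,	/* 1) fatal */
	MODEL_MCE_SRAR,		/* 2) execution path recoverable */
	MODEL_MCE_SRAO,		/* 3) non-execution path recoverable */
};

/* Hypothetical outcomes while the kdump kernel is running. */
enum model_kdump_action {
	MODEL_KDUMP_PANIC,		/* crash, the dump is lost */
	MODEL_KDUMP_LOG_AND_CONTINUE,	/* report to dmesg, keep dumping */
};

static enum model_kdump_action model_kdump_mce_action(enum model_mce_class cls)
{
	switch (cls) {
	case MODEL_MCE_SRAO:
		/* "Action optional": choose to take no action. */
		return MODEL_KDUMP_LOG_AND_CONTINUE;
	case MODEL_MCE_SRAR:
		/* kdump runs in ring 0, so this is treated as fatal too. */
	case MODEL_MCE_FATAL:
	default:
		return MODEL_KDUMP_PANIC;
	}
}
```

Only the SRAO case lets the dump continue in this sketch. Recovering
from SRAR by skipping a page is left out, since it was only raised as a
possible future improvement.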
>
> OK, I'll try this approach with updating cpu_online_mask.
>
> Thanks,
> Naoya Horiguchi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/