Re: [PATCH v2 1/2] x86: mce: kexec: turn off MCE in kexec

From: Naoya Horiguchi
Date: Sun Mar 01 2015 - 21:32:45 EST


On Fri, Feb 27, 2015 at 06:27:16PM +0000, Luck, Tony wrote:
> > When CR4.MCE=0b and an MCE happens, it will shut down the system, at
> > least on Intel, according to Tony
>
> I checked with the architects ... and I was right. If you clear CR4.MCE you'll still
> see the machine check - and you'll pull the big system reset lever.

Thank you for the confirmation.

> If you think the other cpus can survive the reset - then the right thing to do is to
> have any offline cpus that show up in the machine check handler just clear MCG_STATUS
> and return:
>
> do_machine_check()
> {
>     /* offline cpus may show up for the party - but don't need to do anything here - send them back home */
>     if (!cpu_online(smp_processor_id())) {
>         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>         return;
>     }

It seems that the kdump shootdown doesn't clear the shot-down CPUs from the
online cpumask, so this cpu_online() check doesn't help with this
(kdump-specific) problem.
But I think checking the number of online CPUs for MCE synchronization is
generally correct in other contexts (such as an MCE on a system with
hot-removed CPUs?), so it's worth doing in a separate patch.
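
To illustrate the point, here is a minimal userspace sketch, not kernel code.
All names in it (struct mce_model, skip_by_online_mask(), skip_by_crash_state())
are hypothetical, and the second check is only an assumed alternative that
tracks which CPU is running the kdump path; it is not what this patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state involved; these are not kernel APIs. */
struct mce_model {
	uint64_t online_mask;	/* bit n set: CPU n is still marked online (n < 64) */
	int crashing_cpu;	/* CPU running the kdump path, or -1 if no crash */
};

/* The quoted suggestion: skip MCE handling only on CPUs marked offline. */
static bool skip_by_online_mask(const struct mce_model *m, int cpu)
{
	return !(m->online_mask & (UINT64_C(1) << cpu));
}

/*
 * Assumed alternative: also skip CPUs that were parked by the crash
 * shootdown, i.e. every CPU other than the one running the kdump path.
 */
static bool skip_by_crash_state(const struct mce_model *m, int cpu)
{
	if (skip_by_online_mask(m, cpu))
		return true;
	return m->crashing_cpu != -1 && m->crashing_cpu != cpu;
}
```

With four CPUs still marked online and CPU 0 running kdump, the first check
does not skip CPU 2 (it still looks online), while the second one does.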

> If we are crashing because of a machine check - I wonder how useful it is to run kdump(). There are a very
> small set of ways that you can induce a machine check from program action - normally the problem is that
> something bad happened in the h/w ... a kdump will just fill your disk and waste your time looking at what
> the s/w was doing when the machine check happened.

I don't think every MCE makes the server inoperative. Uncorrected but
recoverable errors (including SRAO and SRAR) are a good example.

Also, please note that this patch targets an MCE that arrives while the kernel
is already running kdump code (so the crash happened *not* because of the MCE).
In that case, we can expect kdump to work fine if the MCE hits one of the
"kdump shootdown" CPUs, which are just spinning in a cpu_relax() loop, because
the 2nd kernel's CPU isn't affected by the MCE (even if the CPU failure is a
fatal one).

If a fatal MCE happens on the CPU running kdump code, there's no reason to
try harder to get a kdump, as you pointed out. In that case, all we can do is
print a message like "kdump failed due to MCE" and reset the system.
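
The two cases above could be summarized roughly as follows. This is a
userspace sketch with hypothetical names (kdump_mce_decide(),
kdump_mce_message(), and the enum), not the code in this patch. The
"continue" case for a non-fatal MCE on the dumping CPU is an assumption; it
is not discussed above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for an MCE that arrives while kdump is running. */
enum kdump_mce_action {
	KDUMP_MCE_IGNORE,	/* hit a shot-down CPU: kdump can go on */
	KDUMP_MCE_CONTINUE,	/* assumption: non-fatal MCE on the kdump CPU */
	KDUMP_MCE_RESET		/* fatal MCE on the kdump CPU: give up */
};

/* Decide what to do based on which CPU took the MCE and how severe it is. */
static enum kdump_mce_action kdump_mce_decide(int cpu, int kdump_cpu, bool fatal)
{
	if (cpu != kdump_cpu)
		return KDUMP_MCE_IGNORE;	/* even if the failure is fatal */
	return fatal ? KDUMP_MCE_RESET : KDUMP_MCE_CONTINUE;
}

/* Message to print before the reset; NULL when no reset is needed. */
static const char *kdump_mce_message(enum kdump_mce_action action)
{
	return action == KDUMP_MCE_RESET ? "kdump failed due to MCE" : NULL;
}
```

For example, kdump_mce_decide(3, 0, true) yields KDUMP_MCE_IGNORE when CPU 0
is the one running kdump, while kdump_mce_decide(0, 0, true) yields
KDUMP_MCE_RESET.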

Thanks,
Naoya Horiguchi