Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump

From: Naoya Horiguchi
Date: Thu Apr 09 2015 - 02:58:55 EST


On Thu, Apr 09, 2015 at 08:13:46AM +0200, Borislav Petkov wrote:
> On Tue, Apr 07, 2015 at 08:02:18AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. But the MCE handler is still enabled after that,
> > so if MCE happens and broadcasts over the CPUs after the main thread starts
> > the 2nd kernel (which might not initialize its MCE handler yet, or might decide
> > not to enable it,) MCE handler runs only on the other CPUs (not on the main
> > thread,) leading to kernel panic with MCE synchronization. The user-visible
> > effect of this bug is kdump failure.
> >
> > Our standard MCE handler do_machine_check() assumes some about system's
> > status and it's hard to alter it to cover kexec/kdump context, so let's add
> > another kdump-specific one and switch to it.
> >
> > Note that this problem exists since current MCE handler was implemented in
> > 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> > a timeout") made it more visible by changing the default behavior of the
> > synchronization timeout from "ignore" to "panic".
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> > ---
> > ChangeLog: v7 -> v8
>
> Ok, I cleaned it up a bit and straightened some comments, see below. But
> other than minor issues I don't see anything wrong with this patch so
> far.
>
> Btw, Ingo had some reservations about this. Ingo?
>
> Also Tony hasn't Ack/Naked this...
>
> The more important question for me is how are you testing this? I did
> try injecting some MCEs with qemu while the second kernel is booting but
> that caused a triple-fault or the guest froze completely.

Yes, I did see it at fisrt, so I did two tweaks for the testing:

1) to fix qemu code. I think that current mce injection code of qemu is buggy,
because when we try to inject MCE in broadcast mode, all injections other than
the first one are done with MCG_STATUS_MCIP (see cpu_x86_inject_mce()@target-i386/helper.c.)
It looks to me a bug because this means that every (broadcast mode) MCE injection
causes triplet-fault, which seems not mimicking the real HW behavior.

2) to insert the delay (for a few seconds) into kdump_nmi_callback() before
disable_local_APIC(). This is because MCE interrupt is delivered to CPUs in
different manners in qemu and in bare metal. Bare metals do respond to MCE
interrupts after disable_local_APIC(), but qemu not.


Unfortunately our testing (~15000 times kdump/reboot cycles) with the debug
kernel on bare metals didn't reproduce the problem yet, but I believe that
the above testing on qemu should hit a target.

Thanks,
Naoya Horiguchi

>
> Hmmm.
>
> Thanks.
>
> ---
> From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Date: Tue, 7 Apr 2015 08:02:18 +0000
> Subject: [PATCH] x86/mce, crash: Switch MCE handler for kexec/kdump
>
> kexec disables (or "shoots down") all CPUs other than the crashing
> CPU before entering the 2nd kernel. However, MCA is still enabled so
> if an MCE happens and broadcasts to the CPUs after the main thread
> starts the 2nd kernel (which might not initialize its MCE handler yet,
> or might decide not to enable it) the MCE handler runs only on the
> other CPUs (not on the main thread) leading to kernel panic during MCE
> synchronization. The user-visible effect of this bug is a kdump failure.
>
> Our standard MCE handler do_machine_check() assumes a bunch of things
> about system's status and it's hard to alter it to cover kexec/kdump
> context, so add another, kdump-specific one and switch to it.
>
> Note that this problem exists since current MCE handler was implemented
> in 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core
> has reached a timeout") made it more visible by changing the default
> behavior of the synchronization timeout from "ignore" to "panic".
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Cc: "Tony Luck" <tony.luck@xxxxxxxxx>
> Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
> Cc: Vivek Goyal <vgoyal@xxxxxxxxxx>
> Cc: "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>
> Cc: Junichi Nomura <j-nomura@xxxxxxxxxxxxx>
> Cc: Kiyoshi Ueda <k-ueda@xxxxxxxxxxxxx>
> Link: http://lkml.kernel.org/r/1425373306-26187-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx
> Signed-off-by: Borislav Petkov <bp@xxxxxxx>
> ---
> arch/x86/include/asm/mce.h | 14 +++++
> arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
> arch/x86/kernel/crash.c | 89 +++++++++++++++++++++++++++++++
> 3 files changed, 103 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 1f5a86d518db..a88a74e19d14 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -255,4 +255,18 @@ struct cper_sec_mem_err;
> extern void apei_mce_report_mem_error(int corrected,
> struct cper_sec_mem_err *mem_err);
>
> +enum severity_level {
> + MCE_NO_SEVERITY,
> + MCE_DEFERRED_SEVERITY,
> + MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> + MCE_KEEP_SEVERITY,
> + MCE_SOME_SEVERITY,
> + MCE_AO_SEVERITY,
> + MCE_UC_SEVERITY,
> + MCE_AR_SEVERITY,
> + MCE_PANIC_SEVERITY,
> +};
> +
> +extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
> +
> #endif /* _ASM_X86_MCE_H */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index fe32074b865b..47d1e5218fb5 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -1,18 +1,6 @@
> #include <linux/device.h>
> #include <asm/mce.h>
>
> -enum severity_level {
> - MCE_NO_SEVERITY,
> - MCE_DEFERRED_SEVERITY,
> - MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
> - MCE_KEEP_SEVERITY,
> - MCE_SOME_SEVERITY,
> - MCE_AO_SEVERITY,
> - MCE_UC_SEVERITY,
> - MCE_AR_SEVERITY,
> - MCE_PANIC_SEVERITY,
> -};
> -
> #define ATTR_LEN 16
> #define INITIAL_CHECK_INTERVAL 5 * 60 /* 5 minutes */
>
> @@ -24,7 +12,6 @@ struct mce_bank {
> char attrname[ATTR_LEN]; /* attribute name */
> };
>
> -extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
> struct dentry *mce_get_debugfs_dir(void);
>
> extern struct mce_bank *mce_banks;
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index aceb2f90c716..f4948a8d5fa6 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -34,6 +34,7 @@
> #include <asm/cpu.h>
> #include <asm/reboot.h>
> #include <asm/virtext.h>
> +#include <asm/mce.h>
>
> /* Alignment required for elf header segment */
> #define ELF_CORE_HEADER_ALIGN 4096
> @@ -76,6 +77,92 @@ struct crash_memmap_data {
>
> int in_crash_kexec;
>
> +#ifdef CONFIG_X86_MCE
> +static int kdump_cpu;
> +
> +/*
> + * kdump-specific machine check handler
> + *
> + * When kexec/kdump is running, what the MCE handler is expected to do
> + * changes depending on whether the CPU is running the main thread or not.
> + *
> + * The crashing CPU, controlling the whole system exclusively, should
> + * try to complete a dump as hard as possible even if an MCE happens
> + * concurrently because some types of non-fatal MCEs (for example,
> + * uncorrected errors like SRAO) don't necessarily impair kdump's
> + * reliability (consider that an MCE can hit another CPU, in which
> + * case corrupted data might not get consumed). If an MCE critically
> + * breaks kdump operation, we are unlucky and then have to accept the HW
> + * failure. In that case, we hope that at least a dying message reaches
> + * the admins.
> + *
> + * The other CPUs are supposed to be quiet during kexec/kdump, so after
> + * the crashing CPU shot them down, they should not do anything except
> + * clearing MCG_STATUS (they need to, otherwise the system gets reset,
> + * which is undesirable either). Note that this is also true after the
> + * crashing CPU enters the 2nd kernel.
> + */
> +static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
> +{
> + char *msg = NULL, *nmsg = NULL;
> + int i, severity, worst = 0;
> + struct mce m = {};
> +
> + if (mca_cfg.disabled)
> + return;
> +
> + if (!mca_cfg.banks)
> + goto out;
> +
> + if (kdump_cpu != smp_processor_id())
> + goto clear_mcg_status;
> +
> + rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> + if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> + m.cs = regs->cs;
> +
> + for (i = 0; i < mca_cfg.banks; i++) {
> + rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
> + severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
> + if (severity > worst) {
> + worst = severity;
> + msg = nmsg;
> + }
> + }
> +
> + if (worst >= MCE_UC_SEVERITY)
> + panic("kdump: Fatal machine check: %s", msg);
> +
> + pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
> +
> +clear_mcg_status:
> + wrmsrl(MSR_IA32_MCG_STATUS, 0);
> +out:
> + sync_core();
> +}
> +
> +/*
> + * Switch the MCE handler to kdump-specific one
> + *
> + * Standard MCE handler do_machine_check() is not designed for kexec/kdump
> + * context, where we can't expect MCE's recovering and logging to work fine
> + * because the kernel might be unstable (all CPUs except one must be idle).
> + *
> + * In such situations, getting a kernel dump is more important than handling
> + * MCEs because what the users are really interested in is to find out what
> + * caused the crash.
> + *
> + * So let's switch MCE handler to the one suitable for kexec/kdump situation.
> + */
> +void kdump_setup_mce(void)
> +{
> + kdump_cpu = smp_processor_id();
> + machine_check_vector = machine_check_under_kdump;
> +}
> +#else
> +static inline void kdump_setup_mce(void) {}
> +#endif
> +
> /*
> * This is used to VMCLEAR all VMCSs loaded on the
> * processor. And when loading kvm_intel module, the
> @@ -157,6 +244,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> /* The kernel is broken so disable interrupts */
> local_irq_disable();
>
> + kdump_setup_mce();
> +
> kdump_nmi_shootdown_cpus();
>
> /*
> --
> 2.3.3
>
>
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --
> --
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/