[PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump

From: Naoya Horiguchi
Date: Thu Mar 05 2015 - 01:48:53 EST


On Thu, Mar 05, 2015 at 01:24:47AM +0000, Horiguchi Naoya(åå çä) wrote:
> On Wed, Mar 04, 2015 at 11:12:33PM +0000, Luck, Tony wrote:
> > > - fixed AR and UC order in enum severity_level because UC is severer than AR
> > > by definition. Current code is not affected by this wrong order by chance.
> >
> > AR and AO are both UC errors - that happen also to be recoverable.
>
> Maybe just saying "UC" might be confusing, meaning "UC bit is set" or "type of
> error (defined in Table SDM's Table 15-6 in vol.3B) is 'Uncorrected Error'
> (clearly separate from SRAR/SRAO)". You seem to mean the former, and I meant
> the latter. So I should write the description more accurately like "UC=1,PCC=0"
> error and "UC=1,PCC=1" error.
>
> > Are you really sure
> > about this re-order not affecting existing code?
>
> Sorry, I thought I checked that but I missed one, so let me check again now.
> I checked all referencing site of MCE_*_SEVERITY. Most of them are using '=='
> to compare the severity, where the re-order doesn't affect them.
> Some are using inequalities:
>
> - around ./arch/x86/kernel/cpu/mcheck/mce.c:720,
>
> if (mce_severity(m, mca_cfg.tolerant, msg, true) >=
> MCE_PANIC_SEVERITY)
>
> , the re-order doesn't affect.
>
> - ./arch/x86/kernel/cpu/mcheck/mce.c:819:
>
> if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
>
> , the re-order doesn't affect.
>
> - ./arch/x86/kernel/cpu/mcheck/mce.c:832:
>
> if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
>
> , ditto.
>
> - ./arch/x86/kernel/cpu/mcheck/mce.c:1196:
>
> no_way_out = worst >= MCE_PANIC_SEVERITY;
>
> , ditto.
>
> - ./arch/x86/kernel/cpu/mcheck/mce-severity.c:211:
>
> if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
>
> , the re-order should change to s->sev >= MCE_AR_SEVERITY to keep the
> kernel behavior.
>
> So I fixed the last part to be included in the re-order patch.
>
> > You might well be right, but as every one
> > else has pointed out mce_severity() is full of odd subtleties that catch people out.
>
> I agree that this one big table is hard to maintain and bug-prone. One problem
> is that it has too many fields to check so the parameter space is huge. I think
> some field are checked only once, so separating it out makes table more simple
> and readable.
>
> > Is the "UC" entry at the end of the severities[] table just a catch-all for things that made it
> > past all the other entries? Does it ever really get used?
>
> I read through the severity check table and it seems that all UC=1 case
> are already considered by the above entries, so it seems not used.
>
> > What was the test case that made you promote UC above AR?
>
> I thought of "Action required but unaffected thread is continuable" case
> on kexec kernel, but I'm not sure that such a case really happens.
> My motivation on the promotion was mainly from what SDM says rather than
> the real testcase.
>
> > This absolutely should not be buried in the middle of your other patch - it needs to
> > be separate with a much more than two lines of commit description.
>
> OK, I might not include this part in this series in later post, but if I do,
> I'll separete it out.

Although I don't think we finished the discussion over v4, I updated the
patch to reflect your feedbacks so far.

Of cource, if you still have some comment about the previous version, I'm
glad to hear that.

Thanks,
Naoya Horiguchi
----
From bf4ce58b8296774a69e5436f43e8dc9eed41a829 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Date: Thu, 5 Mar 2015 15:28:23 +0900
Subject: [PATCH v5] x86: mce: kexec: switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than a crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not initialize MCE device yet, or might decide
not to enable it,) MCE handler runs only on the other CPUs (not on the main
thread,) leading to kernel panic with MCE synchronization. The user-visible
effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() assumes some about system's
status and it's hard to alter it to cover kexec/kdump context, so let's add
another kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented in
2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
a timeout") made it more visible by changing the default behavior of the
synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> [2.6.32+]
---
ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is severer than AR
by definition. Current code is not affected by this wrong order by chance.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
global flag to ignore MCE events.
- fixed the description of the problem
---
arch/x86/include/asm/mce.h | 19 +++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
arch/x86/kernel/cpu/mcheck/mce.c | 10 ++--
arch/x86/kernel/crash.c | 87 +++++++++++++++++++++++++++++++
4 files changed, 111 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..fbb385611a14 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -248,4 +248,23 @@ struct cper_sec_mem_err;
extern void apei_mce_report_mem_error(int corrected,
struct cper_sec_mem_err *mem_err);

+enum severity_level {
+ MCE_NO_SEVERITY,
+ MCE_DEFERRED_SEVERITY,
+ MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+ MCE_KEEP_SEVERITY,
+ MCE_SOME_SEVERITY,
+ MCE_AO_SEVERITY,
+ MCE_UC_SEVERITY,
+ MCE_AR_SEVERITY,
+ MCE_PANIC_SEVERITY,
+};
+
+int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
+
+extern void mce_panic(char *msg, struct mce *final, char *exp);
+extern u64 mce_rdmsrl(u32 msr);
+extern void mce_wrmsrl(u32 msr, u64 v);
+extern inline void mce_gather_info(struct mce *m, struct pt_regs *regs);
+extern void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
#endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..909ee3ed95dd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
#include <linux/device.h>
#include <asm/mce.h>

-enum severity_level {
- MCE_NO_SEVERITY,
- MCE_DEFERRED_SEVERITY,
- MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
- MCE_KEEP_SEVERITY,
- MCE_SOME_SEVERITY,
- MCE_AO_SEVERITY,
- MCE_UC_SEVERITY,
- MCE_AR_SEVERITY,
- MCE_PANIC_SEVERITY,
-};
-
#define ATTR_LEN 16

/* One object for each MCE bank, shared by all CPUs */
@@ -23,7 +11,6 @@ struct mce_bank {
char attrname[ATTR_LEN]; /* attribute name */
};

-int mce_severity(struct mce *a, int tolerant, char **msg, bool is_excp);
struct dentry *mce_get_debugfs_dir(void);

extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3112b79ace8e..60a307a4b507 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -109,7 +109,7 @@ mce_banks_t mce_banks_ce_disabled;

static DEFINE_PER_CPU(struct work_struct, mce_work);

-static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
+void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);

/*
* CPU/chipset specific EDAC code can register a notifier call here to print
@@ -311,7 +311,7 @@ static void wait_for_panic(void)
panic("Panicing machine check CPU died");
}

-static void mce_panic(char *msg, struct mce *final, char *exp)
+void mce_panic(char *msg, struct mce *final, char *exp)
{
int i, apei_err = 0;

@@ -391,7 +391,7 @@ static int msr_to_offset(u32 msr)
}

/* MSR access wrappers used for error injection */
-static u64 mce_rdmsrl(u32 msr)
+u64 mce_rdmsrl(u32 msr)
{
u64 v;

@@ -416,7 +416,7 @@ static u64 mce_rdmsrl(u32 msr)
return v;
}

-static void mce_wrmsrl(u32 msr, u64 v)
+void mce_wrmsrl(u32 msr, u64 v)
{
if (__this_cpu_read(injectm.finished)) {
int offset = msr_to_offset(msr);
@@ -433,7 +433,7 @@ static void mce_wrmsrl(u32 msr, u64 v)
* check into our "mce" struct so that we can use it later to assess
* the severity of the problem as we read per-bank specific details.
*/
-static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
+inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
{
mce_setup(m);

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6f3baedcb6f6..8586dfff0e52 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/mce.h>

/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -76,6 +77,90 @@ struct crash_memmap_data {

int in_crash_kexec;

+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliablility of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enter the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+ struct mce m, final;
+ char *msg = NULL;
+ char *nmsg = NULL;
+ int i;
+ int worst = 0;
+ int severity;
+
+ if (!mca_cfg.banks)
+ goto out;
+ if (kdump_cpu != smp_processor_id())
+ goto clear_mcg_status;
+
+ mce_gather_info(&m, regs);
+ for (i = 0; i < mca_cfg.banks; i++) {
+ m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+ if (m.status & MCI_STATUS_VAL)
+ if (quirk_no_way_out)
+ quirk_no_way_out(i, &m, regs);
+ severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+ if (severity > worst) {
+ final = m;
+ worst = severity;
+ msg = nmsg;
+ }
+ }
+
+ if (worst >= MCE_UC_SEVERITY)
+ mce_panic("unignorable machine check under kdumping", &final, msg);
+
+ pr_emerg("MCE triggered under kdumping. If you are lucky enough, you will have a kdump. Otherwise, this is a dying message.\n");
+
+clear_mcg_status:
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+ sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be quiet
+ * as possible.)
+ *
+ * Moreover, in such situation getting kdump is prior to doing something for
+ * MCEs because what the users are really interested in is to find what caused
+ * the crashing, not what caused the crashing to fail.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void switch_mce_handler_for_kdump(void)
+{
+ kdump_cpu = smp_processor_id();
+ machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void switch_mce_handler_for_kdump(void) {}
+#endif
+
/*
* This is used to VMCLEAR all VMCSs loaded on the
* processor. And when loading kvm_intel module, the
@@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();

+ switch_mce_handler_for_kdump();
+
kdump_nmi_shootdown_cpus();

/*
--
1.9.3N‹§²æ¸›yú²X¬¶ÇvØ–)Þ{.nlj·¥Š{±‘êX§¶›¡Ü}©ž²ÆzÚj:+v‰¨¾«‘êZ+€Êzf£¢·hšˆ§~†­†Ûÿû®w¥¢¸?™¨è&¢)ßf”ùy§m…á«a¶Úÿ 0¶ìå