[PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.

From: Calvin Owens
Date: Thu Dec 04 2014 - 21:31:17 EST

The Intel CMCI interrupt handler calls mce_timer_kick() to force more
frequent polling for MCE events when a CMCI storm occurs and CMCI
interrupts are subsequently disabled.

If a CMCI interrupt storm happens to be detected while the timer
interrupt is executing timer functions, mce_timer_kick() can race with
mce_timer_fn(), which results in a double-add and the following BUG:

#0 [ffff88047fda3ad0] machine_kexec at ffffffff8102bdf5
#1 [ffff88047fda3b20] crash_kexec at ffffffff8109e788
#2 [ffff88047fda3bf0] oops_end at ffffffff815f20e8
#3 [ffff88047fda3c20] die at ffffffff81005c08
#4 [ffff88047fda3c50] do_trap at ffffffff815f192b
#5 [ffff88047fda3cb0] do_invalid_op at ffffffff81002f42
#6 [ffff88047fda3d60] invalid_op at ffffffff815fa668
[exception RIP: add_timer_on+234]
RIP: ffffffff8104d05a RSP: ffff88047fda3e18 RFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88047fdacbc0 RCX: 000000001fbee3ff
RDX: ffff88047fda0000 RSI: 000000000000001d RDI: ffff88047fdacbc0
RBP: ffff88047fda3e58 R8: 0000000000000000 R9: ffffffff81aa0940
R10: 0720072007200720 R11: 0720072007200765 R12: ffff880474a6c000
R13: 0000000000000101 R14: 000000000000001d R15: ffff88047fdacbc0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff88047fda3e60] mce_timer_fn at ffffffff8101f524
#8 [ffff88047fda3e80] call_timer_fn at ffffffff8104b4fa
#9 [ffff88047fda3ec0] run_timer_softirq at ffffffff8104ce70
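
The ordering behind this BUG can be replayed in a small userspace model.
This is only a sketch: the toy_* names are hypothetical, the timer is
reduced to a "pending" flag, and the assumption is that add_timer_on()
refuses a timer that is already pending, which is where the trace above
ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct timer_list: only tracks whether it is queued. */
struct toy_timer {
	bool pending;
	unsigned long expires;
};

/*
 * Models add_timer_on(): returns false where the kernel would BUG on a
 * timer that is already pending.
 */
static bool toy_add_timer(struct toy_timer *t, unsigned long expires)
{
	if (t->pending)
		return false;	/* double-add */
	t->expires = expires;
	t->pending = true;
	return true;
}

/* Pre-patch mce_timer_kick() behaviour: add the timer if it is not pending. */
static bool toy_kick_old(struct toy_timer *t, unsigned long when)
{
	if (t->pending) {
		/* Plain '<' for brevity; the kernel uses time_before(). */
		if (when < t->expires)
			t->expires = when;
		return true;
	}
	return toy_add_timer(t, when);
}

/*
 * Replays the racy ordering: the timer has fired (so it is no longer
 * pending), the CMCI handler kicks it, then the timer function re-adds it.
 * Returns true if the final re-add is a double-add.
 */
static bool toy_replay_race(void)
{
	struct toy_timer t = { .pending = false, .expires = 0 };

	toy_kick_old(&t, 10);		/* CMCI interrupt adds the timer */
	assert(t.pending);		/* the kick queued it */
	return !toy_add_timer(&t, 100);	/* mce_timer_fn() adds it again */
}
```

In this model toy_replay_race() reports a double-add, which corresponds
to frames #7 through the exception RIP in the trace.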

The add_timer_on() in mce_timer_kick() is actually unnecessary: since the
timer is re-added by its handler function, the only case in which the
timer doesn't exist is when the CMCI interrupt calls mce_timer_kick() in
the interval between the timer firing and mce_timer_fn() actually being
executed. Thus, the timer work will be performed by mce_timer_fn() just
after the interrupt exits.

This patch removes the add_timer_on() from mce_timer_kick(), and disables
local interrupts during mce_timer_fn() so that mce_timer_fn() will
always pick up the timer interval value that mce_timer_kick() drops
in the PERCPU variable.

This means that the CMCI interrupt that hits the storm threshold will
call mce_timer_kick() either:

1) In the interval between the mce_timer firing and mce_timer_fn()
disabling local IRQs. In this case, mce_timer_fn() will
immediately execute after the CMCI handler exits, and will
use the interval loaded in the PERCPU variable from
mce_timer_kick() to calculate its next timer interval.

2) After mce_timer_fn() has done its work, in which case
the existing timer will be updated with the new interval if
it is before the existing one.
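
The two cases can be walked through with a simplified userspace model of
the patched logic. This is a sketch under stated assumptions: the toy_*
names are hypothetical, jiffies wrap-around is ignored, and the interval
adjustment that mce_timer_fn() performs before re-adding is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	bool pending;			/* models timer_pending(t) */
	unsigned long expires;		/* models t->expires */
	unsigned long next_interval;	/* models per-CPU mce_next_interval */
};

/* Patched kick: never adds the timer, only shortens what already exists. */
static void toy_kick(struct toy_state *s, unsigned long now,
		     unsigned long interval)
{
	unsigned long when = now + interval;

	/* Plain '<' for brevity; the kernel uses time_before(). */
	if (s->pending && when < s->expires)
		s->expires = when;
	if (interval < s->next_interval)
		s->next_interval = interval;
}

/* Timer function: the only place that re-adds the timer. */
static void toy_timer_fn(struct toy_state *s, unsigned long now)
{
	assert(!s->pending);	/* a pending timer here would be a double-add */
	s->expires = now + s->next_interval;
	s->pending = true;
}
```

In case 1 the kick finds no pending timer, so it only stores the shorter
interval, and toy_timer_fn() then arms the timer with it. In case 2 the
kick finds the timer pending and pulls its expiry forward.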

Signed-off-by: Calvin Owens <calvinowens@xxxxxx>
---
arch/x86/kernel/cpu/mcheck/mce.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668ce..7074a90 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1286,7 +1286,7 @@ static int cmc_error_seen(void)
 static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long iv;
+	unsigned long iv, flags;
 	int notify;
 
 	WARN_ON(smp_processor_id() != data);
@@ -1301,6 +1301,9 @@ static void mce_timer_fn(unsigned long data)
 	 * Alert userspace if needed. If we logged an MCE, reduce the
 	 * polling interval, otherwise increase the polling interval.
 	 */
+
+	local_irq_save(flags);
+
 	iv = __this_cpu_read(mce_next_interval);
 	notify = mce_notify_irq();
 	notify |= cmc_error_seen();
@@ -1316,6 +1319,8 @@ static void mce_timer_fn(unsigned long data)
 		t->expires = jiffies + iv;
 		add_timer_on(t, smp_processor_id());
 	}
+
+	local_irq_restore(flags);
 }
 
 /*
@@ -1330,9 +1335,6 @@ void mce_timer_kick(unsigned long interval)
 	if (timer_pending(t)) {
 		if (time_before(when, t->expires))
 			mod_timer_pinned(t, when);
-	} else {
-		t->expires = round_jiffies(when);
-		add_timer_on(t, smp_processor_id());
 	}
 	if (interval < iv)
 		__this_cpu_write(mce_next_interval, interval);
