Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.

From: Havard Skinnemoen
Date: Wed Jul 09 2014 - 17:51:59 EST


On Wed, Jul 9, 2014 at 1:35 PM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> Havard Skinnemoen <hskinnemoen@xxxxxxxxxx> writes:
>
>> machine_check_poll() was modified to use spin_lock_irqsave independently
>> per bank when a valid MCE is found to prevent duplicated MCE reports by
>> the CMCI and polling methods. In the common case no MCE will be found,
>> so the lock is not acquired until a valid MCE is found. The status is
>> reread after the lock is acquired in case the MCE was already handled by
>> a different thread. A unique spinlock is used per bank number, so
>> contention should be mostly limited to non-shared banks.
>
> This doesn't make sense. Banks are either owned by CMCI or by poll,
> not by both. If you have true duplicates the bug must be somewhere else.

I don't think we got the description right here. I think the real
issue was machine check polls running on multiple CPUs that share
banks, with every CPU reporting the same MCEs. This is easy to
reproduce when booting with mce=no_cmci, since all CPUs will then
poll all banks, and there's AFAICT no good way to identify shared
banks without enabling CMCI.
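
To make the check / lock / recheck scheme from the patch description
concrete, here is a rough user-space model of it. This is only a
sketch, not the patch itself: NUM_BANKS, bank_status[], inject_error()
and poll_bank() are made-up names, the "hardware" is a plain array, and
a C11 atomic flag stands in for the kernel's spin_lock_irqsave().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS    4              /* hypothetical number of banks */
#define STATUS_VALID (1ULL << 63)   /* models the VAL bit of a bank status */

/* Simulated shared bank status registers (stand-in for MSR reads). */
static _Atomic uint64_t bank_status[NUM_BANKS];

/* One lock per bank number, so polls of different banks never contend. */
static atomic_bool bank_lock[NUM_BANKS];

static void lock_bank(int bank)
{
	/* Spin until we are the one who flipped the flag from false to true. */
	while (atomic_exchange_explicit(&bank_lock[bank], true,
					memory_order_acquire))
		;
}

static void unlock_bank(int bank)
{
	atomic_store_explicit(&bank_lock[bank], false, memory_order_release);
}

/* Test helper: pretend the hardware logged an error in this bank. */
static void inject_error(int bank)
{
	assert(bank >= 0 && bank < NUM_BANKS);
	atomic_store(&bank_status[bank], STATUS_VALID | 0x1234);
}

/*
 * Poll one bank. Returns true only for the caller that actually
 * reports (and clears) the error, so a shared bank is reported once.
 */
static bool poll_bank(int bank)
{
	bool valid;

	assert(bank >= 0 && bank < NUM_BANKS);

	/* Fast path: no lock taken in the common no-error case. */
	if (!(atomic_load(&bank_status[bank]) & STATUS_VALID))
		return false;

	lock_bank(bank);
	/* Reread under the lock: another CPU may have handled it already. */
	valid = (atomic_load(&bank_status[bank]) & STATUS_VALID) != 0;
	if (valid)
		atomic_store(&bank_status[bank], 0);	/* "report" and clear */
	unlock_bank(bank);

	return valid;
}
```

If two CPUs both see the valid bit on the unlocked read, only the first
one through the lock still finds it set on the reread; the second finds
the status cleared and reports nothing.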

There may have been an interaction with CMCI here too at some point,
but it's possible that went away with the timer patch (which we did a
bit later).

Havard