Re: [PATCH 1/2] boot: ignore early NMIs

From: Eric W. Biederman
Date: Mon Mar 12 2012 - 14:58:48 EST


Vivek Goyal <vgoyal@xxxxxxxxxx> writes:

> On Mon, Mar 12, 2012 at 03:14:20PM +0900, Fernando Luis Vázquez Cao wrote:
>
> [..]
>> The thing is that we want to avoid playing with hardware in the kdump
>> reboot path when we can avoid it, the premise being that it cannot
>> be accessed without risking a lockup or worse (as the deadlock accessing
>> the I/O APIC showed).
>
> I think there needs to be a limit to being paranoid. On one hand people
> want to run panic notifiers, all the kmsg_dump() hooks in panic path, and
> on the other hand we are afraid of even disabling LAPIC.

And the kmsg_dump code and the panic notifiers aren't being run. Having
seen some of their failure modes being patched up recently (Adding and
removing sysfs files!!!!) I'm very comfortable with the level of
paranoia.

It has been proven time and time again that the more you do in the
failing kernel, the greater your likelihood of not getting your
failure information out.

> I personally think that disabling LAPIC is a reasonably practical solution
> to the problem until and unless somebody shows that it deadlocks
> easily.

Disabling NMI generation in the LAPIC is fine, and for the short term
I don't even have a problem with disabling the entire LAPIC as all of
our platforms seem to have code for completely reprogramming it.
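
To make "disabling NMI generation in the LAPIC" concrete, here is a
minimal sketch of the bit manipulation involved. It assumes the LVT
layout from the Intel SDM (delivery mode in bits 10:8, mask in bit 16)
and works on an in-memory copy of the LVT values rather than on real
registers. The function and macro names are hypothetical, not the
kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LVT field layout as described in the Intel SDM; names are hypothetical. */
#define LVT_DELIVERY_MODE_MASK 0x700u      /* bits 10:8 */
#define LVT_DELIVERY_MODE_NMI  0x400u      /* 100b: deliver as NMI */
#define LVT_MASKED             (1u << 16)  /* bit 16: entry is masked */

/* Return the LVT value with the mask bit set if it would deliver an NMI. */
uint32_t lvt_mask_if_nmi(uint32_t lvt)
{
	if ((lvt & LVT_DELIVERY_MODE_MASK) == LVT_DELIVERY_MODE_NMI)
		return lvt | LVT_MASKED;
	return lvt;
}

/*
 * Apply lvt_mask_if_nmi() to a copy of the LVT registers and return how
 * many entries were changed. Entries using other delivery modes (ExtInt
 * included) are left untouched.
 */
size_t lapic_mask_nmi_lvts(uint32_t *lvt, size_t count)
{
	size_t changed = 0;

	assert(lvt != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		uint32_t v = lvt_mask_if_nmi(lvt[i]);

		if (v != lvt[i]) {
			lvt[i] = v;
			changed++;
		}
	}
	return changed;
}
```

Only entries programmed for NMI delivery are masked, which is the
narrower of the two options above. Disabling the entire LAPIC is a
different and larger operation.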

At the same time there have been cases like the i8259 routed through
the ExtInt pin of the LAPIC that we haven't been given programming
information about and that we should avoid touching if we want things
to work.

Furthermore we have two reported cases of people experiencing real NMIs
on the kdump path. So we have to assume the presence of the CMOS NMI
disable as well if we are going to unequivocally disable NMIs.
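
For reference, a minimal sketch of what the CMOS NMI disable looks
like, assuming the traditional PC layout where bit 7 of the byte
written to the CMOS index port (0x70) gates NMI delivery and the low
seven bits select the CMOS register. The helper only computes the byte
to write; it does not touch the port. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMOS_INDEX_PORT  0x70u  /* index port on PC-compatible hardware */
#define CMOS_NMI_DISABLE 0x80u  /* bit 7 set: NMIs are blocked */
#define CMOS_INDEX_MASK  0x7fu  /* bits 6:0 select the CMOS register */

/* Build the byte to write to CMOS_INDEX_PORT for a given register index. */
uint8_t cmos_index_byte(uint8_t index, bool nmi_disabled)
{
	assert(index <= CMOS_INDEX_MASK);
	return (uint8_t)(index | (nmi_disabled ? CMOS_NMI_DISABLE : 0u));
}
```

Because the NMI bit shares a byte with the register index, every write
to the index port has to carry the intended NMI state along with it.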

Given the variety of x86 hardware today and the growing variety of x86
hardware tomorrow we are going to be fixing this until we can actually
handle the NMIs. Hardware designers are unfortunately creative enough
that we aren't going to think of everything. Given that it has taken
us almost a decade to realize that there actually is a real world
problem I'm not too keen on a solution that is just good enough to
fix a small problem.

I would love it if x86 had an architectural NMI off switch but with
Intel pushing EFI and the removal of the cmos clock x86 no longer
has an always available NMI off switch.

Furthermore, handling NMIs is not hard; it is just a little tricky
to test.

Eric