Re: [V2 PATCH 0/6] x86, NMI: give NMI handler a face-lift

From: Jason Wessel
Date: Fri Nov 12 2010 - 10:56:35 EST


On 11/12/2010 09:42 AM, Don Zickus wrote:
> On Fri, Nov 12, 2010 at 09:05:03AM -0600, Jason Wessel wrote:
>
>> On 11/12/2010 08:43 AM, Don Zickus wrote:
>>
>>> Restructuring the nmi handler to be more readable and simpler.
>>>
>>> This is just laying the groundwork for future improvements in this area.
>>>
>>> I also left out one of Huang's patches until we figure out how we are
>>> going to proceed with a new notifier.
>>>
>>> Tested 32-bit and 64-bit on AMD and Intel machines.
>>>
>>> V2: add a patch to kill DIE_NMI_IPI and add in priorities
>>>
>>>
>>>
>> Had you tested this code with kgdb boot tests at all?
>>
>> CONFIG_LOCKUP_DETECTOR=y
>> CONFIG_HARDLOCKUP_DETECTOR=y
>> CONFIG_KGDB=y
>> CONFIG_KGDB_TESTS_ON_BOOT=y
>> CONFIG_KGDB_TESTS_BOOT_STRING="V1F100"
>>
>> There has been a regression in kgdb due to the use of perf/NMI in the
>> lockup detector ever since the new version was introduced. The
>> perf callbacks in the lockup detector were consuming NMI events not
>> related to the callback, causing the kernel debugger not to work at
>> all on SMP systems configured with the lockup detector.
>>
>
> Well 2.6.36 should have fixed that. Perf was blindly eating all NMI
> events if it had a user. With the new lockup detector, that created a
> 'user' for perf and it happily ate everything. But we spent a lot of time
> trying to fix that for 2.6.36. If we missed something, we would like to
> know.
>
> To answer your question, I doubt this patch series will change that
> outcome if it is still broken.
>
>

It was most definitely broken in 2.6.36->2.6.37-rc1. Randy Dunlap had
pointed this out in a separate exchange that was not on LKML.

The symptom you would see looks like:

...kernel boot...
Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:06: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
brd: module loaded
kgdb: Registered I/O driver kgdbts.
kgdbts:RUN plant and detach test
[...HARD HANG STARTS HERE...]

At that point the master kgdb CPU is looping, waiting for all of the
slave CPUs to join the debugger, but that never happens because the
perf callback chain used by the lockup detector eats the NMI IPI event.
After the perf callback is processed, perf returns NOTIFY_STOP, so the
notifier that brings the slave CPU into the debugger never fires.
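
The failure mode can be sketched in plain userspace C. This is only an
illustrative model, not kernel code: the handler names (perf_greedy,
perf_selective, kgdb_roundup), the sketch_event enum, and the deliver()
walker are all hypothetical, and placing the perf handler ahead of the
kgdb handler in the chain is an assumption inferred from the symptom.
Only the NOTIFY_* values are modeled on the kernel's notifier return
codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes modeled on the kernel's notifier return values. */
#define NOTIFY_DONE      0x0000  /* not mine, keep walking the chain */
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000  /* stop walking the chain */
#define NOTIFY_STOP      (NOTIFY_STOP_MASK | NOTIFY_OK)

/* Hypothetical event sources for this sketch. */
enum sketch_event { EV_PERF_OVERFLOW, EV_KGDB_ROUNDUP };

typedef int (*sketch_handler)(enum sketch_event ev);

/* Set when the (simulated) slave CPU enters the debugger. */
static bool slave_joined_debugger;

/* Hypothetical greedy perf handler: claims every NMI while it has a user. */
int perf_greedy(enum sketch_event ev)
{
    (void)ev;
    return NOTIFY_STOP;
}

/* Hypothetical selective perf handler: claims only its own events. */
int perf_selective(enum sketch_event ev)
{
    return ev == EV_PERF_OVERFLOW ? NOTIFY_STOP : NOTIFY_DONE;
}

/* Hypothetical kgdb handler: pulls the slave CPU into the debugger. */
int kgdb_roundup(enum sketch_event ev)
{
    if (ev != EV_KGDB_ROUNDUP)
        return NOTIFY_DONE;
    slave_joined_debugger = true;
    return NOTIFY_STOP;
}

/* Walk the chain in order, stopping at the first handler that sets
 * NOTIFY_STOP_MASK. Returns true if the slave joined the debugger. */
bool deliver(const sketch_handler *chain, size_t n, enum sketch_event ev)
{
    assert(n == 0 || chain != NULL);
    slave_joined_debugger = false;
    for (size_t i = 0; i < n; i++) {
        if (chain[i](ev) & NOTIFY_STOP_MASK)
            break;
    }
    return slave_joined_debugger;
}

/* Send a simulated roundup NMI through a chain in which the given perf
 * handler runs before kgdb (assumed ordering). */
bool roundup_reaches_kgdb(sketch_handler perf)
{
    const sketch_handler chain[] = { perf, kgdb_roundup };
    return deliver(chain, 2, EV_KGDB_ROUNDUP);
}
```

In this model roundup_reaches_kgdb(perf_greedy) returns false, which
corresponds to the master waiting forever, while
roundup_reaches_kgdb(perf_selective) returns true because the perf
handler passes on an event it does not own.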

You can even reproduce the behavior by booting a kernel with the kgdb
boot tests enabled under kvm with -smp 2.

I did build with your 6 part series, and the behavior is no different
(meaning it is still broken).

Jason.