Re: 4.4-rc5 Setting trap flag inside nmi handler results in HARD LOCKUP

From: Jeff Merkey
Date: Thu Dec 17 2015 - 02:23:21 EST


On 12/16/15, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
> Setting the (trap flag | resume flag) inside of an nmi handler results
> in a hard lockup while setting the resume flag works fine.
>
> The watchdog detector fails to detect the lockup. I am currently
> examining the trap gate and interrupt gate setup on Linux and if
> anyone has any ideas it would be nice to be able to debug and step
> through the nmi handlers. I got breakpoints to work. I noticed
> kgdb/kdb just punts here and refuses to allow someone to step inside
> an nmi handler.
>
> There is no reason Linux should not allow this to work since windows
> does and every other OS out there. I have seen this across some rex64
> sysret calls as well this lockup behavior.
>
> Anyone who is an intel expert with any clues would love some input if
> you know about this problem.
>
> Jeff
>

This bug has been located. It results from returning from an NMI
with the trap flag set into a userspace address, as Andy suspected, but
it's not due to the RSP value being different as he suggested. This is a
separate bug from the rex64 sysret bug.

It comes from the code in the NMI handler that switches IDT entries if
an NMI fires off in a debug stack. Ironic, since the code claims it is
switching stacks to enable debugging of NMI handlers and does the
opposite -- breaks them. Commenting out this code gets rid of the hard
lockup. The userspace process that gets the trap flag and doesn't
expect one just hangs (but just that process; the rest of the system
keeps running).

So a few bugs to run down still. NMI handlers can now be debugged -- kind of.

This bug is closed and I will issue a patch for it. It's a condition
where the trap flag is set inside an NMI handler that exits to a
userspace address. The code for setting and clearing the trap flag in
the kernel all worked correctly for the userspace path, except it put
the process to sleep when it shouldn't have. It's not a condition that
can happen during normal operation: you have to set the trap flag from
a debugger inside an NMI handler, try to debug it, and then exit the
handler into userspace. So I think the probability of this showing up
outside a debugging session is low.
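
For reference, here is a minimal sketch of the two flag combinations
discussed in this thread: resume flag only (which works) versus trap
flag | resume flag (which triggers the lockup when the NMI returns to
userspace). The bit positions are the architectural EFLAGS ones (TF is
bit 8, RF is bit 16). The struct and function names are hypothetical
stand-ins for illustration only; this is not the kernel's or any
debugger's actual code.

```c
#include <stdint.h>

/* Architectural bit positions in EFLAGS/RFLAGS on x86. */
#define EFLAGS_TF (1ULL << 8)   /* trap flag: raise #DB after the next instruction */
#define EFLAGS_RF (1ULL << 16)  /* resume flag: suppress instruction breakpoints once */

/* Hypothetical stand-in for the saved frame that the interrupt return reloads. */
struct saved_frame {
    uint64_t flags;
};

/* "Step": set trap flag | resume flag in the saved flags image. */
static void frame_single_step(struct saved_frame *f)
{
    f->flags |= EFLAGS_TF | EFLAGS_RF;
}

/* "Go": clear the trap flag and set only the resume flag. */
static void frame_resume(struct saved_frame *f)
{
    f->flags &= ~EFLAGS_TF;
    f->flags |= EFLAGS_RF;
}

/* Nonzero if the frame will single-step once it is restored. */
static int frame_is_stepping(const struct saved_frame *f)
{
    return (f->flags & EFLAGS_TF) != 0;
}
```

The flags only take effect when the saved frame is reloaded on return
from the handler, which is why the problem shows up at the exit to
userspace rather than at the point where the debugger sets them.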

I verified that kgdb/kdb also experiences this bug (if I comment out
the code blocking folks from debugging NMI handlers with kgdb/kdb).

Jeff