Re: recent x86-64 nested NMI adjustments

From: Steven Rostedt
Date: Mon Mar 12 2012 - 09:16:56 EST


On Mon, 2012-03-12 at 12:10 +0000, Jan Beulich wrote:
> Hi Steven,
>
> the explanation of 45d5a1683c04be28abdf5c04c27b1417e0374486
> seems bogus to me: When arriving from user mode, %rsp won't point
> to the user stack anymore, as it gets switched away from during the
> processing of the exception (the more that the IDT entry specifies a
> separate stack anyway, which even guarantees this for kernel mode
> entries).

No, it is real, and I had a test program that exploited it. I'm not
worried about the current %rsp, I'm worried about which %rsp is saved on
the stack. Two things are used to check whether the incoming NMI is
nested or not.

1) if the on-stack "in-nmi" variable is set

2) if the saved %rsp is pointing to the NMI stack.

Note, #2 looks at the *saved* %rsp, which is the %rsp at the time the
NMI triggered. The second check is used to handle the case where a
nested NMI comes in after the previous NMI has cleared the on-stack
"in-nmi" variable, but before it executes the iret.

There are a few cases where the stack can change in the NMI, so the
variable is also used.
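
To make the two checks concrete, here is a rough C sketch of the logic
as described above. The real code is assembly in the NMI entry path;
the struct, the function name and the stack bounds passed in are all
made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of what the handler can inspect on entry. */
struct nmi_entry {
	bool     in_nmi;    /* check 1: the on-stack "in-nmi" variable */
	uint64_t saved_rsp; /* check 2: %rsp saved in the exception frame */
};

/*
 * Returns true if the incoming NMI would be treated as nested.
 * nmi_stack_top and nmi_stack_size describe this CPU's NMI stack;
 * both are parameters here only so the sketch stays self-contained.
 */
bool looks_nested(const struct nmi_entry *e,
		  uint64_t nmi_stack_top, uint64_t nmi_stack_size)
{
	/* 1) a previous NMI is still marked as running */
	if (e->in_nmi)
		return true;

	/* 2) the interrupted context was running on the NMI stack */
	return e->saved_rsp <= nmi_stack_top &&
	       e->saved_rsp >= nmi_stack_top - nmi_stack_size;
}
```

Note that nothing in check 2 asks who owned the saved %rsp. Any
interrupted context whose %rsp falls inside that range is treated as
a nested NMI.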

There's a really good article on LWN about this :-)

https://lwn.net/Articles/484932/

(subscription required, but you should have one)

That said, I added a printk to the boot-up path to show me where the NMI
stacks were located. Then I wrote a program that would pin itself to a
CPU and change its stack pointer to point into the NMI stack of that CPU
and then go into an infinite loop. I ran perf on this code and it became
"invisible" to perf. That is, every time the NMI came in while this code
was running, it incorrectly considered itself a nested NMI and returned,
never recording the presence of this program.
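
A program along those lines could look roughly like the sketch below.
This is not the original test program. The function names are made up,
and the NMI stack address has to be supplied by hand from the printk
output. The raw syscalls are used only to keep the example free of
feature-test macros; sched_setaffinity() from <sched.h> does the same
job.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MASK_WORDS    16
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Pin the calling thread to a single CPU. Returns 0 on success. */
int pin_to_cpu(unsigned int cpu)
{
	unsigned long mask[MASK_WORDS];

	if (cpu >= MASK_WORDS * BITS_PER_WORD)
		return -1;
	memset(mask, 0, sizeof(mask));
	mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
	return (int)syscall(SYS_sched_setaffinity, 0, sizeof(mask), mask);
}

/* Return the CPU the caller is currently running on, or -1 on error. */
int current_cpu(void)
{
	unsigned int cpu;

	if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
		return -1;
	return (int)cpu;
}

/*
 * Load an arbitrary value into %rsp and spin without ever touching
 * the stack. Never returns. fake_rsp would be an address inside the
 * pinned CPU's NMI stack, taken from the printk output.
 */
void spin_with_rsp(uint64_t fake_rsp)
{
#if defined(__x86_64__)
	__asm__ volatile("movq %0, %%rsp\n\t"
			 "1: jmp 1b"
			 : : "r"(fake_rsp) : "memory");
#else
	(void)fake_rsp;
	for (;;)
		;
#endif
}
```

Usage would be pin_to_cpu(n) followed by spin_with_rsp(addr), with
addr taken from the boot log for CPU n. The loop never dereferences
%rsp, so user space does not fault on the kernel address.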

After adding this patch, perf shows the task spending 99.9% of the time
in this loop. Thus this is a real bug.
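
The idea behind the fix can be sketched in C as follows, assuming it
amounts to testing the saved %cs before either nested check runs. The
real change is in the assembly entry code. The names and the selector
value here are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed kernel code selector value, for illustration only. */
#define SKETCH_KERNEL_CS 0x10u

/* Hypothetical subset of the saved exception frame. */
struct saved_frame {
	uint16_t cs;  /* code segment selector of the interrupted context */
	uint64_t rsp; /* stack pointer of the interrupted context */
};

bool nmi_nested_fixed(const struct saved_frame *f, bool in_nmi,
		      uint64_t nmi_stack_top, uint64_t nmi_stack_size)
{
	/* An NMI that interrupted user space can never be nested. */
	if (f->cs != SKETCH_KERNEL_CS)
		return false;

	/* From here on, the same two checks as before. */
	if (in_nmi)
		return true;
	return f->rsp <= nmi_stack_top &&
	       f->rsp >= nmi_stack_top - nmi_stack_size;
}
```

With the %cs test first, a user task whose %rsp points into the NMI
stack is no longer mistaken for a nested NMI.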


>
> Further, a38449ef596b345e13a8f9b7d5cd9fedb8fcf921 makes the
> (presumably superfluous) compare a 4-byte one, while the
> documentation isn't really stating that selectors get pushed zero-
> extended. Hence, if not reverting the first change altogether, I'd
> minimally recommend converting the compare to a 2-byte one.

I'll let H. Peter answer this one, he's the Intel representative here.
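
For reference, the difference between the two compare widths can be
sketched in C like this. Whether the hardware ever leaves non-zero
bits above the selector is exactly the open question, so the sketch
takes no position on it. The selector value and function names are
assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed kernel code selector value, for illustration only. */
#define SKETCH_KERNEL_CS 0x10u

/* 4-byte compare: bits 16-31 of the stack slot must be zero to match. */
bool cs_match_4byte(uint64_t slot)
{
	return (uint32_t)slot == SKETCH_KERNEL_CS;
}

/* 2-byte compare: only the 16-bit selector itself is examined. */
bool cs_match_2byte(uint64_t slot)
{
	return (uint16_t)slot == SKETCH_KERNEL_CS;
}
```

The two agree whenever the slot is zero-extended. They differ only if
something other than zero ends up in bits 16-31 of the pushed value.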


-- Steve

