Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
From: Andy Lutomirski
Date: Tue Jul 26 2016 - 18:38:25 EST
On Tue, Jul 26, 2016 at 3:24 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> >> > since they're all on the same stack. I can try to test it. What in
>> >> >> > particular are you worried about?
>> >> >> >
>> >> >>
>> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> >> CS, IP) records. Off the top of my head, the record that matters is
>> >> >> the third one, so it should be regs[-15]. If an MCE hits while this
>> >> >> mess is being set up, good luck unwinding *that*. If you really want
>> >> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> >> architecture.
>> >> >
>> >> > I did read that comment. Fortunately there's a big difference between
>> >> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >> >
>> >> > For nested NMIs, it does look like the stack of the exception which
>> >> > interrupted the first NMI would get skipped by the stack dump. (But
>> >> > that's a general problem, not specific to my patch set.)
>> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> >> we're going to crash, so it doesn't matter much whether the unwinder
>> >> works. Is that what you mean?
>> > I read the NMI entry code again, and now I realize my comment was
>> > completely misinformed, so never mind.
>> > Is "IST -> NMI -> same IST" even possible, since the other ISTs are
>> > higher priority than NMI?
>> Priority only matters for events that happen concurrently.
>> Synchronous things like #DB will always fire if the conditions that
>> trigger them are hit,
> So just to clarify, are you saying a lower priority exception like NMI
> can interrupt a higher priority exception handler like #DB? I'm getting
> a different conclusion from reading section 6.9 of the Intel System
> Programming Guide.
Yes, effectively. From the CPU's perspective, it's done with the #DB
as soon as it finishes pushing the stack frame and starts running
instructions again. So the chain of events looks like:
  [hardware pushes the #DB iret frame]   <-- CPU is delivering #DB. NMI can't be delivered.
  [first instruction of the #DB handler] <-- Oh boy, done with delivering #DB. NMIs can be delivered again!
  ...
  iretq                                  <-- CPU has no idea that this is related to the #DB
>> >> > Am I correct in understanding that there can only be one level of NMI
>> >> > nesting at any given time? If so, could we make it easier on the
>> >> > unwinder by putting the nested NMI on a separate software stack, so the
>> >> > "next stack" pointers are always in the same place? Or am I just being
>> >> > naive?
>> >> I think you're being naive :)
>> >> But we don't really need the unwinder to be 100% faithful here. If we have:
>> >> task stack
>> >> NMI
>> >> nested NMI
>> >> then the nested NMI code won't call into C and thus it should be
>> >> impossible to ever invoke your unwinder on that state. Instead the
>> >> nested NMI code will fiddle with the saved regs and return in such a
>> >> way that the outer NMI will be forced to loop through again. So it
>> >> *should* (assuming I'm remembering all this crap correctly) be
>> >> sufficient to find the "outermost" pt_regs, which is sitting at
>> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
>> >> value. This ought to be the same thing that the frame-based unwinder
>> >> would naturally try to do. But if you make this change, ISTM you
>> >> should make it separately because it does change behavior and Linus is
>> >> understandably leery about that.
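The "(struct pt_regs *)(end - 12) - 1" arithmetic above can be sketched in plain C. This is a minimal illustration of the pointer math only, under the assumption (taken from the quoted text, not verified against entry_64.S) that the 12 words below the top of the NMI stack hold the three copies of the 5-word hardware iret frame plus padding, with the outermost pt_regs sitting immediately below them; the struct here is a stand-in with the same size and tail layout as the x86-64 pt_regs, not the real kernel definition.

```c
/* Stand-in for the kernel's x86-64 struct pt_regs: 16 GP/orig_ax slots
 * followed by the hardware frame tail (21 unsigned longs total). */
struct pt_regs {
	unsigned long gp[16];                       /* r15..di, orig_ax */
	unsigned long ip, cs, flags, sp, ss;        /* hardware iret frame */
};

/* Sketch: given the top of the NMI stack, locate the "outermost"
 * pt_regs saved by the NMI entry code.  The 12-word gap is assumed to
 * hold the duplicated iret frames described in the mail. */
static struct pt_regs *outermost_nmi_regs(unsigned long *end)
{
	return (struct pt_regs *)(end - 12) - 1;
}
```

Reading outermost_nmi_regs(end)->sp would then recover the previous stack, which is the behavior a frame-based unwinder would naturally fall into anyway.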
>> > Ok, I think that makes sense to me now. As I understand it, the
>> > "outermost" RIP is the authoritative one, because it was written by the
>> > original NMI. Any nested NMIs will update the original and/or iret
>> > RIPs, which will only ever point to NMI entry code, and so they should
>> > be ignored.
>> > But I think there's a case where this wouldn't work:
>> > task stack
>> > NMI
>> > IST
>> > stack dump
>> > If the IST interrupt hits before the NMI has a chance to update the
>> > outermost regs, the authoritative RIP would be the original one written
>> > by HW, right?
>> This should be impossible unless that last entry is MCE. If we
>> actually fire an event that isn't MCE early in NMI entry, something
>> already went very wrong.
> So we don't need to support breakpoints in the early NMI entry code?
No. Instead we try not to let it happen. See, for example:
Author: Andy Lutomirski <luto@xxxxxxxxxx>
Date:   Thu Jul 30 20:32:40 2015 -0700

    perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe
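The policy that commit implements can be sketched as a tiny predicate: user-space breakpoints are always fine, kernel-text breakpoints only when the address is safe for kprobes (so nothing can fire inside fragile entry code). Everything here is illustrative, assuming a made-up address range and helper names, not the actual perf/hw_breakpoint code.

```c
#define TASK_SIZE_MAX 0x7ffffffff000UL  /* x86-64 user address limit (illustrative) */

/* Stand-in for a kprobe-blacklist check; the range below is a
 * pretend slice of NMI/entry code, purely for demonstration. */
static int in_fragile_entry_code(unsigned long addr)
{
	return addr >= 0xffffffff81000000UL && addr < 0xffffffff81001000UL;
}

/* Sketch of the commit's policy: allow breakpoints on user addresses
 * unconditionally; allow kernel breakpoints only when the address is
 * kprobe-safe (approximated here by "not in fragile entry code"). */
static int hw_breakpoint_allowed(unsigned long addr)
{
	if (addr < TASK_SIZE_MAX)
		return 1;
	return !in_fragile_entry_code(addr);
}
```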
>> Be careful, though: kernel threads might not have a "user" pt_regs in
>> the "user_mode" returns true sense. Checking that it's either
>> user_mode() or at task_pt_regs() might be a good condition to check.
> Yeah. I guess there are two distinct cases of "going off the rails":
> 1) The unwinder doesn't get to the end of the stack (user regs for user
> tasks, or whatever the end is for kthreads).
> 2) The unwinder strays away from the current stack's "previous stack"
> pointer.
> We could warn on either case (though there's probably overlap between
> the two).
I'm in favor of both. But I think it's best to do them at the end of
the series so that they're easy to revert in the event that Linus
complains and neither of us can convince him that it's okay.
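The two "off the rails" warnings could look something like this. A minimal sketch with invented field and function names (the real unwinder state obviously looks nothing like this); it only shows the shape of the two checks and why they overlap, since a bad previous-stack link usually also prevents reaching the end.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical minimal unwind result; field names are assumptions. */
struct unwind_state {
	bool reached_end;    /* walked all the way to the final frame */
	bool regs_at_end;    /* final frame is user regs / task_pt_regs */
	bool stack_link_ok;  /* every "previous stack" pointer was sane */
};

/* Sketch of the two sanity warnings discussed above:
 * 1) the unwinder never reached the end of the stack, and
 * 2) it strayed from a stack's recorded "previous stack" link. */
static int check_unwind(const struct unwind_state *st)
{
	int warnings = 0;

	if (!st->reached_end || !st->regs_at_end) {
		fprintf(stderr, "WARNING: unwinder did not reach end of stack\n");
		warnings++;
	}
	if (!st->stack_link_ok) {
		fprintf(stderr, "WARNING: unwinder strayed from previous-stack link\n");
		warnings++;
	}
	return warnings;
}
```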
AMA Capital Management, LLC