Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

From: Josh Poimboeuf
Date: Tue Jul 26 2016 - 18:25:05 EST


On Tue, Jul 26, 2016 at 01:59:20PM -0700, Andy Lutomirski wrote:
> On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> >> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
> >> >> > since they're all on the same stack. I can try to test it. What in
> >> >> > particular are you worried about?
> >> >> >
> >> >>
> >> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> >> >> CS, IP) records. Off the top of my head, the record that matters is
> >> >> the third one, so it should be regs[-15]. If an MCE hits while this
> >> >> mess is being set up, good luck unwinding *that*. If you really want
> >> >> to know, take a deep breath, read the long comment in entry_64.S after
> >> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> >> >> architecture.
> >> >
> >> > I did read that comment. Fortunately there's a big difference between
> >> > reading and understanding so I can go on being an ignorant x86 hacker!
> >> >
> >> > For nested NMIs, it does look like the stack of the exception which
> >> > interrupted the first NMI would get skipped by the stack dump. (But
> >> > that's a general problem, not specific to my patch set.)
> >>
> >> If we end up with task -> IST -> NMI -> same IST, we're doomed and
> >> we're going to crash, so it doesn't matter much whether the unwinder
> >> works. Is that what you mean?
> >
> > I read the NMI entry code again, and now I realize my comment was
> > completely misinformed, so never mind.
> >
> > Is "IST -> NMI -> same IST" even possible, since the other IST's are
> > higher priority than NMI?
>
> Priority only matters for events that happen concurrently.
> Synchronous things like #DB will always fire if the conditions that
> trigger them are hit,

So just to clarify, are you saying a lower priority exception like NMI
can interrupt a higher priority exception handler like #DB? I'm getting
a different conclusion from reading section 6.9 of the Intel System
Programming Guide.

> >> > Am I correct in understanding that there can only be one level of NMI
> >> > nesting at any given time? If so, could we make it easier on the
> >> > unwinder by putting the nested NMI on a separate software stack, so the
> >> > "next stack" pointers are always in the same place? Or am I just being
> >> > naive?
> >>
> >> I think you're being naive :)
> >>
> >> But we don't really need the unwinder to be 100% faithful here. If we have:
> >>
> >> task stack
> >> NMI
> >> nested NMI
> >>
> >> then the nested NMI code won't call into C and thus it should be
> >> impossible to ever invoke your unwinder on that state. Instead the
> >> nested NMI code will fiddle with the saved regs and return in such a
> >> way that the outer NMI will be forced to loop through again. So it
> >> *should* (assuming I'm remembering all this crap correctly) be
> >> sufficient to find the "outermost" pt_regs, which is sitting at
> >> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
> >> value. This ought to be the same thing that the frame-based unwinder
> >> would naturally try to do. But if you make this change, ISTM you
> >> should make it separately because it does change behavior and Linus is
> >> understandably leery about that.
> >
> > Ok, I think that makes sense to me now. As I understand it, the
> > "outermost" RIP is the authoritative one, because it was written by the
> > original NMI. Any nested NMIs will update the original and/or iret
> > RIPs, which will only ever point to NMI entry code, and so they should
> > be ignored.
> >
> > But I think there's a case where this wouldn't work:
> >
> > task stack
> > NMI
> > IST
> > stack dump
> >
> > If the IST interrupt hits before the NMI has a chance to update the
> > outermost regs, the authoritative RIP would be the original one written
> > by HW, right?
>
> This should be impossible unless that last entry is MCE. If we
> actually fire an event that isn't MCE early in NMI entry, something
> already went very wrong.

So we don't need to support breakpoints in the early NMI entry code?

> For NMI -> MCE -> stack dump, the frame-based unwinder will do better
> than get_stack_info() unless get_stack_info() learns to use the
> top-of-stack hardware copy if the actual RSP it finds is too high
> (above the "outermost" frame).

Ok.

> >> Hmm. I wonder if it would make sense to decode this thing both ways
> >> and display it. So show_trace_log_lvl() could print something like:
> >>
> >> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
> >>
> >> and, in the case where the frame unwinder disagrees, it'll at least be
> >> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
> >> 0xffffwhateverfff.
> >>
> >> Then Linus is happy because the unwinder works just like it did before
> >> but people like me are also happy because it's clearer what's going
> >> on. FWIW, I've personally debugged crashes in the NMI code where the
> >> current unwinder falls apart completely and it's not fun -- having a
> >> display indicating that the unwinder got confused would be nice.
> >
> > Hm, maybe. Another idea would be to have the unwinder print some kind
> > of warning if it goes off the rails. It should be able to detect that,
> > because every stack trace should end at a user pt_regs.
>
> I like that.
>
> Be careful, though: kernel threads might not have a "user" pt_regs in
> the "user_mode" returns true sense. Checking that it's either
> user_mode() or at task_pt_regs() might be a good condition to check.

Yeah. I guess there are two distinct cases of "going off the rails":

1) The unwinder doesn't get to the end of the stack (user regs for user
tasks, or whatever the end is for kthreads).

2) The unwinder strays away from the current stack's "previous stack"
pointer.

We could warn on either case (though there's probably overlap between
the two).

--
Josh