Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

From: Andy Lutomirski
Date: Tue Jul 26 2016 - 16:59:48 EST


On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> > since they're all on the same stack. I can try to test it. What in
>> >> > particular are you worried about?
>> >> >
>> >>
>> The top of the NMI frame contains no fewer than *three* copies of the
>> five-word (SS, SP, FLAGS, CS, IP) record. Off the top of my head, the
>> record that matters is the third one, so it should be regs[-15]. If an MCE hits while this
>> >> mess is being set up, good luck unwinding *that*. If you really want
>> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> architecture.
>> >
>> > I did read that comment. Fortunately there's a big difference between
>> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >
>> > For nested NMIs, it does look like the stack of the exception which
>> > interrupted the first NMI would get skipped by the stack dump. (But
>> > that's a general problem, not specific to my patch set.)
>>
>> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> we're going to crash, so it doesn't matter much whether the unwinder
>> works. Is that what you mean?
>
> I read the NMI entry code again, and now I realize my comment was
> completely misinformed, so never mind.
>
> Is "IST -> NMI -> same IST" even possible, since the other IST's are
> higher priority than NMI?

Priority only matters for events that happen concurrently.
Synchronous things like #DB will always fire if the conditions that
trigger them are hit.

>
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time? If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place? Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>>
>> But we don't really need the unwinder to be 100% faithful here. If we have:
>>
>> task stack
>> NMI
>> nested NMI
>>
>> then the nested NMI code won't call into C and thus it should be
>> impossible to ever invoke your unwinder on that state. Instead the
>> nested NMI code will fiddle with the saved regs and return in such a
>> way that the outer NMI will be forced to loop through again. So it
>> *should* (assuming I'm remembering all this crap correctly) be
>> sufficient to find the "outermost" pt_regs, which is sitting at
>> (struct pt_regs *)(end - 12) - 1 or thereabouts and look at its ->sp
>> value. This ought to be the same thing that the frame-based unwinder
>> would naturally try to do. But if you make this change, ISTM you
>> should make it separately because it does change behavior and Linus is
>> understandably leery about that.
>
> Ok, I think that makes sense to me now. As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI. Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.
>
> But I think there's a case where this wouldn't work:
>
> task stack
> NMI
> IST
> stack dump
>
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

This should be impossible unless that last entry is MCE. If we
actually fire an event that isn't MCE early in NMI entry, something
already went very wrong.

For NMI -> MCE -> stack dump, the frame-based unwinder will do better
than get_stack_info() unless get_stack_info() learns to use the
top-of-stack hardware copy if the actual RSP it finds is too high
(above the "outermost" frame).

>
>> Hmm. I wonder if it would make sense to decode this thing both ways
>> and display it. So show_trace_log_lvl() could print something like:
>>
>> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
>>
>> and, in the case where the frame unwinder disagrees, it'll at least be
>> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
>> 0xffffwhateverfff.
>>
>> Then Linus is happy because the unwinder works just like it did before
>> but people like me are also happy because it's clearer what's going
>> on. FWIW, I've personally debugged crashes in the NMI code where the
>> current unwinder falls apart completely and it's not fun -- having a
>> display indicating that the unwinder got confused would be nice.
>
> Hm, maybe. Another idea would be to have the unwinder print some kind
> of warning if it goes off the rails. It should be able to detect that,
> because every stack trace should end at a user pt_regs.

I like that.

Be careful, though: kernel threads might not have a "user" pt_regs in
the sense that user_mode() returns true. Checking that the final frame
is either user_mode() or at task_pt_regs() might be a good condition.

--Andy