Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

From: Andy Lutomirski
Date: Mon Jul 25 2016 - 20:10:14 EST

On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> >> > + unsigned long *visit_mask)
>> >> > +{
>> >> > + unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> >> > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long));
>> >> > +
>> >> > + if (stack < begin || stack >= end)
>> >> > + return false;
>> >> > +
>> >> > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> >> > + return false;
>> >> > +
>> >> > + info->type = STACK_TYPE_IRQ;
>> >> > + info->begin = begin;
>> >> > + info->end = end;
>> >> > + info->next = (unsigned long *)*begin;
>> >>
>> >> This works, but it's a bit magic. I don't suppose we could get rid of
>> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> >> move from stack to stack by querying the stack type of whatever the
>> >> frame base address is if the frame base address ends up being out of
>> >> bounds for the current stack? Or maybe the unwinder could even do
>> >> this by itself.
>> >
>> > I'm not quite sure what you mean here. The ->next stack pointer is
>> > quite useful and it abstracts that ugliness away from the callers of
>> > get_stack_info(). I'm open to any specific suggestions.
>> So far I've found two users of this thing. One is
>> show_stack_log_lvl(), and it makes sense there, but maybe
>> info->heuristic_next_stack would be a better name. The other is the
>> unwinder itself, and I think that walking from stack to stack using
>> this heuristic is the wrong approach there, at least in the long term.
>> I'd rather we just followed the bp chain wherever it leads us, as long
>> as it leads us to a valid stack that we haven't visited before.
>> As a concrete example of what I think is wrong with the current
>> approach, ISTM it would be totally valid to implement stack switching
>> like this:
>> some_func:
>> push %rbp
>> mov %rsp, %rbp
>> ...
>> mov [next stack], %rsp
>> call some_other_func
>> mov %rbp, %rsp
>> pop %rbp
>> ret
>> With the current approach, you can't unwind out of that function,
>> because there's no way to populate info->next. I'm not actually
>> suggesting that the kernel should ever do such a thing on x86, and my
>> proposed rewrite of the IRQ stack code [1] will be fully compatible
>> with your approach, but it seems odd to me that the unwinder should
>> depend on idea that the stacks in use are chained together in a way
>> that can be decoded without . (But maybe some of the Go compilers do
>> work this way -- I've never looked at their precise stack layout.)
> I don't think relying on frame pointers to switch between stacks is
> necessarily a good idea:
> - It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
> The current approach is unwinder-agnostic.
> - Instead of relying on a single correct "next stack" pointer, it
> requires relying on potentially dozens of correct frame pointers,
> across multiple stacks. So a lot of things have to go right, instead
> of just one. And then show_trace_log_lvl() becomes more dependent on
> the unwinder not screwing things up.

That's a fair point, at least for show_trace_log_lvl(). So let's
leave it alone for now. We can always revisit it later.

>> If you can write it as:
>> struct pt_regs *regs = (struct pt_regs *)end - 1;
>> info->next = regs->sp;
>> and it still works, then no comment required :)
> Yeah. in_irq_stack() does something similar, though it uses end[-1].
> And its regs are actually stored on the thread stack. So something
> doesn't quite add up for irqs. I still need to do some homework there.

I can do your homework for you: the irq stacks doesn't contain pt_regs.

The current code is quite hard to understand, but this patch of mine
(which I'll try to dust off and send in soon) cleans it up and should
be much easier to understand:

So a comment like /* When the IRQ stack is in use, the top word stores
the previous stack pointer. */ should do the trick.

>> >
>> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> > since they're all on the same stack. I can try to test it. What in
>> > particular are you worried about?
>> >
>> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> CS, IP) records. Off the top of my head, the record that matters is
>> the third one, so it should be regs[-15]. If an MCE hits while this
>> mess is being set up, good luck unwinding *that*. If you really want
>> to know, take a deep breath, read the long comment in entry_64.S after
>> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> architecture.
> I did read that comment. Fortunately there's a big difference between
> reading and understanding so I can go on being an ignorant x86 hacker!
> For nested NMIs, it does look like the stack of the exception which
> interrupted the first NMI would get skipped by the stack dump. (But
> that's a general problem, not specific to my patch set.)

If we end up with task -> IST -> NMI -> same IST, we're doomed and
we're going to crash, so it doesn't matter much whether the unwinder
works. Is that what you mean?

> Am I correct in understanding that there can only be one level of NMI
> nesting at any given time? If so, could we make it easier on the
> unwinder by putting the nested NMI on a separate software stack, so the
> "next stack" pointers are always in the same place? Or am I just being
> naive?

I think you're being naive :)

But we don't really need the unwinder to be 100% faithful here. If we have:

task stack
nested NMI

then the nested NMI code won't call into C and thus it should be
impossible to ever invoke your unwinder on that state. Instead the
nested NMI code will fiddle with the saved regs and return in such a
way that the outer NMI will be forced to loop through again. So it
*should* (assuming I'm remembering all this crap correctly) be
sufficient to find the "outermost" pt_regs, which is sitting at
(struct pt_regs *)(end - 12) - 1 or thereabouts and look at it's ->sp
value. This ought to be the same thing that the frame-based unwinder
would naturally try to do. But if you make this change, ISTM you
should make it separately because it does change behavior and Linus is
understandably leery about that.

Hmm. I wonder if it would make sense to decode this thing both ways
and display it. So show_trace_log_lvl() could print something like:

<#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>

and, in the case where the frame unwinder disagrees, it'll at least be
visible in that 0xffffsomething won't be between 0xffffwhatever000 and

Then Linus is happy because the unwinder works just like it did before
but people like me are also happy because it's clearer what's going
on. FWIW, I've personally debugged crashes in the NMI code where the
current unwinder falls apart completely and it's not fun -- having a
display indicating that the unwinder got confused would be nice.