Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

From: Josh Poimboeuf
Date: Fri Jul 22 2016 - 11:57:20 EST


On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> > an interrupt or exception, we can print them. Here's what it looks
> >> > like:
> >> >
> >> > ...
> >> > [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> > [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> > RIP: 0010:[<ffffffff814529e2>] [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> > RSP: 0018:ffff88007876fd38 EFLAGS: 00010202
> >> > RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> > RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> > R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> > R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> > [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> > [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> > [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> > ...
> >>
> >> This looks wrong. Here are some theories:
> >>
> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> Then it's found again as an unreliable address as "?
> >> __clear_user+0x42/0x70" by scanning the stack. "?
> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> suppressed because it matched a reliable address?
> >>
> >> (b) You actually intended for all the addresses to be printed, in
> >> which case "? __clear_user+0x42/0x70" should have been
> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's
> >> plausible that your state machine got a bit lost leading to "?
> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> it's a real frame and you didn't find it).
> >>
> >> (c) Something else and I'm confused.
> >
> > So there's a subtle difference between addresses reported by regs->ip
> > and normal return addresses. For example:
> >
> > ...
> > [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> > [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> > RIP: 0010:[<ffffffff8129b350>] [<ffffffff8129b350>] path_init+0x0/0x750
> > RSP: 0018:ffff880036a3fd80 EFLAGS: 00000296
> > RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> > RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> > RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> > R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> > R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> > <EOI>
> > [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> > [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> > [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> > ...
> >
> > In this case the irq hit right after path_lookupat() called into
> > path_init(). So the "path_init+0x0" printed by __show_regs() is right.
> >
> > Note the backtrace reports the same address, but it instead describes it
> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's
> > because normally, such an address after a call instruction at the end of
> > a function would indicate a tail call (e.g., to a noreturn function).
> > If that were the case, printing "path_init+0x0" would be completely
> > wrong, because path_init() just happens to be the function located
> > immediately after lookup_fast().
> >
> > Maybe I could add some special logic to say: "if this return address was
> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> > use printk_address()." (The former prints the preceding function, the
> > latter prints the current function.) Then we could remove the question
> > mark.
> >
> > There's also the question of whether or not the address should be
> > printed again, after it's already been printed by __show_regs(). I
> > don't have a strong opinion either way.
> >
>
> IIRC we don't show the actual faulting function in the call trace, so
> we probably shouldn't duplicate the entry after the show_regs.

Just to clarify, that's true today for cases where the stack dump starts
from a handler which has regs. It starts dumping based on regs->ip and
regs->bp, so the regs themselves aren't dumped.

But for cases where regs are in the middle of the stack, they aren't
detected today, and you'll still see the value of regs->ip dumped with a
question mark.

That said, with this patch, now that regs in the middle of the stack
*are* being printed, I can't think of a good reason to print the return
address twice: both in regs and the stack trace. So removing it from
the stack trace is fine with me.

> That being said, I'm still confused by the question marks. What
> exactly is going on? Is the code really doing the right thing wrt
> resuming the unwind? Is there a git tree with these patches applied
> somewhere so I can look at it easily in context?

show_trace_log_lvl() is doing two things in parallel: scanning all
kernel text addresses on the stack while simultaneously using the
unwinder to walk the frame pointers. Only those scanned addresses which
are also found by the unwinder are printed without question marks.

The pt_regs aren't in a frame of their own; they're just data inside of
a bigger frame. (You may recall that you objected to my proposal to put
them in their own frame :-)) So that's why the address stored in
regs->ip was printed with a question mark: it's not in the header of a
real frame; it's just data.

I pushed the code out at:

https://github.com/jpoimboe/linux unwind-v1

See show_trace_log_lvl() in arch/x86/kernel/dumpstack.c.

--
Josh