Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

From: Josh Poimboeuf
Date: Mon May 02 2016 - 13:32:04 EST

On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@xxxxxxxxxx> wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs. Currently it points to something stale (which the
> >> > > dump_stack code might be relying on. Hmm.) But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space. But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry. For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm. I think we could fix all that in a more standard way. Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function. That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> Let me try to understand.
> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back. Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it. (It's in pt_regs, but the rbp chain skips right over that.)
> We could change the entry code so that an interrupt / idtentry does:
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> or similar. That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry". Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem. We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding. This would be nice, and I think it could work.

Yeah, that's about what I had in mind.

> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.

Hm, this I don't really follow. It's true that the unwinder can easily
find RIP from pt_regs, which will always be a known offset from the
kernel_entry pointer on the stack. But why would having the entry code
in a separate section make that faster?

> But this is pretty simple and performance is already abysmal in most
> handlers.
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.

Yeah, that could be a nice feature... but doesn't printing the name of
the C handler pretty much already give that information?

In any case, once we have a working DWARF unwinder, I think it will show
the name of the idt entry anyway.

> >> Have you looked at my vdso unwinding test at all? If we could do
> >> something similar for the kernel, IMO it would make testing much more
> >> pleasant.
> >
> > I found it, but I'm not sure what it would mean to do something similar
> > for the kernel. Do you mean doing something like an NMI sampling-based
> > approach where we periodically do a random stack sanity check?
> I was imagining something a little more strict: single-step
> interesting parts of the kernel and make sure that each step unwinds
> correctly. That could detect missing frames and similar.

Interesting idea, though I wonder how hard it would be to reliably
distinguish a missing frame from the case where gcc decides to inline a

Another idea to detect missing frames: for each return address on the
stack, ensure there's a corresponding "call <func>" instruction
immediately preceding the return location, where <func> matches what's
on the stack.

That wouldn't work so well for indirect calls using function pointers,
but even then maybe we could use the DWARF CFI to find the function
pointer value and validate that it matches the stack function.