Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

From: Andy Lutomirski
Date: Mon May 02 2016 - 11:53:13 EST

On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@xxxxxxxxxx> wrote:
>> >
>> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> > > >> I suppose we could try to rejigger the code so that rbp points to
>> > > >> pt_regs or similar.
>> > > >
>> > > > I think we should avoid doing something like that because it would break
>> > > > gdb and all the other unwinders who don't know about it.
>> > >
>> > > How so?
>> > >
>> > > Currently, rbp in the entry code is meaningless. I'm suggesting that,
>> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> > > the pt_regs. Currently it points to something stale (which the
>> > > dump_stack code might be relying on. Hmm.) But it's probably also
>> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> > > is the next thing on the stack, so just doing the section thing would
>> > > work.
>> >
>> > Yes, rbp is meaningless on the entry from user space. But if an
>> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> > nested entry, rbp keeps its old value, right? So the unwinder can walk
>> > past the nested entry frame and keep going until it gets to the original
>> > entry.
>> Yes.
>> It would be nice if we could do better, though, and actually notice
>> the pt_regs and identify the entry. For example, I'd love to see
>> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> crash.
>> Also, I think that just following rbp links will lose the
>> actual function that took the page fault (or whatever function
>> pt_regs->ip actually points to).
> Hm. I think we could fix all that in a more standard way. Whenever a
> new pt_regs frame gets saved on entry, we could also create a new stack
> frame which points to a fake kernel_entry() function. That would tell
> the unwinder there's a pt_regs frame without otherwise breaking frame
> pointers across the frame.
> Then I guess we wouldn't need my other solution of putting the idt
> entries in a special section.
> How does that sound?

Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back. Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it. (It's in pt_regs, but the rbp chain skips right over that.)

We could change the entry code so that an interrupt / idtentry does:

push pt_regs
push kernel_entry
push %rbp
mov %rsp, %rbp
call handler
pop %rbp
addq $8, %rsp

or similar. That would make it appear that the actual C handler was
caused by a dummy function "kernel_entry". Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, which only solves part of the problem. We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding. This would be nice, and I think it could work.

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.

I could be convinced either way.

>> Have you looked at my vdso unwinding test at all? If we could do
>> something similar for the kernel, IMO it would make testing much more
>> pleasant.
> I found it, but I'm not sure what it would mean to do something similar
> for the kernel. Do you mean doing something like an NMI sampling-based
> approach where we periodically do a random stack sanity check?

I was imagining something a little more strict: single-step
interesting parts of the kernel and make sure that each step unwinds
correctly. That could detect missing frames and similar.