Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
From: Andy Lutomirski
Date: Wed Jun 22 2016 - 13:59:48 EST
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
>> On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe@xxxxxxxxxx> wrote:
>> > > Maybe I'm coming around to liking this idea.
>> >
>> > Ok, good :-)
>> >
>> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
>> > > printer, etc), unwinding across a kernel exception would look like:
>> > >
>> > > - some_func
>> > > - some_other_func
>> > > - do_page_fault
>> > > - page_fault
>> > >
>> > > After page_fault, the next unwind step takes us to the faulting RIP
>> > > (faulting_func) and reports that all GPRs are known. It should
>> > > probably learn this fact from DWARF if DWARF is available, instead of
>> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
>> > > may be incomplete. A nice pretty printer could now print all the
>> > > regs.
>> > >
>> > > - faulting_func
>> > > - etc.
>> > >
>> > > For this to work, we don't actually need the unwinder to explicitly
>> > > know where pt_regs is.
>> >
>> > That's true (but only for DWARF).
>> >
>> > > Food for thought, though: if user code does SYSENTER with TF set,
>> > > you'll end up with partial pt_regs. There's nothing the kernel can do
>> > > about it. DWARF will handle it without any fanfare, but I don't know
>> > > if it's going to cause trouble for you pre-DWARF.
>> >
>> > In this case it should see the stack pointer is past the pt_regs offset,
>> > so it would just report it as an empty stack.
>>
>> OK
>>
>> >
>> > > I'm also not sure it makes sense to apply this before the unwinder
>> > > that can consume it is ready. Maybe, if it would be consistent with
>> > > your plans, it would make sense to rewrite the unwinder first, then
>> > > apply this and teach live patching to use the new unwinder, and *then*
>> > > add DWARF support?
>> >
>> > For the purposes of livepatch, the reliable unwinder needs to detect
>> > whether an interrupt/exception pt_regs frame exists on a sleeping task
>> > (or current). This patch would allow us to do that.
>> >
>> > So my preferred order of doing things would be:
>> >
>> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
>> > 2) this patch for annotating pt_regs stack frames
>> > 3) reliable unwinder, similar to what I already posted, except it relies
>> > on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>> > the new inactive task frame format of #1
>> > 4) livepatch consistency model which uses the reliable unwinder
>> > 5) rewrite unwinder, and port all users to the new interface
>> > 6) add DWARF unwinder
>> >
>> > 1-4 are pretty much already written, whereas 5 and 6 will take
>> > considerably more work.
>>
>> Fair enough.
>>
>> >
>> > > > + /*
>> > > > + * Create a stack frame for the saved pt_regs. This allows frame
>> > > > + * pointer based unwinders to find pt_regs on the stack.
>> > > > + */
>> > > > + .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > + pushq \regs
>> > > > + pushq $pt_regs+1
>> > >
>> > > Why the +1?
>> >
>> > Some unwinders like gdb are smart enough to report the function which
>> > contains the instruction *before* the return address. Without the +1,
>> > they would show the wrong function.
>>
>> Lovely. Want to add a comment?
>>
>> >
>> > > > + pushq %rbp
>> > > > + movq %rsp, %rbp
>> > > > +#endif
>> > > > + .endm
>> > >
>> > > I keep wanting this to be only two pushes and to fudge rbp to make it
>> > > work, but I don't see a good way. But let's call it
>> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
>> > > nested_frame or similar.
>> >
>> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
>> > more specific name. CREATE_INTERRUPT_FRAME and interrupt_frame()?
>>
>> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
>> too. CREATE_PT_REGS_FRAME is probably fine.
>>
>> > > > +
>> > > > +/* fake function which allows stack unwinders to detect pt_regs frames */
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +ENTRY(pt_regs)
>> > > > + nop
>> > > > + nop
>> > > > +ENDPROC(pt_regs)
>> > > > +#endif /* CONFIG_FRAME_POINTER */
>> > >
>> > > Why is this two bytes long? Is there some reason it has to be more
>> > > than one byte?
>> >
>> > Similar to above, this is related to the need to support various
>> > unwinders. Whether the unwinder displays the ret_addr or the
>> > instruction preceding it, either way the instruction needs to be inside
>> > the function for the function to be reported.
>>
>> OK.
>
> Andy,
>
> So I got a chance to look at this some more. I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: preempt_schedule_irq, CALL_enter_from_user_mode,
> prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> etc. That way, the unwinder will always be able to find pt_regs from an
> interrupt/exception, even if starting from one of these other calls.
>
> But then, things get ugly. You have to either setup and tear down the
> frame for every possible call, or do a higher-level setup/teardown
> across multiple calls, which invalidates several assumptions in the
> entry code about the location of pt_regs on the stack.
>
> Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> make assumptions about the location of pt_regs. And they're used by
> both syscall and interrupt code. So if we didn't create a frame pointer
> header for syscalls, we'd basically need two versions of the macros: one
> for irqs/exceptions and one for syscalls.
>
> So I think the cleanest way to handle this is to always allocate two
> extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK. Then all
> entry code can assume that pt_regs is at a constant location, and all
> the above problems go away. Another benefit is that we'd only need two
> saves instead of three -- the pointer to pt_regs is no longer needed
> since pt_regs is always immediately after the frame header.
>
> I worked up a patch to implement this -- see below. It writes the frame
> pointer in all entry paths, including syscalls. This helps keep the
> code simple.
>
> The downside is a small performance penalty: with getppid()-in-a-loop on
> my laptop, the average syscall went from 52ns to 53ns, which is about a
> 2% slowdown. But I doubt it would be measurable in a real-world
> workload.
>
> It looks like about half the slowdown is due to the extra stack
> allocation (which presumably adds a little d-cache pressure on the stack
> memory) and the other half is due to the stack writes.
>
> I could remove the writes from the syscall path but it would only save
> about half a ns, and it would make the code less robust. Plus it's nice
> to have the consistency of having *all* pt_regs frames annotated.
This is a bit messy, and I'm not really sure that the entry code
should be have to operate under constraints like this. Also,
convincing myself this works for NMI sounds unpleasant.
Maybe we should go back to my idea of just listing the call sites in a table.