Re: ppc64le reliable stack unwinder and scheduled tasks
From: Nicolai Stange
Date: Mon Jan 14 2019 - 02:21:45 EST
Joe Lawrence <joe.lawrence@xxxxxxxxxx> writes:
> On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
>> Joe Lawrence <joe.lawrence@xxxxxxxxxx> writes:
>>
>> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>
> [ ... snip ... ]
>
>> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> >> from _switch() helps?
>> >>
>> >> I.e. something like (completely untested):
>> >
>> > I'll kick off some builds tonight and try to get tests lined up
>> > tomorrow. Unfortunately these take a bit of time to run setup, schedule
>> > and complete, so perhaps by next week.
>>
>> Ok, that's probably still a good test for confirmation, but the solution
>> you proposed below is much better.
>
> Good news, 40 tests (RHEL-7 backport) with clearing out the
> STACK_FRAME_MARKER were all successful. Previously I would see stalled
> livepatch transitions due to the unreliable report in ~10% of test
> cases.
>
>> >> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> >> index 435927f549c4..b747d0647ec4 100644
>> >> --- a/arch/powerpc/kernel/entry_64.S
>> >> +++ b/arch/powerpc/kernel/entry_64.S
>> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>> >> SAVE_8GPRS(14, r1)
>> >> SAVE_10GPRS(22, r1)
>> >> std r0,_NIP(r1) /* Return to switch caller */
>> >> +
>> >> + li r23,0
>> >> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> >> +
>> >> mfcr r23
>> >> std r23,_CCR(r1)
>> >> std r1,KSP(r3) /* Set old stack pointer */
>> >>
>> >>
>> >
>> > This may be sufficient to avoid the condition, but if the contents of
>> > frame 0 are truely uninitialized (not to be trusted), should the
>> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
>> > aside from the LR and other stack size geometry sanity checks?
>>
>> That's a very good point: we'll only ever be examining scheduled out tasks
>> and current (which at that time is running klp_try_complete_transition()).
>>
>> current won't be in an interrupt/exception when it's walking its own
>> stack. And scheduled out tasks can't have an exception/interrupt frame
>> as their frame 0, correct?
>>
>> Thus, AFAICS, whenever klp_try_complete_transition() finds a
>> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
>> said.
>
> I gave this about 20 tests and they were also all successful, what do
> you think? (The log could probably use some trimming and cleanup.) I
> think either solution would fix the issue.
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
> From: Joe Lawrence <joe.lawrence@xxxxxxxxxx>
> Date: Fri, 11 Jan 2019 10:25:26 -0500
> Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
> check
>
> The ppc64le reliable stack tracer iterates over a given task's stack,
> verifying a set of conditions to determine whether the trace is
> 'reliable'. These checks include the presence of any exception frames.
>
> We should be careful when inspecting the bottom-most stack frame (the
> first to be unwound), particularly for scheduled-out tasks. As Nicolai
> Stange explains, "If I'm reading the code in _switch() correctly, the
> first frame is completely uninitialized except for the pointer back to
> the caller's stack frame." If a previous do_IRQ() invocation, for
> example, has left a residual exception-marker on the first frame, the
> stack tracer would incorrectly report this task's trace as unreliable.
>
FWIW, it's not been do_IRQ() who wrote the exception marker, but it's
caller hardware_interrupt_common(), more specifically the
EXCEPTION_PROLOG_COMMON_3 part therof.
> save_stack_trace_tsk_reliable() already skips the first frame when
> verifying the saved Link Register. It should do the same when looking
> for the STACK_FRAME_REGS_MARKER. The task it is unwinding will be
> either 'current', in which case the tracer will have never been called
> from an exception path, or the task will be inactive and its first
> frame's contents will be uninitialized (aside from backchain pointer).
I thought about this a little more and can't see anything that would
prevent higher, i.e. non-_switch() frames to also alias with prior
exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
frame's "parameter area" and most functions probably don't initialize
this either. So, AFAICS, higher stack frames could potentially be
affected by the very same problem.
I think the best solution would be to clear the STACK_FRAME_REGS_MARKER
upon exception return. I have a patch ready for that and will post it
after it has passed some basic testing -- hopefully later this day.
That being said, I still think that your patch should also get applied
in some form -- looking at unitialized memory is just not a good thing
to do.
>
> Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
> Signed-off-by: Joe Lawrence <joe.lawrence@xxxxxxxxxx>
> ---
> arch/powerpc/kernel/stacktrace.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
> index e2c50b55138f..0793e75c45a6 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
> EXPORT_SYMBOL_GPL(save_stack_trace_regs);
>
> #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> +/*
> + * This function returns an error if it detects any unreliable features of the
> + * stack. Otherwise it guarantees that the stack trace is reliable.
> + *
> + * If the task is not 'current', the caller *must* ensure the task is inactive.
> + */
> int
> save_stack_trace_tsk_reliable(struct task_struct *tsk,
> struct stack_trace *trace)
> @@ -143,7 +149,7 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
> return 1;
>
> /* Mark stacktraces with exception frames as unreliable. */
> - if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
> + if (!firstframe && sp <= stack_end - STACK_INT_FRAME_SIZE &&
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
> return 1;
> }
I would perhaps not limit this to the STACK_FRAME_REGS_MARKER, but also
not emit the ip obtained from the first frame into the resulting trace.
I.e., how about moving all the sp/newsp handling to the beginning of the
loop and doing an 'if (firstframe) continue;' right after that?
Thanks,
Nicolai
--
SUSE Linux GmbH, GF: Felix ImendÃrffer, Jane Smithard, Graham Norton,
HRB 21284 (AG NÃrnberg)