Re: [PATCH] arm64: stacktrace: Stop unwinding when the PC is zero

From: Mark Rutland
Date: Thu Apr 29 2021 - 06:48:33 EST


Hi Leo,

On Thu, Apr 29, 2021 at 09:43:21AM +0800, Leo Yan wrote:
> When use ftrace for stack trace, it reports the spurious frame with the
> PC value is zero. This can be reproduced with commands:
>
> # cd /sys/kernel/debug/tracing/
> # echo "prev_pid == 0" > events/sched/sched_switch/filter
> # echo stacktrace > events/sched/sched_switch/trigger
> # echo 1 > events/sched/sched_switch/enable
> # cat trace
>
> <idle>-0 [005] d..2 259.621390: sched_switch: ...
> <idle>-0 [005] d..3 259.621394: <stack trace>
> => __schedule
> => schedule_idle
> => do_idle
> => cpu_startup_entry
> => secondary_start_kernel
> => 0

IIUC, this is my fault, and is an unintended side-effect of commit:

6106e1112cc69a36 ("arm64: remove EL0 exception frame record")

... since prior to that commit, we'd implicitly create a terminal record
in start_kernel and secondary_start_kernel by virtue of entering those
functions with both FP and LR set to NULL. After that commit, we report
the NULL LR before trying to unwind the NULL FP.
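
To illustrate the ordering, here is a minimal userspace model of the walk
loop. It is a sketch only, not the kernel code: struct record, struct
frame, the model_*() helpers, the terminal_check flag and the 0x1000/0x2000
addresses are all made up for illustration. With terminal_check set, the
unwind step treats a record with NULL FP and NULL LR as terminal (roughly
the old behaviour); with it clear, the NULL LR gets reported first and the
walk only stops on the following attempt to unwind the NULL FP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A frame record as laid out on the stack: the saved FP, then the saved LR. */
struct record {
	const struct record *fp;
	uintptr_t lr;
};

/* Unwinder state: the current FP and the PC that will be reported next. */
struct frame {
	const struct record *fp;
	uintptr_t pc;
};

/* Step to the caller's frame; a negative return means "stop walking". */
static int model_unwind(struct frame *f, int terminal_check)
{
	const struct record *r = f->fp;

	if (!r)
		return -1;		/* a NULL FP cannot be unwound */

	f->fp = r->fp;
	f->pc = r->lr;

	/* Hypothetical terminal-record check: FP and LR both NULL. */
	if (terminal_check && !f->fp && !f->pc)
		return -1;

	return 0;
}

/* Same shape as the walk loop: report the PC first, then unwind. */
static size_t model_walk(struct frame f, int terminal_check,
			 uintptr_t *out, size_t max)
{
	size_t n = 0;

	assert(out);
	while (n < max) {
		out[n++] = f.pc;
		if (model_unwind(&f, terminal_check) < 0)
			break;
	}
	return n;
}

/* Two-deep stack whose outermost function was entered with FP = LR = NULL. */
static size_t model_idle_trace(int terminal_check, uintptr_t *out, size_t max)
{
	static const struct record outer = { NULL, 0 };
	static const struct record inner = { &outer, 0x1000 };
	struct frame f = { &inner, 0x2000 };

	return model_walk(f, terminal_check, out, max);
}
```

Calling model_idle_trace() with terminal_check == 0 records three entries,
the last of which is 0 (the spurious "=> 0" line); with terminal_check == 1
it records only the two real entries.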

> The kernel initializes FP/PC values as zero for swapper threads in
> head.S, when walk the stack frame, this patch stops unwinding if detect
> the PC value is zero, therefore can avoid the spurious frame.
>
> Below is the stacktrace after applying the change:
>
> # cat trace
>
> <idle>-0 [005] d..2 259.621390: sched_switch: ...
> <idle>-0 [005] d..3 259.621394: <stack trace>
> => __schedule
> => schedule_idle
> => do_idle
> => cpu_startup_entry
> => secondary_start_kernel
>
> Signed-off-by: Leo Yan <leo.yan@xxxxxxxxxx>
> ---
> arch/arm64/kernel/stacktrace.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 84b676bcf867..02b1e85b2026 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -145,7 +145,11 @@ void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
>  		if (!fn(data, frame->pc))
>  			break;
>  		ret = unwind_frame(tsk, frame);
> -		if (ret < 0)
> +		/*
> +		 * When the frame->pc is zero, it has reached to the initial pc
> +		 * and fp values; stop unwinding for this case.
> +		 */
> +		if (ret < 0 || !frame->pc)
>  			break;

I don't think this is the right place for this, since we intend
unwind_frame() to detect when unwinding is finished; see commit:

3c02600144bdb0a1 ("arm64: stacktrace: Report when we reach the end of the stack")

I think we have three options for what to do here:

a) Revert 6106e1112cc69a36, and identify these cases as terminal records
where FP and LR are both NULL.

b) Have __primary_switched and __secondary_switched call start_kernel
and secondary_start_kernel with BL rather than B. The __*_switched
functions will show up in the trace, but we won't unwind any further
as the next record will have a NULL FP.

c) Revert 6106e1112cc69a36, create terminal records in
__primary_switched and __secondary_switched, and call start_kernel
and secondary_start_kernel with BL rather than B. The __*_switched
functions will show up in the trace, but we won't unwind any further
as the next record will be a terminal record.

For RELIABLE_STACKTRACE, we're going to have to do (c), I think, but for
now we could do (a) so as to have a minimal fix, and we can build (c)
atop that.

How about the patch below? I've tested it with your instructions and
also by inspecting /proc/self/stack.

Thanks,
Mark.

---->8----