Re: [PATCH v10 05/11] arm64: Make dump_stacktrace() use arch_stack_walk()

From: Mark Rutland
Date: Mon Oct 25 2021 - 12:49:35 EST


On Thu, Oct 14, 2021 at 09:58:41PM -0500, madvenka@xxxxxxxxxxxxxxxxxxx wrote:
> From: "Madhavan T. Venkataraman" <madvenka@xxxxxxxxxxxxxxxxxxx>
>
> Currently, dump_stacktrace() in ARM64 code walks the stack using
> start_backtrace() and unwind_frame(). Make it use arch_stack_walk()
> instead. This makes maintenance easier.
>
> Signed-off-by: Madhavan T. Venkataraman <madvenka@xxxxxxxxxxxxxxxxxxx>
> ---
> arch/arm64/kernel/stacktrace.c | 44 +++++-----------------------------
> 1 file changed, 6 insertions(+), 38 deletions(-)
>
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 8982a2b78acf..776c4debb5a7 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -151,24 +151,20 @@ void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
> }
> NOKPROBE_SYMBOL(walk_stackframe);
>
> -static void dump_backtrace_entry(unsigned long where, const char *loglvl)
> +static bool dump_backtrace_entry(void *arg, unsigned long where)
> {
> + char *loglvl = arg;
> printk("%s %pSb\n", loglvl, (void *)where);
> + return true;
> }
>
> void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk,
> const char *loglvl)
> {
> - struct stackframe frame;
> - int skip = 0;
> -
> pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>
> - if (regs) {
> - if (user_mode(regs))
> - return;
> - skip = 1;
> - }
> + if (regs && user_mode(regs))
> + return;
>
> if (!tsk)
> tsk = current;
> @@ -176,36 +172,8 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk,
> if (!try_get_task_stack(tsk))
> return;
>
> - if (tsk == current) {
> - start_backtrace(&frame,
> - (unsigned long)__builtin_frame_address(0),
> - (unsigned long)dump_backtrace);
> - } else {
> - /*
> - * task blocked in __switch_to
> - */
> - start_backtrace(&frame,
> - thread_saved_fp(tsk),
> - thread_saved_pc(tsk));
> - }
> -
> printk("%sCall trace:\n", loglvl);
> - do {
> - /* skip until specified stack frame */
> - if (!skip) {
> - dump_backtrace_entry(frame.pc, loglvl);
> - } else if (frame.fp == regs->regs[29]) {
> - skip = 0;
> - /*
> - * Mostly, this is the case where this function is
> - * called in panic/abort. As exception handler's
> - * stack frame does not contain the corresponding pc
> - * at which an exception has taken place, use regs->pc
> - * instead.
> - */
> - dump_backtrace_entry(regs->pc, loglvl);
> - }
> - } while (!unwind_frame(tsk, &frame));
> + arch_stack_walk(dump_backtrace_entry, (void *)loglvl, tsk, regs);

This looks really nice!

Unfortunately, removing the `skip` logic higlights a latent issue with
the unwinder (which we previously worked around here but not elsewhere),
whreby we can report erroneous entries when unwinding from regs, because
the fgraph ret stack can have entries added between exception entry and
performing unwind steps.

With this patch as-is, if you enable the function graph tracer, then use
magic-sysrq l, you can see something like:

| sysrq: Show backtrace of all active CPUs
| sysrq: CPU0:
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc4-00005-g9097969cd989 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_cpu_idle+0x1c/0x30
| lr : arch_cpu_idle+0x14/0x30
| sp : ffffc74542823d50
| x29: ffffc74542823d50 x28: 000000004129444c x27: 0000000000000000
| x26: ffffc74542833f00 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffc74542829b10 x22: ffffc7454213ccb8 x21: ffffc745428299e8
| x20: ffffc74542829ae0 x19: ffffc7454212d000 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000001 x12: 0000000002c00ff0
| x11: 0000000002c00000 x10: ffffc74542823d30 x9 : ffff26b8cc7ba600
| x8 : 0000000000000001 x7 : 0000000000000089 x6 : 000000291c72cca9
| x5 : ffff5f74b4953000 x4 : 0000000000004a09 x3 : ffff5f74b4953000
| x2 : 0000000000004a09 x1 : ffff26b9f6a91e20 x0 : 00000000000000e0
| Call trace:
| arch_cpu_idle+0x1c/0x30
| default_idle_call+0x4c/0x19c
| dump_backtrace+0x30/0x3c
| show_regs+0x38/0x50
| rest_init+0xf0/0x100
| arch_call_rest_init+0x1c/0x28
| start_kernel+0x69c/0x6dc
| __primary_switched+0xc0/0xc8

The 'dump_backtrace` and `show_regs` entries don't belong there, and are
hiding the real entries at that part of the callchain.

I think the fix for that is something like the below, which we should
take as a preparatory fix, but as this could trigger warnings in
legitimate usage scenarios we'll need to think a bit harder about the
case of unwinding from regs, and whether the WARN_ON_ONCE() is
warranted.

Thanks,
Mark.

---->8----