ppc64le reliable stack unwinder and scheduled tasks

From: Joe Lawrence
Date: Thu Jan 10 2019 - 16:14:05 EST


Hi all,

tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
about?

I am looking at a bug in which ~10% of livepatch tests on RHEL-7 and
RHEL-8 distro kernels, the ppc64le reliable stack unwinder consistently
reports an unreliable stack for a given task. This condition can
eventually resolve itself, but sometimes this state remains wedged for
hours and I can live-debug the system with crash-utility.

I have tried reproducing with upstream 4.20 kernel, but haven't seen
this exact condition yet. More on upstream in a bit.

That said, I have a question about the ppc64 stack frame layout with
regard to scheduled tasks. In each sticky "unreliable" stack trace
instance that I've looked at, the task's stack has an interesting
frame at the top:


Example 1 (RHEL-7)
==================

crash> struct task_struct.thread c00000022fd015c0 | grep ksp
ksp = 0xc0000000288af9c0

crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:

c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............

crash> sym c000000000002a94
c000000000002a94 (T) hardware_interrupt_common+0x114

c0000000288af9e0: c000000001c60a80 0000000000000000 ................
c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......

Uh-oh...

/* Mark stacktraces with exception frames as unreliable. */
stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER

c0000000288afa30: c0000000001240ac c0000000288afcb0 .@.........(....
c0000000288afa40: c0000000013e4d00 0000000000000000 .M>.............
c0000000288afa50: 0000000000000001 0000000000000001 ................
c0000000288afa60: c0000000014322e0 0000000000000804 ."C.............
c0000000288afa70: c0000000288ac080 c00000022fd015c0 ...(......./....
c0000000288afa80: c0000000288afae0 00000000ffffffff ...(............
c0000000288afa90: c0000000288afae0 c000000007b80900 ...(............
c0000000288afaa0: c000000000e90a00 c000000000e90a00 ................
c0000000288afab0: c0000000001240ac c000000000e90a00 .@..............
c0000000288afac0: c000000000e90a00 c000000004790580 ..........y.....
c0000000288afad0: 0000000000000001 c000000000e90a00 ................
c0000000288afae0: 0000000000000000 0000000000000004 ................
c0000000288afaf0: c00000022fd01ad0 c000000000e73be0 .../.....;......
c0000000288afb00: c00000023900f450 c000000001488190 P..9......H.....
c0000000288afb10: 0000000000ad0000 c00000023900ef40 ........@..9....
c0000000288afb20: c0000000288ac000 c000000000e73be0 ...(.....;......
c0000000288afb30: c00000000001b514 c00000022fd01628 ........(../....
c0000000288afb40: c0000000288afbb0 c000000000008800 ...(............
c0000000288afb50: c000000000162880 0000000000000000 .(..............
c0000000288afb60: 00000000240f3022 0000000000000004 "0.$............
c0000000288afb70: c000000000e90a00 c00000022fd01a20 ........ ../....
c0000000288afb80: c0000000288afbf0 c0000000014322e0 ...(....."C.....

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:

c0000000288afb90: c0000000288afbf0 c000000001960a00 ...(............
c0000000288afba0: c00000000001b514 0000000000000004 ................

crash> sym c00000000001b514
c00000000001b514 (T) __switch_to+0x264

c0000000288afbb0: c000000000e90a00 0000000000000000 ................
c0000000288afbc0: c0000000288ac000 c0000000014322e0 ...(....."C.....
c0000000288afbd0: c000000000e90a00 c000000001960a00 ................
c0000000288afbe0: c00000022fd015c0 c00000023900ef40 .../....@..9....

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:

c0000000288afbf0: c0000000288afcd0 c000000000003300 ...(.....3......
c0000000288afc00: c000000000a83918 c0000000013e4d00 .9.......M>.....

crash> sym c000000000a83918
c000000000a83918 (t) __schedule+0x428

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[3]:

c0000000288afcd0: c0000000288afd80 0000000000003300 ...(.....3......
c0000000288afce0: c0000000001240ac 0000000000000000 .@..............

crash> sym c0000000001240ac
c0000000001240ac (t) rescuer_thread+0x3ac

[ ... and so on as we would normally expect ... ]


Example 2 - (RHEL-7)
====================

crash> struct task_struct.thread c00000004a660a00 | grep ksp
ksp = 0xc0000005e85439c0,

crash> rd 0xc0000005e8543b90 -e 0xc0000005e8544000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:
c0000005e85439c0: c0000005e8543b90 0000000000000000 .;T.............
c0000005e85439d0: c000000000002a94 0000000000000000 .*..............

crash> sym c000000000002a94
c000000000002a94 (T) hardware_interrupt_common+0x114

c0000005e8543a20: 7265677368657265 c0000000013e4d00 erehsger.M>.....

/* Mark stacktraces with exception frames as unreliable. */
stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:
c0000005e8543b90: c0000005e8543bf0 c000000001960a00 .;T.............
c0000005e8543ba0: c00000000001b514 0000000000000004 ................

crash> sym c00000000001b514
c00000000001b514 (T) __switch_to+0x264

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:
c0000005e8543bf0: c0000005e8543cd0 c000000000003300 .<T......3......
c0000005e8543c00: c000000000a83918 c0000000013e4d00 .9.......M>.....

crash> sym c000000000a83918
c000000000a83918 (t) __schedule+0x428

[ ... and so on as we would normally expect ... ]


Example 3 (upstream 4.20)
=========================

This is not a repro of the sticky "unreliable" stack trace, but just the
stack of a workqueue that fired and called msleep.


crash> struct task_struct.thread c0000002257f0e00 | grep ksp
ksp = 0xc00000022715f9b0,

crash> rd 0xc00000022715f9b0 100

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:

c00000022715f9b0: c00000022715fb90 c00000023f4e8280 ...'......N?....
c00000022715f9c0: c00000022715f9e0 0000000000000003 ...'............

crash> sym c00000022715f9e0
c00000022715f9e0 (B) _end+0x22594f9e0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:

c00000022715fb90: c00000022715fbf0 c00000000158634c ...'....LcX.....
c00000022715fba0: c00000000001ebfc c000000001318a60 ........`.1.....

crash> sym c00000000001ebfc
c00000000001ebfc (T) __switch_to+0x2dc

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:

c00000022715fbf0: c00000022715fcd0 c000000000008800 ...'............
c00000022715fc00: c000000000abcf40 c0000000013abf00 @.........:.....

crash> sym c000000000abcf40
c000000000abcf40 (t) __schedule+0x3b0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[3]:

c00000022715fcd0: c00000022715fd00 2c50fff000000000 ...'..........P,
c00000022715fce0: c000000000abd628 c00000023bff57e8 (........W.;....

crash> sym c000000000abd628
c000000000abd628 (T) schedule+0x48

[ ... etc ... ]


save_stack_trace_tsk_reliable
=============================

arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
take into account the first stackframe, but only to verify that the link
register is indeed pointing at kernel code address.

Can someone explain what __switch_to is doing with the stack and whether
in such circumstances, the reliable stack unwinder should be skipping
the first frame when verifying the frame contents like STACK_FRAME_MARKER, etc.

I may be on the wrong path in debugging this, but figuring out this
sp[0] frame state would be helpful.

Thanks,

-- Joe