Re: sched/debug: Dump end of stack when detected corrupted

From: Feng Tang
Date: Tue Sep 03 2024 - 23:01:02 EST


Hi Adrian,

On Tue, Sep 03, 2024 at 06:33:55PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Feng,
>
> > When debugging a kernel hang during suspend/resume, random memory
> > corruptions showed up in different places, one of them detected by the
> > scheduler with the error message:
> >
> > "Kernel panic - not syncing: corrupted stack end detected inside scheduler"
> >
> > Dumping the corrupted memory around the stack end gives more direct
> > hints about how the memory was corrupted:
> >
> > "
> > Corrupted Stack: ff11000122770000: ff ff ff ff ff ff 14 91 82 3b 78 e8 08 00 45 00 .........;x...E.
> > Corrupted Stack: ff11000122770010: 00 1d 2a ff 40 00 40 11 98 c8 0a ef 30 2c 0a ef ..*.@.@.....0,..
> > Corrupted Stack: ff11000122770020: 30 ff a2 00 22 3d 00 09 9a 95 2a 00 00 00 00 00 0..."=....*.....
> > ...
> > Kernel panic - not syncing: corrupted stack end detected inside scheduler
> > "
> >
> > And with it, the culprit was quickly identified to be an ethernet
> > driver with its DMA operations.
> >
> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > ---
> > kernel/sched/core.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a795e030678c..1280f7012bc5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5949,8 +5949,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
> > static inline void schedule_debug(struct task_struct *prev, bool preempt)
> > {
> > #ifdef CONFIG_SCHED_STACK_END_CHECK
> > - if (task_stack_end_corrupted(prev))
> > + if (task_stack_end_corrupted(prev)) {
> > + unsigned long *ptr = end_of_stack(prev);
> > +
> > + /* Dump 16 ulong words around the corruption point */
> > +#ifdef CONFIG_STACK_GROWSUP
> > + ptr -= 15;
> > +#endif
> > + print_hex_dump(KERN_ERR, "Corrupted Stack: ",
> > + DUMP_PREFIX_ADDRESS, 16, 1, ptr, 16 * sizeof(*ptr), 1);
> > +
> > panic("corrupted stack end detected inside scheduler\n");
> > + }
> >
> > if (task_scs_end_corrupted(prev))
> > panic("corrupted shadow stack detected inside scheduler\n");
>
> Have you gotten any feedback on this? Would be nice to get this merged as we're
> seeing crashes due to stack corruption on sparc from time to time and having the
> end of the stack dumped in such cases would make debugging here a bit easier.

Thanks for the review and the feedback! So far I haven't gotten any
response from the maintainers.

Hi Peter and maintainers,

Could you help review this patch? It makes debugging these nasty memory
corruption issues easier. Thanks!

There is a v2 that applies to the latest linux-next branch:
https://lore.kernel.org/lkml/20240207143523.438816-1-feng.tang@xxxxxxxxx/
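
For anyone who wants to see which words the dump covers without booting a
test kernel, here is a minimal userspace sketch of the pointer selection in
the patch. It is not kernel code: hex_dump() is a hypothetical stand-in for
print_hex_dump(), dump_start() and demo_corruption() are illustrative names,
and the 64-word stack size is an arbitrary assumption. Only the "16 words,
stepping back 15 when the stack grows up" logic mirrors the patch.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define STACK_WORDS     64            /* assumed size, for illustration only */
#define DUMP_WORDS      16            /* the patch dumps 16 ulong words */
#define STACK_END_MAGIC 0x57AC6E9DUL  /* canary value the kernel places at the stack end */

/*
 * Hypothetical helper: return the first word to dump.
 * Stack grows down: the end marker is the lowest word, so dump upward from it.
 * Stack grows up: the end marker is the highest word, so step back 15 words
 * and let the 16-word dump finish on the marker (the "ptr -= 15" in the patch).
 */
static const unsigned long *dump_start(const unsigned long *stack,
                                       size_t nwords, int grows_up)
{
    const unsigned long *end = grows_up ? stack + nwords - 1 : stack;

    return grows_up ? end - (DUMP_WORDS - 1) : end;
}

/* Userspace stand-in for print_hex_dump(): 16 bytes per row, hex plus ASCII. */
static void hex_dump(const char *prefix, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t off = 0; off < len; off += 16) {
        size_t n = len - off < 16 ? len - off : 16;

        printf("%s%p: ", prefix, (void *)(p + off));
        for (size_t i = 0; i < n; i++)
            printf("%02x ", p[off + i]);
        for (size_t i = 0; i < n; i++)
            putchar(isprint(p[off + i]) ? p[off + i] : '.');
        putchar('\n');
    }
}

/* Simulate a stray write over the end marker; return 1 if a dump was printed. */
static int demo_corruption(void)
{
    unsigned long stack[STACK_WORDS] = {0};

    stack[0] = STACK_END_MAGIC;   /* grows-down layout: marker at the lowest word */
    memset(stack, 0xff, 6);       /* pretend a DMA write clobbered the first 6 bytes */

    if (stack[0] == STACK_END_MAGIC)
        return 0;

    hex_dump("Corrupted Stack: ", dump_start(stack, STACK_WORDS, 0),
             DUMP_WORDS * sizeof(stack[0]));
    return 1;
}
```

Calling demo_corruption() from a small main() prints rows in the same
"Corrupted Stack: <address>: <hex> <ascii>" layout as the log above, starting
at the clobbered marker word.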

- Feng