[PATCH] sched/debug: Dump end of stack when detected corrupted

From: Feng Tang
Date: Mon Dec 18 2023 - 22:15:04 EST


While debugging a kernel hang during suspend/resume, random memory
corruption showed up in different places. One instance was caught by
the scheduler, with the error message:

"Kernel panic - not syncing: corrupted stack end detected inside scheduler"

Dumping the memory around the stack end gives more direct hints about
how it was corrupted:

"
Corrupted Stack: ff11000122770000: ff ff ff ff ff ff 14 91 82 3b 78 e8 08 00 45 00 .........;x...E.
Corrupted Stack: ff11000122770010: 00 1d 2a ff 40 00 40 11 98 c8 0a ef 30 2c 0a ef ..*.@.@.....0,..
Corrupted Stack: ff11000122770020: 30 ff a2 00 22 3d 00 09 9a 95 2a 00 00 00 00 00 0..."=....*.....
...
Kernel panic - not syncing: corrupted stack end detected inside scheduler
"

With this dump, the culprit was quickly identified as an Ethernet
driver and its DMA operations.
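
As an illustration of why the dump points at networking, the bytes can
be read back as an Ethernet frame. The userspace sketch below is not
part of the patch; struct frame_guess and guess_frame() are
hypothetical names, and the interpretation is only one plausible
reading of the bytes quoted above: a broadcast destination MAC,
EtherType 0x0800 (IPv4), and an IPv4 header carrying protocol 17 (UDP).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The first 43 bytes of the "Corrupted Stack" dump quoted above. */
static const uint8_t dump[] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x14, 0x91,
	0x82, 0x3b, 0x78, 0xe8, 0x08, 0x00, 0x45, 0x00,
	0x00, 0x1d, 0x2a, 0xff, 0x40, 0x00, 0x40, 0x11,
	0x98, 0xc8, 0x0a, 0xef, 0x30, 0x2c, 0x0a, 0xef,
	0x30, 0xff, 0xa2, 0x00, 0x22, 0x3d, 0x00, 0x09,
	0x9a, 0x95, 0x2a,
};

/* Hypothetical summary of the header fields read back from the dump. */
struct frame_guess {
	uint16_t ethertype;	/* bytes 12..13 of the frame */
	uint8_t ip_version;	/* high nibble of byte 14 */
	uint16_t ip_total_len;	/* bytes 16..17 */
	uint8_t ip_proto;	/* byte 23 */
	uint8_t src_ip[4];	/* bytes 26..29 */
	uint8_t dst_ip[4];	/* bytes 30..33 */
};

/* Read a 16-bit big-endian (network order) value. */
static uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return 0 if buf looks like Ethernet + IPv4, -1 otherwise. */
static int guess_frame(const uint8_t *buf, size_t len, struct frame_guess *out)
{
	if (len < 14 + 20)		/* Ethernet header + minimal IPv4 header */
		return -1;

	out->ethertype = be16(buf + 12);
	if (out->ethertype != 0x0800)	/* not IPv4 */
		return -1;

	out->ip_version = buf[14] >> 4;
	if (out->ip_version != 4)
		return -1;

	out->ip_total_len = be16(buf + 16);
	out->ip_proto = buf[23];
	memcpy(out->src_ip, buf + 26, 4);
	memcpy(out->dst_ip, buf + 30, 4);
	return 0;
}
```

Under this reading, guess_frame(dump, sizeof(dump), &g) succeeds and
reports an IPv4 total length of 29 bytes, which together with the
14-byte Ethernet header accounts for exactly the 43 non-zero leading
bytes of the dump.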

Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
---
kernel/sched/core.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a795e030678c..1280f7012bc5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5949,8 +5949,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
 static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
-	if (task_stack_end_corrupted(prev))
+	if (task_stack_end_corrupted(prev)) {
+		unsigned long *ptr = end_of_stack(prev);
+
+		/* Dump 16 ulong words around the corruption point */
+#ifdef CONFIG_STACK_GROWSUP
+		ptr -= 15;
+#endif
+		print_hex_dump(KERN_ERR, "Corrupted Stack: ",
+			DUMP_PREFIX_ADDRESS, 16, 1, ptr, 16 * sizeof(*ptr), 1);
+
 		panic("corrupted stack end detected inside scheduler\n");
+	}
 
 	if (task_scs_end_corrupted(prev))
 		panic("corrupted shadow stack detected inside scheduler\n");
--
2.27.0
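
The choice of dump window in the hunk above can be modelled in
userspace. This is a minimal sketch, not kernel code: dump_start() and
its parameters are hypothetical, and it assumes end_of_stack() yields
the lowest word of the stack when the stack grows down and the last
word when it grows up (CONFIG_STACK_GROWSUP).

```c
#include <stddef.h>

#define DUMP_WORDS 16	/* the patch dumps 16 unsigned long words */

/*
 * Hypothetical model: 'stack' is the base of an array of 'nwords'
 * words standing in for the task stack. Returns the first word of
 * the DUMP_WORDS-sized window that the patch would print.
 */
static const unsigned long *dump_start(const unsigned long *stack,
				       size_t nwords, int grows_up)
{
	/* Assumed behaviour of end_of_stack(): lowest word, or last word. */
	const unsigned long *end = grows_up ? stack + nwords - 1 : stack;

	/*
	 * Growing up: step back 15 words so that 'end' is the last word
	 * of the window. Growing down: the window starts at 'end'.
	 */
	return grows_up ? end - (DUMP_WORDS - 1) : end;
}
```

In both cases the window stays inside the stack area and includes the
end-of-stack word. With 8-byte words, 16 * sizeof(unsigned long) is
128 bytes, which print_hex_dump() emits as 8 rows of 16 bytes.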