Re: [regression] suspend stress test stalls within 30 minutes

From: Kalle Valo
Date: Fri May 17 2024 - 14:59:13 EST


Dave Hansen <dave.hansen@xxxxxxxxx> writes:

> On 5/17/24 11:37, Kalle Valo wrote:
>> While writing this email I found another way to continue the suspend
>> after a stall: terminate rtcwake with CTRL-C in the ssh session running
>> the for loop. That explains why 'sudo shutdown -h now' makes the suspend
>> go forward, it most likely kills the stalled rtcwake process.
>
> Could we try and figure out what rtcwake is doing during its stall? A
> couple of ideas:
>
> You could strace it to see if it's hung in the kernel:
>
> strace -o strace.log rtcwake ... <args here>
>
> You could look at its stack in /proc, like this:
>
> # cat /proc/`pidof sleep`/stack
> [<0>] hrtimer_nanosleep+0xb5/0x190
> [<0>] common_nsleep+0x44/0x50
> [<0>] __x64_sys_clock_nanosleep+0xcb/0x140
> [<0>] do_syscall_64+0x65/0x140
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
>
> Or you can use sysrq:
>
> echo t > /proc/sysrq-trigger
>
> to get *all* tasks' stacks dumped out to dmesg.
>
> I'd probably do all three in that order.
>
> Getting a function-graph trace of rtcwake during the stall would also be
> nice, but that's a lot of data so let's try the easier things first.

I can do all that, but most probably not this week. Luckily the bug is
quite easy to reproduce: one time I even saw it in the first iteration,
and it usually shows up within 15 minutes or so.
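
As a rough sketch, the three steps suggested above could be scripted
along these lines and run from a second ssh session once the stall is
hit. The function name collect_stall_info, the output directory and the
file names are made up for illustration. Attaching with 'strace -p' to
the already stalled process is an assumption; the suggestion above was
to start rtcwake under strace from the beginning. Most of this needs
root.

```shell
#!/bin/sh
# Sketch only: gather diagnostics for a stalled process (default: rtcwake).
# collect_stall_info, the output directory and the file names are
# hypothetical, not part of any existing tool.

collect_stall_info() {
    name=${1:-rtcwake}
    outdir=${2:-./stall-debug}

    # Take the first PID reported for the process name, if any.
    pid=$(pidof "$name" 2>/dev/null | awk '{print $1}')
    if [ -z "$pid" ]; then
        echo "no running process named $name" >&2
        return 1
    fi

    mkdir -p "$outdir" || return 1

    # 1. Attach strace to the running process for a few seconds to see
    #    whether it is sitting in a syscall (assumption: attach with -p
    #    instead of launching the process under strace).
    timeout 5 strace -p "$pid" -o "$outdir/strace.log" 2>/dev/null ||
        echo "strace step did not finish cleanly" >&2

    # 2. Kernel stack of the process (readable by root only).
    cat "/proc/$pid/stack" > "$outdir/stack.txt" 2>&1 ||
        echo "could not read /proc/$pid/stack" >&2

    # 3. Dump all tasks' stacks to the kernel log and save the log.
    if echo t 2>/dev/null > /proc/sysrq-trigger; then
        dmesg > "$outdir/dmesg.txt"
    else
        echo "could not write to /proc/sysrq-trigger" >&2
    fi

    return 0
}
```

After sourcing the file, 'collect_stall_info rtcwake ./stall-debug'
would leave strace.log, stack.txt and dmesg.txt in ./stall-debug. The
individual steps only warn on failure, so one missing tool does not
prevent the other outputs from being collected.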

And do let me know if there's anything else I should try.

--
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches