workqueue lockup debug

From: John Garry
Date: Thu Oct 24 2024 - 11:50:44 EST


Hi workqueue and scheduler maintainers,

As reported in https://lore.kernel.org/linux-fsdevel/df9db1ce-17d9-49f1-ab6d-7ed9a4f1f9c0@xxxxxxxxxx/T/#m506b9edb1340cdddd87c6d14d20222ca8d7e8796, I am experiencing a workqueue lockup on v6.12-rcX.

When it occurs, the system becomes unresponsive and I cannot recover it.

Enabling /proc/sys/kernel/softlockup_all_cpu_backtrace does not produce any additional debug output. All I get is something like this:

Message from syslogd@jgarry-atomic-write-exp-e4-8-instance-20231214-1221 at Oct 24 15:34:02 ...
kernel:watchdog: BUG: soft lockup - CPU#29 stuck for 22s! [mysqld:14352]

Message from syslogd@jgarry-atomic-write-exp-e4-8-instance-20231214-1221 at Oct 24 15:34:02 ...
kernel:BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 30s!

Message from syslogd@jgarry-atomic-write-exp-e4-8-instance-20231214-1221 at Oct 24 15:34:02 ...
kernel:BUG: workqueue lockup - pool cpus=31 node=0 flags=0x0 nice=0 stuck for 49s!
^C

Can you advise on a robust method to get debug information from this system?

Maybe this is a scheduler issue, as Dave mentioned in that same thread.

Thanks,
John