5.14-rc1: BUG: workqueue lockup
From: Chris Murphy
Date: Tue Jul 13 2021 - 13:05:02 EST
Hi,
[ 0.000000] kernel: Linux version 5.14.0-0.rc1.16.fc35.x86_64+debug
(mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc (GCC) 11.1.1
20210623 (Red Hat 11.1.1-6), GNU ld version 2.36.1-15.fc35) #1 SMP Mon
Jul 12 14:29:14 UTC 2021
Lenovo Thinkpad X1
Sequence of events: boot seemed normal (I never went looking for, and
did not notice, the early splats and lockdep warnings related to
bluetooth). I worked for a couple of hours, put the laptop in s2idle,
resumed work, and then ran 'stress-ng -c8' in Terminal. Immediately
the whole system became unresponsive: not even the mouse pointer would
move, and it wasn't possible to ssh into the laptop. During the 10
minutes that followed, some desktop UI did change, so the graphical
environment was still working but substantially delayed relative to
the inputs.
I've been seeing workqueue lockups in Fedora openQA testing with VMs
that do not have bluetooth. Therefore I think the usb and bluetooth
related splat and lockdep warning early on have nothing to do with the
later workqueue lockup.
dmesg (2 week expiration)
https://pastebin.com/zgkLiSkp
This excerpt is from the full log, just as a marker for when stress-ng
was started:
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000
These are the first kernel messages to appear following loss of
control (responsiveness):
[ 6485.133492] kernel: perf: interrupt took too long (2540 > 2500),
lowering kernel.perf_event_max_sample_rate to 78000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0
flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0
flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0
flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0
flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0
flags=0x0 nice=0 stuck for 38s!
So the first BUG is reported by the kernel roughly 55 seconds after
stress-ng started (6503.01 vs. 6448.19), but loss of control happened
from the moment stress-ng was run.
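
For anyone who wants to check the timing against the excerpts above,
here is a minimal Python sketch. It is not part of the original
debugging; the regexes and the function names (parse, summarize) are
illustrative assumptions, and the log lines are the ones quoted in
this mail with the wrapped lines rejoined. It also assumes that
"stuck for Ns" means the pool last made progress about N seconds
before the message was printed, which is only an approximation of
what the workqueue watchdog measures.

```python
import re

# Excerpts quoted in this report, with wrapped lines rejoined.
LOG = """\
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000
[ 6485.133492] kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 38s!
"""

# "[ 6448.192901] message" -> (timestamp, message)
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Pull the CPU list and the "stuck for" value out of a lockup message.
LOCKUP = re.compile(r"BUG: workqueue lockup - pool cpus=(\S+) .* stuck for (\d+)s!")


def parse(log):
    """Return a list of (timestamp_seconds, message) tuples."""
    entries = []
    for line in log.splitlines():
        m = TS.match(line)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def summarize(log):
    """Return (stress-ng start time, gap to first lockup report, lockups)."""
    entries = parse(log)
    # The stress-ng marker line is the only one mentioning stress-ng.
    start = next(t for t, msg in entries if "stress-ng" in msg)
    lockups = []
    for t, msg in entries:
        m = LOCKUP.search(msg)
        if m:
            lockups.append((m.group(1), t, int(m.group(2))))
    first = min(t for _, t, _ in lockups)
    return start, first - start, lockups


start, gap, lockups = summarize(LOG)
print(f"first lockup report {gap:.1f}s after stress-ng start")
for cpu, t, stuck in lockups:
    # Approximate time the pool last made progress, relative to the start.
    since = t - stuck - start
    print(f"cpus={cpu}: stuck {stuck}s, last progress ~{since:+.0f}s from start")
```

Under that assumption, the pools for cpus 0, 1 and 3 stopped making
progress within a second or two of stress-ng starting, which matches
the immediate loss of responsiveness.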
Kernel config:
https://pastebin.com/QzvEy1sQ
--
Chris Murphy