Re: [PATCH][RFC] workqueue: Fix kernel panic on CPU hot-unplug

From: Helge Deller
Date: Thu Feb 01 2024 - 11:41:40 EST


On 1/31/24 23:28, Tejun Heo wrote:
> On Wed, Jan 31, 2024 at 08:27:45PM +0100, Helge Deller wrote:
>> When hot-unplugging a 32-bit CPU on the parisc platform with
>> "chcpu -d 1", I get the following kernel panic. Adding a check
>> for !pwq prevents the panic.
>>
>> Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000
>> CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.8.0-rc1-32bit+ #1291
>> Hardware name: 9000/778/B160L
>>
>> IASQ: 00000000 00000000 IAOQ: 10446db4 10446db8
>> IIR: 0f80109c ISR: 00000000 IOR: 00000000
>> CPU: 1 CR30: 11dd1710 CR31: 00000000
>> IAOQ[0]: wq_update_pod+0x98/0x14c
>> IAOQ[1]: wq_update_pod+0x9c/0x14c
>> RP(r2): wq_update_pod+0x80/0x14c
>> Backtrace:
>> [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
>> [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
>> [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
>> [<10452970>] smpboot_thread_fn+0x284/0x288
>> [<1044d8f4>] kthread+0x12c/0x13c
>> [<1040201c>] ret_from_kernel_thread+0x1c/0x24
>> Kernel panic - not syncing: Kernel Fault
>>
>> Signed-off-by: Helge Deller <deller@xxxxxx>
>>
>> ---
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 76e60faed892..dfeee7b7322c 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4521,6 +4521,8 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu,
>>  	wq_calc_pod_cpumask(target_attrs, cpu, off_cpu);
>>  	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
>>  					lockdep_is_held(&wq_pool_mutex));
>> +	if (!pwq)
>> +		return;
>
> Hmm... I have a hard time imagining a scenario where some CPUs don't have
> pwq installed on wq->cpu_pwq. Can you please run `drgn
> tools/workqueue/wq_dump.py` before triggering the hotplug event and paste
> the output along with full dmesg?

I'm not sure whether parisc is already fully supported by that tool, or
whether I'm doing something wrong:

root@debian:~# uname -a
Linux debian 6.8.0-rc1-32bit+ #1292 SMP PREEMPT Thu Feb 1 11:31:38 CET 2024 parisc GNU/Linux

root@debian:~# drgn --main-symbols -s ./vmlinux ./wq_dump.py
Traceback (most recent call last):
  File "/usr/bin/drgn", line 33, in <module>
    sys.exit(load_entry_point('drgn==0.0.25', 'console_scripts', 'drgn')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/drgn/cli.py", line 301, in _main
    runpy.run_path(script, init_globals={"prog": prog}, run_name="__main__")
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "./wq_dump.py", line 78, in <module>
    worker_pool_idr = prog['worker_pool_idr']
                      ~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'worker_pool_idr'

Maybe you have an idea? I'll check further, but otherwise it's probably
easier for me to add some printk() calls to wq_update_pod() and send you
that output?
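
To narrow down the drgn side, a small probe along these lines might help.
It is only a sketch: the helper name missing_symbols() is made up for
illustration, and the only symbol name taken from the traceback above is
'worker_pool_idr'. It relies on a failed prog[...] lookup raising KeyError,
as the traceback shows, and on drgn handing the script a global `prog`.

```python
def missing_symbols(prog, names):
    """Return the names from `names` that cannot be looked up in `prog`.

    `prog` only needs to support prog[name] and raise KeyError on a miss,
    which is how the lookup in the traceback failed.
    """
    missing = []
    for name in names:
        try:
            prog[name]
        except KeyError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # When run through the drgn CLI, `prog` is injected as a global.
    # The empty dict is a stand-in (an assumption for illustration) so the
    # sketch also runs as plain Python without drgn or a vmlinux.
    target = globals().get("prog", {})
    print(missing_symbols(target, ["worker_pool_idr"]))
```

If the probe reports the name as missing even with -s ./vmlinux, the
problem is more likely in symbol loading than in wq_dump.py itself.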

Helge