[PATCH 0/4] Fix scalability problem in workqueue watchdog touch caused by stop_machine

From: Nicholas Piggin
Date: Tue Jun 25 2024 - 07:43:07 EST


This series fixes a lockup caused by very slow progress when thousands
of CPUs in multi_cpu_stop hammer the workqueue watchdog touch, which
does not scale. Patch 2 is the fix.
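
For readers unfamiliar with this kind of contention, here is a minimal
user-space sketch of one common way to make a frequently called "touch"
cheaper: read the shared timestamp first and store to it only when it
has gone stale, so most callers never dirty the shared cacheline. This
illustrates the general technique and is not taken from patch 2. The
names touch_shared, shared_touched, store_count and max_age are all
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model, not kernel code. */
static _Atomic unsigned long shared_touched;	/* one timestamp shared by all callers */
static unsigned long store_count;		/* counts stores; illustration only, not thread-safe */

/*
 * Record progress at time `now` (in arbitrary ticks).
 * The shared timestamp is written only when it is at least `max_age`
 * ticks old; otherwise the call is a read-only fast path.
 */
static void touch_shared(unsigned long now, unsigned long max_age)
{
	unsigned long last;

	assert(max_age > 0);

	last = atomic_load_explicit(&shared_touched, memory_order_relaxed);

	/* Unsigned subtraction stays correct across counter wraparound. */
	if (now - last < max_age)
		return;

	atomic_store_explicit(&shared_touched, now, memory_order_relaxed);
	store_count++;
}
```

The trade-off is that the shared timestamp may lag the true last touch
by up to max_age ticks, so any consumer comparing it against a timeout
has to tolerate that slack.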

While writing a microbenchmark reproducer I noticed that the RCU call
also causes slowdowns. It is not nearly as bad as the workqueue touch,
but workqueue queueing of dummy jobs became several times slower when
many other CPUs were calling rcu_momentary_dyntick_idle(). The
stop_machine patches reduce that. Patches 3 and 4 are independent of
the first two and can be merged in any order.
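
As a rough illustration of the idea behind adding a delay between
touches, the sketch below rate-limits an expensive per-iteration call
inside a polling loop. It is a user-space model under assumed names
(struct touch_limiter, limiter_poll) with an arbitrary interval. It
does not reproduce the stop_machine change or the delay value used
there.

```c
#include <assert.h>

/* Hypothetical per-CPU state for a spinning waiter. */
struct touch_limiter {
	unsigned long next_touch;	/* earliest tick at which to touch again */
	unsigned long interval;		/* minimum ticks between touches */
	unsigned long touches;		/* touches performed so far */
};

/*
 * Called on every iteration of the spin loop with the current tick.
 * Returns 1 when the expensive work (a stand-in for the watchdog touch
 * and the RCU call) was done, 0 when it was skipped.
 */
static int limiter_poll(struct touch_limiter *tl, unsigned long now)
{
	assert(tl->interval > 0);

	/* Signed difference keeps the comparison valid across wraparound. */
	if ((long)(now - tl->next_touch) < 0)
		return 0;

	tl->next_touch = now + tl->interval;
	tl->touches++;
	return 1;
}
```

With an interval of 100 ticks, a loop that polls on every tick from 0
to 999 does the expensive work 10 times instead of 1000.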

Thanks,
Nick

Nicholas Piggin (4):
  workqueue: wq_watchdog_touch is always called with valid CPU
  workqueue: Improve scalability of workqueue watchdog touch
  stop_machine: Rearrange multi_cpu_stop state machine loop
  stop_machine: Add a delay between multi_cpu_stop touching watchdogs

 kernel/stop_machine.c | 31 +++++++++++++++++++++++--------
 kernel/workqueue.c    | 12 ++++++++++--
 2 files changed, 33 insertions(+), 10 deletions(-)

--
2.45.1