Re: BUG: workqueue lockup - SRCU schedules work on not-online CPUs during size transition
From: Srikar Dronamraju
Date: Wed Apr 29 2026 - 11:01:04 EST
* Tejun Heo <tj@xxxxxxxxxx> [2026-04-10 08:53:30]:
Hi Tejun,
[ copying Samir Mulani to this thread ]
> Hello,
>
> > Seems that we (mostly Paul) have our own trick to track whether a CPU
> > has ever been onlined in RCU, see rcu_cpu_beenfullyonline(). Paul also
> > used it in his fix [1]. And I think it won't be that hard to copy it
> > into workqueue and let queue_work_on() use it so that if the user queues
> > a work on a never-onlined CPU, it can detect it (with a warning?) and do
> > something?
>
> The easiest way to do this is just creating the initial workers for all
> possible pools. Please see below. However, the downside is that it's going
> to create all workers for all possible cpus. This isn't a problem for
> anybody else but these IBM mainframes often come up with a lot of possible
> but not-yet-or-ever-online CPUs for capacity management, so the cost may not
> be negligible on some configurations.
>
> IBM folks, is that okay?
Even on PowerPC LPARs, it's not uncommon to have possible CPUs != online CPUs
at boot. However, your approach will work.
Samir has also tested it and reported the results here:
https://lkml.kernel.org/r/1b89c25b-7c1d-4ed8-adf3-ac504b6f086a@xxxxxxxxxxxxx
>
> Also, why do you need to queue work items on an offline CPU? Do they
> actually have to be per-cpu? Can you get away with using an unbound
> workqueue?
>
> Thanks.
>
> From: Tejun Heo <tj@xxxxxxxxxx>
> Subject: workqueue: Create workers for all possible CPUs on init
>
> Per-CPU worker pools are initialized for every possible CPU during early boot,
> but workqueue_init() only creates initial workers for online CPUs. On systems
> where possible CPUs outnumber online CPUs (e.g. s390 LPARs with 76 online and
> 400 possible CPUs), the pools for never-onlined CPUs have POOL_DISASSOCIATED
> set but no workers. Any work item queued on such a CPU hangs indefinitely.
>
> This was exposed by 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when
> non-preemptible") which made SRCU schedule callbacks on all possible CPUs
> during size transitions, triggering workqueue lockup warnings for all
> never-onlined CPUs.
>
> Create workers for all possible CPUs during init, not just online ones. For
> online CPUs, the behavior is unchanged - POOL_DISASSOCIATED is cleared and the
> worker is bound to the CPU. For not-yet-online CPUs, POOL_DISASSOCIATED
> remains set, so worker_attach_to_pool() marks the worker UNBOUND and it can
> execute on any CPU. When the CPU later comes online, rebind_workers() handles
> the transition to associated operation as usual.
>
With this patch, if a CPU has been onlined once, it should be OK to queue
work on that CPU even if it's offline now.
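
For anyone following along, the behavior change can be sketched with a rough
userspace model. This is not kernel code: struct model_pool, the model_*
helpers and MODEL_POOL_DISASSOCIATED are made-up names for illustration, and
the model keeps one pool per CPU, whereas the kernel keeps more than one. It
only shows which pools end up with a worker and which keep the disassociated
flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's POOL_DISASSOCIATED flag. */
#define MODEL_POOL_DISASSOCIATED 0x1u

/* Hypothetical, heavily simplified per-CPU worker pool. */
struct model_pool {
	unsigned int flags;
	int nr_workers;
};

/* After early boot: every possible CPU has a pool, flagged disassociated,
 * with no workers yet. */
static void model_pools_reset(struct model_pool *pools, int nr_possible)
{
	assert(nr_possible >= 0);
	for (int cpu = 0; cpu < nr_possible; cpu++) {
		pools[cpu].flags = MODEL_POOL_DISASSOCIATED;
		pools[cpu].nr_workers = 0;
	}
}

/* Model of the old loop: only online CPUs get an initial worker. */
static void model_init_online_only(struct model_pool *pools,
				   const bool *online, int nr_possible)
{
	for (int cpu = 0; cpu < nr_possible; cpu++) {
		if (!online[cpu])
			continue;
		pools[cpu].flags &= ~MODEL_POOL_DISASSOCIATED;
		pools[cpu].nr_workers++;
	}
}

/* Model of the patched loop: every possible CPU gets an initial worker,
 * but only online CPUs have the disassociated flag cleared. */
static void model_init_all_possible(struct model_pool *pools,
				    const bool *online, int nr_possible)
{
	for (int cpu = 0; cpu < nr_possible; cpu++) {
		if (online[cpu])
			pools[cpu].flags &= ~MODEL_POOL_DISASSOCIATED;
		pools[cpu].nr_workers++;
	}
}

/* In this model, a queued item can make progress only if its pool has at
 * least one worker. */
static bool model_work_can_run(const struct model_pool *pool)
{
	return pool->nr_workers > 0;
}
```

With online = { true, true, false, false }, the online-only variant leaves the
pools for CPUs 2 and 3 without workers, so model_work_can_run() is false for
them. The all-possible variant gives them a worker while leaving the
disassociated flag set.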
> Reported-by: Vasily Gorbik <gor@xxxxxxxxxxxxx>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Boqun Feng <boqun@xxxxxxxxxx>
> Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Reviewed-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxx>
> ---
> kernel/workqueue.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -8068,9 +8068,10 @@ void __init workqueue_init(void)
>  	for_each_bh_worker_pool(pool, cpu)
>  		BUG_ON(!create_worker(pool));
>
> -	for_each_online_cpu(cpu) {
> +	for_each_possible_cpu(cpu) {
>  		for_each_cpu_worker_pool(pool, cpu) {
> -			pool->flags &= ~POOL_DISASSOCIATED;
> +			if (cpu_online(cpu))
> +				pool->flags &= ~POOL_DISASSOCIATED;
>  			BUG_ON(!create_worker(pool));
>  		}
>  	}
> --
> tejun
--
Thanks and Regards
Srikar Dronamraju