Re: [PATCH v2] workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works
From: Matthew Brost
Date: Wed Apr 01 2026 - 11:47:30 EST
On Wed, Apr 01, 2026 at 10:44:55AM -0400, Waiman Long wrote:
> On 3/31/26 9:07 PM, Matthew Brost wrote:
> > In unplug_oldest_pwq(), the first inactive work item on the
> > pool_workqueue is activated correctly. However, if multiple inactive
> > works exist on the same pool_workqueue, subsequent works fail to
> > activate because wq_node_nr_active.pending_pwqs is empty — the list
> > insertion is skipped when the pool_workqueue is plugged.
> >
> > Fix this by checking for additional inactive works in
> > unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs
> > accordingly.
> >
> > v2:
> > - Use pwq_activate_first_inactive(pwq, false) rather than open coding
> > list operations (Tejun)
> >
> > Cc: Carlos Santa <carlos.santa@xxxxxxxxx>
> > Cc: Ryan Neph <ryanneph@xxxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx
> > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
> > Cc: Waiman Long <longman@xxxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> >
> > ---
> >
> > This bug was first reported by Google, where the Xe driver appeared to
> > hang because a fence never signaled. We traced the issue to work items
> > not being scheduled, and it can be trivially reproduced on drm-tip with
> > the following commands:
> >
> > shell0:
> > for i in {1..100}; do echo "Run $i"; xe_exec_threads --r \
> > threads-rebind-bindexecqueue; done
> >
> > shell1:
> > for i in {1..1000}; do echo "toggle $i"; echo f > \
> > /sys/devices/virtual/workqueue/cpumask; echo ff > \
> > /sys/devices/virtual/workqueue/cpumask; echo fff > \
> > /sys/devices/virtual/workqueue/cpumask; echo ffff > \
> > /sys/devices/virtual/workqueue/cpumask; sleep .1; done
> > ---
> > kernel/workqueue.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index b77119d71641..bee3f37fffde 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1849,8 +1849,17 @@ static void unplug_oldest_pwq(struct workqueue_struct *wq)
> >          raw_spin_lock_irq(&pwq->pool->lock);
> >          if (pwq->plugged) {
> >                  pwq->plugged = false;
> > -                if (pwq_activate_first_inactive(pwq, true))
> > +                if (pwq_activate_first_inactive(pwq, true)) {
> > +                        /*
> > +                         * pwq is unbound. Additional inactive work items need
> > +                         * to reinsert the pwq into nna->pending_pwqs, which
> > +                         * was skipped while pwq->plugged was true. See
> > +                         * pwq_tryinc_nr_active() for additional details.
> > +                         */
> > +                        pwq_activate_first_inactive(pwq, false);
> > +
> >                          kick_pool(pwq->pool);
> > +                }
> >          }
> >          raw_spin_unlock_irq(&pwq->pool->lock);
> >  }
>
> Thanks for fixing this bug.

No problem. I think this one has been lurking around for a while; we've
just papered over it in Xe for a couple of years.

> However, calling pwq_activate_first_inactive twice can be a bit hard
> to understand.

I actually think it makes quite a bit of sense, as it matches what
__queue_work() does when two items are queued back-to-back on an ordered
workqueue: the first one updates the nr_active counts and activates, and
the second one updates pending_pwqs.
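
Roughly, for an ordered workqueue, something like this (a made-up module
snippet to illustrate the two steps, not mainline code; the comments
refer to the helpers in kernel/workqueue.c):

	#include <linux/module.h>
	#include <linux/workqueue.h>

	static void example_fn(struct work_struct *work)
	{
	}

	static DECLARE_WORK(w1, example_fn);
	static DECLARE_WORK(w2, example_fn);

	static int __init example_init(void)
	{
		/* Ordered workqueues are unbound with max_active == 1. */
		struct workqueue_struct *wq =
			alloc_ordered_workqueue("example", 0);

		if (!wq)
			return -ENOMEM;

		/*
		 * First item: pwq_tryinc_nr_active() wins nna->nr and the
		 * work is activated immediately.
		 */
		queue_work(wq, &w1);

		/*
		 * Second item: nna->nr is saturated, so the work stays
		 * inactive and pwq_tryinc_nr_active() links the pwq onto
		 * nna->pending_pwqs; it is activated once w1 finishes.
		 */
		queue_work(wq, &w2);

		return 0;	/* wq intentionally leaked for brevity */
	}
	module_init(example_init);

	MODULE_LICENSE("GPL");

That second step is exactly what gets skipped while pwq->plugged is set,
which is why the unplug path has to do it itself.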
> Will modifying pwq_tryinc_nr_active() like the following work?

My initial thought was that your snippet should work, and in fact it
does for a while (unpatched drm-tip hangs almost immediately), but I
eventually hit a hang when running my reproducer, whereas with this
patch I don't. I can't say exactly why; maybe it's because
node_activate_pending_pwq() can find a plugged pwq, but that's just a
guess.
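
If that guess is right, the interleaving with your snippet would look
something like this (hand-drawn sketch only, not a verified trace):

	pwq_tryinc_nr_active()           pwq_dec_nr_active()
	----------------------           -------------------
	pwq->plugged is set, so link
	pwq onto nna->pending_pwqs
	                                 node_activate_pending_pwq()
	                                   pops the still-plugged pwq and
	                                   activates a work item on it,
	                                   ahead of unplug_oldest_pwq()
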
Matt
> Thanks,
> Longman
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index b77119d71641..b35e6e62e474 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1738,9 +1738,6 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
>                  goto out;
>          }
>
> -        if (unlikely(pwq->plugged))
> -                return false;
> -
>          /*
>           * Unbound workqueue uses per-node shared nr_active $nna. If @pwq is
>           * already waiting on $nna, pwq_dec_nr_active() will maintain the
> @@ -1749,13 +1746,19 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
>           * We need to ignore the pending test after max_active has increased as
>           * pwq_dec_nr_active() can only maintain the concurrency level but not
>           * increase it. This is indicated by @fill.
> +         *
> +         * If @pwq is plugged, we need to make sure that it is linked to the
> +         * pending_pwqs list of $nna.
> +         *
>           */
> -        if (!list_empty(&pwq->pending_node) && likely(!fill))
> +        if (!list_empty(&pwq->pending_node) && likely(!fill || pwq->plugged))
>                  goto out;
>
> -        obtained = tryinc_node_nr_active(nna);
> -        if (obtained)
> -                goto out;
> +        if (likely(!pwq->plugged)) {
> +                obtained = tryinc_node_nr_active(nna);
> +                if (obtained)
> +                        goto out;
> +        }
>
>          /*
>           * Lockless acquisition failed. Lock, add ourself to $nna->pending_pwqs
> @@ -1773,7 +1776,8 @@ static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
>          smp_mb();
>
> -        obtained = tryinc_node_nr_active(nna);
> +        if (likely(!pwq->plugged))
> +                obtained = tryinc_node_nr_active(nna);
>
>          /*
>           * If @fill, @pwq might have already been pending. Being spuriously