Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
From: Petr Mladek
Date: Wed Mar 18 2026 - 14:00:14 EST
On Fri 2026-03-13 10:36:09, Breno Leitao wrote:
> On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> > On Fri 2026-03-13 05:24:54, Breno Leitao wrote:
> > > I am currently rolling this patchset out to production, and I will report
> > > back once I get more information.
> >
> > That would be great. I am really curious what the root problem is here.
>
> In fact, I got some instances of this issue with this new patchset, and,
> still, the backtrace is empty. These are the only 3 issues I got with the new
> patches applied. All of them involve the "blk_mq_timeout_work" function.
>
> BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
> work func=blk_mq_timeout_work data=0xffff0000b88e3405
> Showing busy workqueues and worker pools:
> workqueue kblockd: flags=0x18
> pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
> pending: blk_mq_timeout_work
This report shows the stalled "pool 11" in the list of busy
worker pools.
> Showing backtraces of busy workers in stalled CPU-bound worker pools:
>
> BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
> work func=blk_mq_timeout_work data=0xffff0000b88e3205
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
> pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
> pending: psi_avgs_work
It is strange that the pwq for "pool 7" is not listed here.
> Showing backtraces of busy workers in stalled CPU-bound worker pools:
>
> BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
> work func=blk_mq_timeout_work data=0xffff0000b8706805
> Showing busy workqueues and worker pools:
And the list of busy worker pools is even empty here.
> Showing backtraces of busy workers in stalled CPU-bound worker pools:
I would expect that the stalled pool would be shown by show_one_workqueue().
show_one_workqueue() checks pwq->nr_active instead of
list_empty(&pool->worklist). But my understanding is that work items
added to pool->worklist should be counted in the related
pwq->nr_active. In fact, pwq->nr_active seems to be decremented
only when the work is processed or removed from the queue, so
an item should still be counted in nr_active even when it is already
in progress. As a result, show_one_workqueue() should print even pools
whose last assigned work is in-flight.
Maybe I am missing something. For example, barriers are not counted
in nr_active, ...
Anyway, the backtrace of the last woken worker might give us
some clues. It might show that the pool is stuck on some
wq_barrier or similar.
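To make the nr_active accounting above concrete, here is a tiny userspace model of my understanding (a sketch only; the struct and function names are invented and merely mirror the behavior described above, this is not kernel code):

```c
/* Toy model of a pool_workqueue's active counter; all names are
 * illustrative, the real logic lives in kernel/workqueue.c. */
struct toy_pwq {
	int nr_active;	/* queued plus in-flight work items */
	int worklist;	/* items still sitting on pool->worklist */
};

static void toy_queue_work(struct toy_pwq *pwq)
{
	pwq->nr_active++;	/* counted as soon as it is queued */
	pwq->worklist++;
}

static void toy_start_work(struct toy_pwq *pwq)
{
	pwq->worklist--;	/* removed from the worklist... */
	/* ...but nr_active is NOT dropped yet: the item is in flight */
}

static void toy_finish_work(struct toy_pwq *pwq)
{
	pwq->nr_active--;	/* only dropped once the work completes */
}

static int toy_pool_busy(const struct toy_pwq *pwq)
{
	/* show_one_workqueue() effectively keys off nr_active */
	return pwq->nr_active > 0;
}
```

In this model, once the only queued item has been started, the worklist is
empty while nr_active is still 1, which is why I would expect
show_one_workqueue() to keep printing such a pool.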
> > In all these cases, there is listed some pending work on the stuck
> > "cpus=XXX". So, it looks more sane than the 1st report.
> >
> > I agree that it looks ugly that it did not print any backtraces.
> > But I am not sure if the backtraces would help.
> >
> > If there is no running worker then wq_worker_sleeping() should wake up
> > another idle worker. And if this is the last idle worker in the
> > per-CPU pool then it should create another worker.
> >
> > Honestly, I think that there is only a small chance that the backtraces
> > of the sleeping workers will help to solve the problem.
> >
> > IMHO, the problem is that wq_worker_sleeping() was not able to
> > guarantee forward progress. Note that there should always be
> > at least one idle worker in CPU-bound worker pools.
> >
> > Now, there might be several reasons why it failed:
> >
> > 1. It did not wake up any idle worker because it thought
> > this had already been done, for example because of a messed-up
> > worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
> > or pool->nr_running count.
> >
> > IMHO, the chance of this bug is small.
> >
> >
> > 2. The scheduler does not schedule the woken idle worker because:
> >
> > + too high a load
> > + soft/hardlockup on the given CPU
> > + the scheduler does not schedule anything, e.g. because of
> > stop_machine()
> >
> > It seems that this is not the case in the 1st example, where
> > the CPU is idle. But I am not sure how exactly IPIs are
> > handled on arm64.
>
> I don't have information about the load of those machines when the problem
> happens, but in some cases the problem happens when there is no workload
> (production job) running on those machines, so it is hard to assume that the
> load is high.
>
> > 3. There always must be at least one idle worker in each pool.
> > But the last idle worker never processes pending work.
> > It has to create another worker instead.
> >
> > create_worker() might fail for several reasons:
> >
> > + worker pool limit (is there any?)
> > + PID limit
> > + memory limit
> >
> > I have personally seen these problems caused by PID limit.
> > Note that containers might have relatively small limits by
> > default !!!
>
> Might this explain why the WORK_STRUCT_PENDING bit stays set for ~200
> seconds?
>
>
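It could. To illustrate that failure mode, here is a toy userspace model (a sketch of my understanding only; all names are invented, the real logic lives in kernel/workqueue.c) of the rule that the last idle worker must first create a replacement, with create_worker() failing under a resource limit such as a PID limit:

```c
#include <stdbool.h>

/* Toy model of the "last idle worker must manage first" rule. */
struct toy_pool {
	int nr_idle;	/* idle workers parked in the pool */
	int pending;	/* work items waiting on the worklist */
	bool can_fork;	/* false models a hit PID/memory limit */
};

/* models create_worker(): may fail under resource limits */
static bool toy_create_worker(struct toy_pool *pool)
{
	if (!pool->can_fork)
		return false;	/* e.g. the PID limit is reached */
	pool->nr_idle++;
	return true;
}

/* one scheduling step: returns true if a pending item was started */
static bool toy_pool_step(struct toy_pool *pool)
{
	if (!pool->pending || !pool->nr_idle)
		return false;
	/* the last idle worker must create a replacement first */
	if (pool->nr_idle == 1 && !toy_create_worker(pool))
		return false;	/* stuck: the work stays pending */
	pool->nr_idle--;
	pool->pending--;
	return true;
}
```

In this model, when the fork keeps failing the pending count never drops,
which would look exactly like a WORK_STRUCT_PENDING bit staying set for
hundreds of seconds.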
> > I think that it might be interesting to print the backtrace and
> > state of the worker which is supposed to guarantee progress.
> > Is it "pool->manager"?
> >
> > Also, create_worker() prints an error when it can't create a worker.
> > But the error is printed only once, and it might get lost on
> > huge systems with extensive load and logging.
>
> That is definitely not the case. I've scanned Meta's whole fleet for create_worker
> errors, and there is a single instance on an unrelated host.
Good to know. I am more and more curious what the culprit might be here.
Best Regards,
Petr