Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

From: Breno Leitao

Date: Fri Mar 13 2026 - 08:26:26 EST


Hello Petr,

On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > There is a blind spot in the workqueue stall detector (aka
> > show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> > true, so a busy worker that is sleeping (e.g. wait_event_idle())
> > produces an empty backtrace section even though it is the cause of the
> > stall.
> >
> > Additionally, when the watchdog does report stalled pools, the output
> > doesn't show how long each in-flight work item has been running, making
> > it harder to identify which specific worker is stuck.
> >
> > Example of the sample code:
> >
> > BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> > Showing busy workqueues and worker pools:
> > workqueue events: flags=0x100
> > pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> > in-flight: 178:stall_work1_fn [wq_stall]
> > pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> > ...
> > Showing backtraces of running workers in stalled
> > CPU-bound worker pools:
> > <nothing here>
> >
> > I see it happening on real machines, causing stalls that don't
> > have any backtrace. This is one of the code paths:
> >
> > 1) kfence executes toggle_allocation_gate() as a delayed workqueue
> > item (kfence_timer) on the system WQ.
> >
> > 2) toggle_allocation_gate() enables a static key, which IPIs every
> > CPU to patch code:
> > static_branch_enable(&kfence_allocation_key);
> >
> > 3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> > kfence allocation to occur:
> > wait_event_idle(allocation_wait,
> > atomic_read(&kfence_allocation_gate) > 0 || ...);
> >
> > This can last indefinitely if no allocation goes through the
> > kfence path (or IPIing all the CPUs takes longer, which is common on
> > platforms that do not have NMI).
> >
> > The worker remains in the pool's busy_hash
> > (in-flight) but is no longer task_is_running().
> >
> > 4) The workqueue watchdog detects the stall and calls
> > show_cpu_pool_hog(), which only prints backtraces for workers
> > that are actively running on CPU:
> >
> > static void show_cpu_pool_hog(struct worker_pool *pool) {
> > ...
> > if (task_is_running(worker->task))
> > sched_show_task(worker->task);
> > }
> >
> > 5) Nothing is printed because the offending worker is in TASK_IDLE
> > state. The output shows "Showing backtraces of running workers in
> > stalled CPU-bound worker pools:" followed by nothing, effectively
> > hiding the actual culprit.
>
> I am trying to better understand the situation. There was a reason
> why only the worker in the running state was shown.
>
> Normally, a sleeping worker should not cause a stall. The scheduler calls
> wq_worker_sleeping() which should wake up another idle worker. There is
> always at least one idle worker in the pool. It should start processing
> the next pending work. Or it should fork another worker when it was
> the last idle one.

Right, but let's look at this case:

BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
Showing busy workqueues and worker pools:
workqueue events_unbound: flags=0x2
pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
in-flight: 4083734:btrfs_extent_map_shrinker_worker
workqueue mm_percpu_wq: flags=0x8
pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
Showing backtraces of running workers in stalled CPU-bound worker pools:
# Nothing in here

It seems CPU 13 is idle (curr = 0), yet blk_mq_timeout_work has been pending
for 679s?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure. Keep in mind that these hosts (arm64) do not have NMI, so IPIs are
just regular interrupts that can take a long time to be handled.
toggle_allocation_gate() was a good example, given it was sending IPIs very
frequently, so I took it as the example for the cover letter, but this
problem also shows up in different places. (more examples later)

> Did printing the sleeping workers help to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset.

I am currently rolling this patchset out to production, and I will report
back once I have more information.
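For reference, the direction of this series is roughly the following sketch
(not the exact diff; sched_show_task(), task_is_running(), hash_for_each()
and pool->busy_hash are existing kernel names, the rest follows the shape of
show_cpu_pool_hog() in kernel/workqueue.c):

```c
/*
 * Sketch: print a backtrace for every in-flight worker in the stalled
 * pool, regardless of whether it is currently on a CPU. A sleeping
 * worker (e.g. in TASK_IDLE inside wait_event_idle()) can still be the
 * one holding up the pool, so filtering on task_is_running() hides the
 * interesting trace.
 */
static void show_cpu_pool_hog(struct worker_pool *pool)
{
	struct worker *worker;
	unsigned long irq_flags;
	int bkt;

	raw_spin_lock_irqsave(&pool->lock, irq_flags);

	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
		/*
		 * Previously guarded by task_is_running(worker->task);
		 * now dump sleeping in-flight workers too.
		 */
		printk_deferred_enter();
		pr_info("pool %d:\n", pool->id);
		sched_show_task(worker->task);
		printk_deferred_exit();
	}

	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
}
```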

> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe, print backtrace
> of the currently running process on the given CPU because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. Let me give
more instances of the "empty" stalls I am seeing, all with empty backtraces:

# Instance 1
BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
pending: 3*psi_avgs_work
pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
in-flight: 842:key_garbage_collector
workqueue mm_percpu_wq: flags=0x8
pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2
BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: psi_avgs_work
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
workqueue events_freezable: flags=0x4
pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: pci_pme_list_scan
workqueue slub_flushwq: flags=0x8
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
pending: flush_cpu_slab BAR(7520)
workqueue mm_percpu_wq: flags=0x8
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
pending: cmd_work_handler
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3
BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue mm_percpu_wq: flags=0x8
pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 4
BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
pending: psi_avgs_work, fuse_check_timeout
workqueue events_freezable: flags=0x4
pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: pci_pme_list_scan
workqueue mm_percpu_wq: flags=0x8
pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:
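On your suggestion of also dumping whatever is currently running on the
stalled CPU: something along these lines could be bolted onto the watchdog
path. This is only a sketch; dump_cpu_task() is an existing helper (used
e.g. by the RCU stall detector), while the wrapper name below is made up:

```c
/*
 * Sketch of the idea above: when a per-CPU pool is stalled and none of
 * its busy workers is in the running state, whatever is currently
 * running on that CPU is likely what keeps the idle worker from being
 * scheduled. Note that on platforms without NMI, dump_cpu_task() falls
 * back to an IPI, so it may itself be delayed by the same condition.
 */
static void show_stalled_cpu_curr(struct worker_pool *pool)
{
	if (pool->cpu < 0)	/* unbound pool: no single CPU to blame */
		return;

	pr_info("Backtrace of CPU %d (possibly blocking pool %d):\n",
		pool->cpu, pool->id);
	dump_cpu_task(pool->cpu);
}
```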

Thanks for your help,
--breno