Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics

From: Petr Mladek

Date: Thu Mar 12 2026 - 13:08:16 EST


On Thu 2026-03-05 08:15:40, Breno Leitao wrote:
> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()). This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle().

IMHO, this example is misleading. AFAIK, workers clear the PF_WQ_WORKER
flag only when they are about to die. They never do so when going to sleep.

> Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
>
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().

The wait_event_idle() called in toggle_allocation_gate() should not
cause a stall. The scheduler should call wq_worker_sleeping(tsk),
which wakes up another idle worker. That should guarantee forward
progress.
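
To illustrate what I mean, here is a rough userspace toy model of the
hand-over that I expect to happen when a busy worker goes to sleep.
It is only a sketch and not the kernel code. The struct, the fields,
and the function name are all made up for this mail, and locking is
ignored completely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT the kernel code. All names are hypothetical. */
struct toy_pool {
	int nr_running;	/* workers that are currently runnable */
	int nr_idle;	/* idle workers that could be woken up */
	int nr_pending;	/* queued work items nobody has picked up yet */
};

/*
 * Model a busy worker going to sleep.
 *
 * Returns true when an idle worker was woken up to take over the pool,
 * false when no wake up was needed or when none was possible.
 */
static bool toy_worker_sleeping(struct toy_pool *pool)
{
	/* The sleeping worker must have been counted as running. */
	assert(pool->nr_running > 0);
	pool->nr_running--;

	/* Another worker still runs or nothing is queued: nothing to do. */
	if (pool->nr_running > 0 || pool->nr_pending == 0)
		return false;

	/* No idle worker left: progress depends on creating a new one. */
	if (pool->nr_idle == 0)
		return false;

	/* Hand the pool over to an idle worker. */
	pool->nr_idle--;
	pool->nr_running++;
	return true;
}
```

In this model, the only case where pending work is left without a
runnable worker is the one where no idle worker is available. That is
the situation I would like the watchdog to point out.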

> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
> which is already held.

As I explained in reply to the cover letter, sleeping workers should
not block forward progress. It seems that, in this case, either the
system was not able to wake up another idle worker, or it was the last
idle worker and the system was not able to fork a new one.

IMHO, we should warn about this when there is no running worker.
It might be more useful than printing backtraces of the sleeping
workers because they likely did not cause the problem.

I believe that the problem, in this particular situation, is that
the system can't schedule or fork new processes. It might help
to warn about it and maybe show the backtrace of the currently
running process on the stalled CPU.
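
The decision I have in mind could look roughly like the following
userspace toy model. Again, this is only a sketch under my own
assumptions. The types and names are hypothetical, and the real code
would have to walk pool->busy_hash under pool->lock instead of
a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT the kernel code. All names are hypothetical. */
struct toy_worker {
	bool running;	/* models "the worker task is currently running" */
};

enum toy_stall_report {
	TOY_REPORT_HOGS,		/* print backtraces of running workers */
	TOY_REPORT_NO_RUNNING_WORKER,	/* warn and look at the stalled CPU */
};

/* Decide what the stall report should contain for one stalled pool. */
static enum toy_stall_report
toy_classify_stall(const struct toy_worker *busy, size_t nr_busy)
{
	size_t i;

	assert(busy != NULL || nr_busy == 0);

	/* A running worker is the likely hog, keep the current behavior. */
	for (i = 0; i < nr_busy; i++) {
		if (busy[i].running)
			return TOY_REPORT_HOGS;
	}

	/* Only sleeping workers (or none): they likely are not the cause. */
	return TOY_REPORT_NO_RUNNING_WORKER;
}
```

The point is that the sleeping workers are never dumped blindly.
They only change which message gets printed.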

Anyway, I think we could do better here. And blindly printing backtraces
from all workers would do more harm than good in most situations.

Best Regards,
Petr