Re: [PATCH RFC 2/3] workqueue: trigger a single-CPU backtrace for stalled pools

From: Breno Leitao

Date: Mon Jun 22 2026 - 07:14:22 EST

Hello Petr,

On Fri, Jun 19, 2026 at 03:42:56PM +0200, Petr Mladek wrote:
> It makes some sense. wq_watchdog_timer_fn() checks either
> 'per_cpu(wq_watchdog_touched_cpu)' or the global 'wq_watchdog_touched'
> depending whether pool->cpu is set or not. And it seems to be wrong
> for disassociated pools.
>
> But this seems to be an existing problem which should be fixed
> separately.

Good observation. When a CPU goes offline, its pool becomes disassociated and
only the workers' CPU affinity changes. pool->cpu stays set and keeps pointing
at the now-offline CPU.

Later in wq_watchdog_timer_fn(), when checking the stalled pool:

	if (pool->cpu >= 0)
		touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));

This reads wq_watchdog_touched_cpu for the offline CPU, which is still being
updated by wq_watchdog_reset_touched() via for_each_possible_cpu()
(which walks every possible CPU, including offlined ones).

Regardless of whether the CPU is online or offline,
wq_watchdog_reset_touched() will mark it as touched.
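
To make that concrete, here is a minimal user-space sketch of the reset loop.
It is a model, not kernel code: the model_* names, the fixed MODEL_NR_CPUS and
the explicit "now" argument (standing in for jiffies) are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical number of possible CPUs */

/* Stand-ins for wq_watchdog_touched and the per-CPU wq_watchdog_touched_cpu. */
static unsigned long model_touched;
static unsigned long model_touched_cpu[MODEL_NR_CPUS];

/* Stand-in for the online mask; the reset loop below never consults it. */
static bool model_cpu_online[MODEL_NR_CPUS] = { true, true, true, true };

/* Models wq_watchdog_reset_touched(): stamp the global and every possible CPU. */
static void model_reset_touched(unsigned long now)
{
	int cpu;

	model_touched = now;
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)	/* models for_each_possible_cpu() */
		model_touched_cpu[cpu] = now;
}
```

With model_cpu_online[2] cleared, model_reset_touched(100) still leaves
model_touched_cpu[2] at 100, so the offline CPU's stamp always looks fresh.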

The real problem is that pool->cpu now names an offline CPU:

- the per-cpu "touched" heartbeat we consult is the wrong one. The pool's
  work now runs on online CPUs (it behaves like an unbound pool), so the
  global wq_watchdog_touched is the correct grace signal.

- the same pool->cpu >= 0 test marks the pool as cpu_stall and aims the new
  single-CPU backtrace at the offline CPU.

So, I suppose we have a few options:

1) Set pool->cpu to -1 at disassociation time. But that would lose the
CPU number that is needed to rebind later, so we would have to back up
pool->cpu if we decide to unset it. workqueue_online_cpu() matches on it:

	int workqueue_online_cpu(unsigned int cpu) {
		...
		if (pool->cpu == cpu)

2) Treat a pool as cpuless while it is disassociated.

	static int pool_watchdog_cpu(struct worker_pool *pool)
	{
		if (pool->cpu < 0 || (pool->flags & POOL_DISASSOCIATED))
			return -1;
		return pool->cpu;
	}

and replace every pool->cpu read in the stall code path with
pool_watchdog_cpu(). I lean towards 2).
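
As a rough illustration of what 2) changes, here is a user-space sketch of the
heartbeat selection going through such a helper. Everything named model_* or
MODEL_* is a hypothetical stand-in (the struct is not struct worker_pool, and
the flag value is made up). Only the helper's logic mirrors the snippet above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4			/* hypothetical number of possible CPUs */
#define MODEL_POOL_DISASSOCIATED 0x1	/* stand-in for POOL_DISASSOCIATED */

struct model_pool {
	int cpu;		/* -1 for unbound pools */
	unsigned int flags;
};

/* Stand-ins for wq_watchdog_touched and the per-CPU wq_watchdog_touched_cpu. */
static unsigned long model_touched;
static unsigned long model_touched_cpu[MODEL_NR_CPUS];

/* Same logic as the pool_watchdog_cpu() helper proposed above. */
static int model_pool_watchdog_cpu(const struct model_pool *pool)
{
	if (pool->cpu < 0 || (pool->flags & MODEL_POOL_DISASSOCIATED))
		return -1;
	return pool->cpu;
}

/* Pick the grace timestamp the watchdog would compare against. */
static unsigned long model_pool_touched(const struct model_pool *pool)
{
	int cpu = model_pool_watchdog_cpu(pool);

	if (cpu >= 0) {
		assert(cpu < MODEL_NR_CPUS);
		return model_touched_cpu[cpu];	/* bound pool on an online CPU */
	}
	return model_touched;			/* unbound or disassociated pool */
}
```

In this model a pool with cpu == 2 and the disassociated flag set keeps
pool->cpu intact for a later rebind, yet falls back to the global stamp and
reports no CPU to backtrace.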

Either way this is unrelated to this patchset, so my suggestion is:

1) I respin this RFC with your Reviewed-by + a cpu_online() check before
triggering the backtrace:

	if (!found_running && cpu_online(cpu))
		trigger_single_cpu_backtrace(cpu);

2) we continue the disassociated-pool discussion separately, so it does not
block this series.

Thanks,
--breno