Re: Workqueues splat due to ending up on wrong CPU

From: Tejun Heo
Date: Tue Nov 26 2019 - 13:33:40 EST


Hello, Paul.

On Mon, Nov 25, 2019 at 03:03:12PM -0800, Paul E. McKenney wrote:
> I am seeing this occasionally during rcutorture runs in the presence
> of CPU hotplug. This is on v5.4-rc1 in process_one_work() at the first
> WARN_ON():
>
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);

Hmm... so it's saying that this worker's pool is supposed to be bound
to a CPU, but the worker is currently running on a different CPU.
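
In other words, the check encodes the invariant sketched below. This
is a minimal paraphrase in C, not the actual workqueue code, and
worker_on_expected_cpu() is a made-up helper name:

/*
 * Minimal paraphrase of the invariant that the WARN_ON_ONCE() checks;
 * not the actual kernel code, worker_on_expected_cpu() is hypothetical.
 */
static bool worker_on_expected_cpu(struct worker_pool *pool)
{
	/* a pool detached from its CPU by hotplug may run anywhere */
	if (pool->flags & POOL_DISASSOCIATED)
		return true;

	/* otherwise its workers must be running on pool->cpu */
	return raw_smp_processor_id() == pool->cpu;
}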

> What should I do to help further debug this?

Do you always see rescuer_thread in the backtrace? Can you please
apply the following patch and reproduce the problem?
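
For context on why I'm asking about the rescuer: it's the one worker
that migrates between pools at runtime, re-attaching via
worker_attach_to_pool() each time, which is exactly where the patch
adds its printk. Roughly, and heavily simplified from rescuer_thread()
(the mayday helpers below are made up):

	/*
	 * Heavily simplified sketch of the rescuer loop in
	 * rescuer_thread(); pool_needs_rescue() and next_mayday_pool()
	 * are made-up stand-ins for the real mayday-list handling.
	 */
	while (pool_needs_rescue()) {
		struct worker_pool *pool = next_mayday_pool();

		/* rebinds the rescuer task to pool->attrs->cpumask */
		worker_attach_to_pool(rescuer, pool);

		/* run the work items that stalled on this pool */
		process_scheduled_works(rescuer);

		worker_detach_from_pool(rescuer);
	}

If set_cpus_allowed_ptr() fails during one of those re-attaches, the
rescuer would keep running on whatever CPU it was on, which would
trigger exactly the warning you're seeing.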

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 914b845ad4ff..6f7f185cd146 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1842,13 +1842,18 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
 				   struct worker_pool *pool)
 {
+	int ret;
+
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
 	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
 	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	if (ret && !(pool->flags & POOL_DISASSOCIATED))
+		printk("XXX worker pid %d failed to attach to cpus of pool %d, ret=%d\n",
+		       task_pid_nr(worker->task), pool->id, ret);
 
 	/*
 	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -2183,8 +2188,10 @@ __acquires(&pool->lock)
 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
 	/* ensure we're on the correct CPU */
-	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
-		     raw_smp_processor_id() != pool->cpu);
+	WARN_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
+		  raw_smp_processor_id() != pool->cpu,
+		  "expected on cpu %d but on cpu %d, pool %d, workfn=%pf\n",
+		  pool->cpu, raw_smp_processor_id(), pool->id, work->func);
 
 	/*
 	 * A single work shouldn't be executed concurrently by

--
tejun