Re: [PATCH] workqueue: Handle race between wake up and rebind

From: Neeraj Upadhyay
Date: Tue Jan 16 2018 - 15:08:17 EST




On 01/16/2018 11:05 PM, Tejun Heo wrote:
Hello, Neeraj.

On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote:
- kworker/0:0 gets chance to run on cpu1; while processing
a work, it goes to sleep. However, it does not decrement
pool->nr_running. This is because WORKER_REBOUND (NOT_
RUNNING) flag was cleared, when worker entered worker_
Do you mean that because REBOUND was set?

Actually, I meant that REBOUND was not set. The sequence is as follows:

- The pool bound to cpu0 is unbound.

- kworker/0:0 is woken up on cpu1.

- The cpu0 pool is rebound.
  REBOUND is set for kworker/0:0.

- kworker/0:0 starts running on cpu1
  worker_thread()
    // It clears REBOUND and sets nr_running = 1 after below call
    worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);

- kworker/0:0 goes to sleep
  wq_worker_sleeping()
    // Below condition is not true, as all NOT_RUNNING
    // flags were cleared in worker_thread()
    if (worker->flags & WORKER_NOT_RUNNING)
    // Below is true, as worker is running on cpu1
    if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
        return NULL;
    // Below is not reached and nr_running stays 1
    if (atomic_dec_and_test(&pool->nr_running) &&

- kworker/0:0 wakes up again, this time on cpu0, because the
  cpus_allowed mask of worker->task was set to cpu0 in rebind_workers().
  wq_worker_waking_up()
    if (!(worker->flags & WORKER_NOT_RUNNING)) {
        // Increments pool->nr_running to 2
        atomic_inc(&worker->pool->nr_running);
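
To make the accounting easier to follow, here is a small user-space
model of the sequence above. It is only a sketch, not kernel code: the
model_* names, the flag values, the plain int counter (instead of
atomic_t) and the two-flag WORKER_NOT_RUNNING mask are simplifications
assumed for illustration. The sleep_cpu argument stands in for
raw_smp_processor_id() at the time the worker goes to sleep.

```c
#include <assert.h>

/* Simplified flag set; values and the reduced mask are assumptions. */
enum {
	WORKER_PREP        = 1 << 0,
	WORKER_REBOUND     = 1 << 1,
	WORKER_NOT_RUNNING = WORKER_PREP | WORKER_REBOUND,
};

struct model_pool {
	int cpu;         /* cpu the pool is bound to */
	int nr_running;  /* plain int here; the kernel uses an atomic */
};

struct model_worker {
	unsigned int flags;
	struct model_pool *pool;
};

/* Clearing the last NOT_RUNNING flag counts the worker as running. */
static void model_clr_flags(struct model_worker *w, unsigned int flags)
{
	unsigned int oflags = w->flags;

	w->flags &= ~flags;
	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING) &&
	    !(w->flags & WORKER_NOT_RUNNING))
		w->pool->nr_running++;
}

/* Mirrors the two early returns quoted from wq_worker_sleeping(). */
static void model_sleeping(struct model_worker *w, int this_cpu)
{
	if (w->flags & WORKER_NOT_RUNNING)
		return;
	if (w->pool->cpu != this_cpu)   /* the WARN_ON_ONCE() path */
		return;                 /* decrement is skipped */
	w->pool->nr_running--;
}

/* Mirrors the increment quoted from wq_worker_waking_up(). */
static void model_waking_up(struct model_worker *w)
{
	if (!(w->flags & WORKER_NOT_RUNNING))
		w->pool->nr_running++;
}

/* Replays the sequence; returns the final nr_running of the cpu0 pool. */
int model_replay(int sleep_cpu)
{
	struct model_pool pool = { .cpu = 0, .nr_running = 0 };
	struct model_worker w = {
		.flags = WORKER_PREP | WORKER_REBOUND,
		.pool = &pool,
	};

	model_clr_flags(&w, WORKER_PREP | WORKER_REBOUND);
	assert(pool.nr_running == 1);   /* worker now counted as running */

	model_sleeping(&w, sleep_cpu);  /* goes to sleep on sleep_cpu */
	model_waking_up(&w);            /* wakes up again on cpu0 */
	return pool.nr_running;
}
```

In this model, model_replay(1) ends with nr_running at 2, which is the
leak described above, while model_replay(0) (the worker sleeping on its
own cpu) ends with nr_running at 1.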


thread().

Worker 0 runs on cpu1
  worker_thread()
    process_one_work()
      wq_worker_sleeping()
        if (worker->flags & WORKER_NOT_RUNNING)
          return NULL;
        if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
          <Does not decrement nr_running>

- After this, when kworker/0:0 wakes up, this time on its
bounded cpu cpu0, it increments pool->nr_running again.
So, pool->nr_running becomes 2.
Why is it suddenly 2? Who made it one on the account of the kworker?
As shown in the comment above, it became 1 in worker_thread(), at the call
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);

Do you see this happening? Or better, is there a (semi) reliable
repro for this issue?
Yes, this was reported in our long-run testing with random CPU hotplug.
Sorry, I don't have a quick reproducer for it. The issue is reported
after a few days of testing.

Thanks.


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation