Re: Workqueues splat due to ending up on wrong CPU

From: Tejun Heo
Date: Tue Dec 03 2019 - 13:14:04 EST


Hello, Paul.

On Tue, Dec 03, 2019 at 09:45:47AM -0800, Paul E. McKenney wrote:
> Good point, and yes, you have told me this before.
>
> Furthermore, in all of these cases, the process was supposed to be
> running on CPU 0, which cannot be taken offline on any of the systems
> under test. Which is leading me to wonder if the workqueue CPU-online
> notifier is sometimes moving more kthreads to the newly onlined CPU than
> it is supposed to. Tejun, could that be happening?

All the warnings that you posted are from rescuers, which jump between
CPUs so that they're on the correct CPU for the specific work item
being rescued. This is completely separate from the usual worker
management, and rescuers don't interact with hot[un]plug callbacks in
any way. I think something like the following is what's happening:

* A work item is queued to CPU 5 but it hasn't been dispatched for a
  bit, so the rescuer gets summoned. The rescuer executes the work
  item and stays there.

* CPU 5 goes down. The rescuer is asleep and doesn't get affected.

* CPU 5 is coming up. It has online set but the stopper hasn't been
  enabled yet.

* A work item was queued on CPU 0 but hasn't been dispatched for a
  bit, so the rescuer is woken up.

* The rescuer wakes up fine on CPU 5 as it's online. Seeing the CPU 0
  work item, the rescuer tries to migrate to CPU 0 by calling
  set_cpus_allowed_ptr(); however, the stopper isn't up yet and the
  migration doesn't actually happen.

* Boom. The rescuer is now executing the CPU 0 work item on CPU 5.
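
The sequence above can be replayed as a small user-space toy model. This
is only a sketch under stated assumptions, not kernel code: struct
model_cpu, struct model_rescuer, model_set_cpus_allowed() and
model_replay_race() are all hypothetical names, and the model assumes
that a migration request silently does nothing when the stopper on the
task's current CPU is not enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical per-CPU hotplug state; not the kernel's data structures. */
struct model_cpu {
	bool online;          /* CPU is marked online */
	bool stopper_enabled; /* stopper can carry out migrations */
};

/* Hypothetical rescuer: only tracks which CPU the task sits on. */
struct model_rescuer {
	int cpu;
};

/*
 * Toy stand-in for set_cpus_allowed_ptr(). Assumption of this model:
 * the move only takes effect if the stopper on the task's current CPU
 * is enabled, and the caller gets 0 back either way.
 */
static int model_set_cpus_allowed(struct model_cpu *cpus,
				  struct model_rescuer *r, int target)
{
	assert(target >= 0 && target < MODEL_NR_CPUS);
	if (cpus[r->cpu].stopper_enabled)
		r->cpu = target;
	return 0;
}

/*
 * Replay the scenario and return the CPU on which the CPU 0 work item
 * ends up being executed. If stopper_up_before_wakeup is true, the
 * stopper on CPU 5 is enabled before the rescuer is woken.
 */
static int model_replay_race(bool stopper_up_before_wakeup)
{
	struct model_cpu cpus[MODEL_NR_CPUS];
	struct model_rescuer rescuer = { .cpu = 0 };
	int i;

	for (i = 0; i < MODEL_NR_CPUS; i++) {
		cpus[i].online = true;
		cpus[i].stopper_enabled = true;
	}

	/* Rescuer is summoned for a CPU 5 work item and stays there. */
	model_set_cpus_allowed(cpus, &rescuer, 5);

	/* CPU 5 goes down; the sleeping rescuer is not touched. */
	cpus[5].online = false;
	cpus[5].stopper_enabled = false;

	/* CPU 5 is coming up: online is set, stopper maybe not yet. */
	cpus[5].online = true;
	cpus[5].stopper_enabled = stopper_up_before_wakeup;

	/* Rescuer wakes on CPU 5, which is fine because it is online. */
	assert(cpus[rescuer.cpu].online);

	/* It sees the CPU 0 work item and tries to migrate to CPU 0. */
	model_set_cpus_allowed(cpus, &rescuer, 0);

	/* The work item runs wherever the rescuer actually is. */
	return rescuer.cpu;
}
```

In this model, model_replay_race(false) returns 5, which corresponds to
the splat, while model_replay_race(true) returns 0, the case where the
stopper is already enabled when the rescuer is woken.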

Thanks.

--
tejun