Re: Workqueues splat due to ending up on wrong CPU

From: Paul E. McKenney
Date: Tue Dec 03 2019 - 12:45:50 EST


On Tue, Dec 03, 2019 at 11:00:10AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 02, 2019 at 03:39:44PM -0800, Paul E. McKenney wrote:
>
> > I think that I do not understand the code, but I never let that stop
> > me from asking stupid questions! ;-)
> >
> > Suppose that a given worker is bound to a particular CPU, but has no
> > work pending, and is therefore sleeping in the schedule() call near the
> > end of worker_thread(). During this time, its CPU goes offline and then
> > comes back online. Doesn't this break that task's affinity to that CPU?
>
> No. The thing about sleeping tasks is that they're not in fact on any
> CPU at all. Only when a task wakes up do we concern ourselves with
> placing it. If at that time we find the CPU it was constrained to is no
> longer with us, then we go break affinity.
>
> But if the CPU went away and came back while the task was asleep, it
> will not notice anything.

Good point, and yes, you have told me this before.
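
To make the point above concrete, here is a toy user-space model of
"placement is decided only at wakeup". It is a sketch under loud
assumptions and is not kernel code: struct toy_task, toy_online_mask,
toy_cpu_offline(), toy_cpu_online() and toy_wake_up() are all
hypothetical names invented for illustration, and the "pick the lowest
usable CPU" policy is a stand-in for whatever the scheduler really does.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	unsigned int allowed;   /* bitmask of CPUs the task may run on */
	int cpu;                /* CPU chosen at the most recent wakeup */
	bool affinity_broken;   /* set once the allowed mask had to be widened */
};

/* Bitmask of online CPUs; all of them start online. */
unsigned int toy_online_mask = (1u << TOY_NR_CPUS) - 1;

/* Hotplug only flips a bit here; sleeping tasks are never touched. */
void toy_cpu_offline(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_online_mask &= ~(1u << cpu);
}

void toy_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_online_mask |= 1u << cpu;
}

/*
 * Placement happens here and nowhere else.  If none of the task's
 * allowed CPUs is online at this instant, widen the mask to all
 * online CPUs ("break affinity").  Returns the chosen CPU, or -1 if
 * no CPU is online at all.
 */
int toy_wake_up(struct toy_task *t)
{
	unsigned int usable = t->allowed & toy_online_mask;
	int cpu;

	if (!usable) {
		t->allowed = toy_online_mask;
		t->affinity_broken = true;
		usable = toy_online_mask;
	}

	/* Assumed policy for the sketch: lowest-numbered usable CPU. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (usable & (1u << cpu)) {
			t->cpu = cpu;
			return cpu;
		}
	}
	return -1;
}
```

In this model, a task allowed only on CPU 2 that sleeps across
toy_cpu_offline(2) followed by toy_cpu_online(2) wakes up on CPU 2 with
its mask untouched. Only a wakeup that happens while CPU 2 is still
offline widens the mask and places the task elsewhere.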

Furthermore, in all of these cases, the process was supposed to be
running on CPU 0, which cannot be taken offline on any of the systems
under test. Which is leading me to wonder if the workqueue CPU-online
notifier is sometimes moving more kthreads to the newly onlined CPU than
it is supposed to. Tejun, could that be happening?

Thanx, Paul