Re: [PATCH 1/3] sched: select_task_rq() should check cpu_active() like select_fallback_rq()

From: Peter Zijlstra
Date: Wed Oct 14 2015 - 11:00:36 EST


On Mon, Oct 12, 2015 at 07:34:02PM +0200, Oleg Nesterov wrote:
> On 10/12, Peter Zijlstra wrote:
> >
> > On Sat, Oct 10, 2015 at 08:53:09PM +0200, Oleg Nesterov wrote:
> > > I do not understand the cpu_active() check in select_fallback_rq().
> > > x86 doesn't need it, and the recent commit dd9d3843755d "sched: Fix
> > > cpu_active_mask/cpu_online_mask race" documents the fact that on any
> > > architecture we can ignore !active starting from CPU_ONLINE stage.
> > >
> > > But any possible reason why do we need this check in "fallback" must
> > > equally apply to select_task_rq().
> >
> > So the reason, from vague memory, is that we want to allow per-cpu
> > threads to start/stop before/after active.
>
> I simply can't understand... To me it looks as if we can simply remove
> the cpu_active() check in select_fallback_rq().
>
> If we race with cpu_down(), cpu_active() is cleared by sched_cpu_inactive()
> which is CPU_PRI_SCHED_INACTIVE = INT_MIN + 1 priority, so it seems that
> only cpuset_cpu_inactive() can be called after that and this check is
> obviously racy anyway.

Racy yes; however select_task_rq() is called with locks held, therefore
preemption disabled. This serializes us against the sync_sched() in
cpu_down().

And note that smpboot_park_threads() runs until after the sync_sched().
So you cannot add cpu_active() to select_task_rq() for it must allow
per-cpu tasks to remain on the cpu.

Also, I think smpboot_park_threads() is called too early, we can still
run random tasks at that point which might rely on having the per-cpu
tasks around. But we cannot run it later because once the stopper thread
runs, the per-cpu threads cannot schedule to park themselves :/

Lets keep an eye on Thomas' rework to see if this gets sorted.

> So why we can't simply remove select_fallback_rq()->cpu_active() ?

It would allow migration onto a CPU that's known to be going away. That
doesn't make sense.

> > active 'should' really only govern load-balancer bits or so.
>
> OK, I don't understand the usage of cpu_active_mask in kernel/sched/,
> and of course I could easily miss something else. But I doubt very
> much this check can save us from something bad simply because it is
> racy.

But safely so; if we observe active, it must stay 'active' until we
enable preemption again.

The whole point of active is to limit the load-balancer; such that we
can rebuild the sched-domains after the fact. We used to have to rebuild
to the sched-domains early, which got into trouble (forgot details,
again).

And we had to have a new mask, because online was too late, there was
(forgot details) a state where we were still online but new are not
welcome.

Also, it makes sense to stop accepting new tasks before you take out the
old ones.

> Yes, we also have synchronize_sched() after CPU_DOWN_PREPARE, but
> the only thing we do before stop_machine() is smpboot_park_threads().
> So this can help (say) set_cpus_allowed_ptr() which uses active_mask,
> but I don't see how this can connect to ttwu paths.

Say you have a task A running on CPU4, it goes to sleep. We unplug CPU4.
We then commence unplugging CPU3, concurrently we wake A. A finds that
its old CPU (4) is no longer online and ends up in fallback looking for
a new one.

Without the cpu_active() test in fallback, we could place A on CPU3,
which is on its way down, just not quite dead.

Even if this is not strictly buggy its a daft thing to do and tempting
fate.

> And again. If there is some reason we can not do this, say, because
> ipi to non-active CPU can trigger a warning, or something else, then
> we can hit the same problem because select_task_rq() does not check
> cpu_active(). The kernel threads like stoppers/workers are probably
> fine in any case, but userspace can trigger the race with cpu_up/down.

So the only 'race' is observing active while we're stuck in
sync_sched(), which is meant to be safe. It also provides us with a
known spot after which we're sure no new tasks will come onto our dying
CPU.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/