Re: [PATCH 1/3] sched: select_task_rq() should check cpu_active() like select_fallback_rq()

From: Oleg Nesterov
Date: Wed Oct 14 2015 - 16:08:47 EST


On 10/14, Peter Zijlstra wrote:
>
> On Mon, Oct 12, 2015 at 07:34:02PM +0200, Oleg Nesterov wrote:
> >
> > I simply can't understand... To me it looks as if we can simply remove
> > the cpu_active() check in select_fallback_rq().
> >
> > If we race with cpu_down(), cpu_active() is cleared by sched_cpu_inactive()
> > which is CPU_PRI_SCHED_INACTIVE = INT_MIN + 1 priority, so it seems that
> > only cpuset_cpu_inactive() can be called after that and this check is
> > obviously racy anyway.
>
> Racy yes; however select_task_rq() is called with locks held, therefore
> preemption disabled. This serializes us against the sync_sched() in
> cpu_down().

Yes, I even mentioned this below. And this only means that if you see
cpu_active() == T you know that smpboot_park_threads() can't be called
until preempt_disable().

cpu_active() can become false right after the check, preempt_disable()
can't protect from __cpu_notify(CPU_DOWN_PREPARE).

> And note that smpboot_park_threads() runs until after the sync_sched().
> So you cannot add cpu_active() to select_task_rq() for it must allow
> per-cpu tasks to remain on the cpu.

Yes, yes, this is why this patch triggers BUG_ON() before ->park() in
smpboot_thread_fn().

> > So why we can't simply remove select_fallback_rq()->cpu_active() ?
>
> It would allow migration onto a CPU that's known to be going away. That
> doesn't make sense.

But this is not that bad? This is very unlikely and CPU_DYING will
migrate the woken task again.

> > OK, I don't understand the usage of cpu_active_mask in kernel/sched/,
> > and of course I could easily miss something else. But I doubt very
> > much this check can save us from something bad simply because it is
> > racy.
>
> But safely so; if we observe active, it must stay 'active' until we
> enable preemption again.

I don't undertand what "stay active" actually means here. cpu_active()
is not stable. But probably this doesn't matter.

> > Yes, we also have synchronize_sched() after CPU_DOWN_PREPARE, but
> > the only thing we do before stop_machine() is smpboot_park_threads().
> > So this can help (say) set_cpus_allowed_ptr() which uses active_mask,
> > but I don't see how this can connect to ttwu paths.
>
> Say you have a task A running on CPU4, it goes to sleep. We unplug CPU4.
> We then commence unplugging CPU3, concurrently we wake A. A finds that
> its old CPU (4) is no longer online and ends up in fallback looking for
> a new one.
>
> Without the cpu_active() test in fallback, we could place A on CPU3,
> which is on its way down, just not quite dead.

The same can happen with the cpu_active() check we currently have.
And note again that CPU_PRI_SCHED_INACTIVE == INT_MIN + 1. When
sched_cpu_inactive() clears cpu_active() all other callbacks (except
cpuset_cpu_inactive() were already fired.

> Even if this is not strictly buggy its a daft thing to do and tempting
> fate.

See above...

And if we remove this check from select_fallback_rq() we can revert
dd9d3843755da95f6 "sched: Fix cpu_active_mask/cpu_online_mask race".

And revert 875ebe94 "powerpc/smp: Wait until secondaries are active & online".
And a1307bba "s390/smp: wait until secondaries are active & online".

> > And again. If there is some reason we can not do this, say, because
> > ipi to non-active CPU can trigger a warning, or something else, then
> > we can hit the same problem because select_task_rq() does not check
> > cpu_active(). The kernel threads like stoppers/workers are probably
> > fine in any case, but userspace can trigger the race with cpu_up/down.
>
> So the only 'race' is observing active while we're stuck in
> sync_sched(),

Why? select_task_rq() will see cpu_online() == T after sync_sched().

> which is meant to be safe.

Yes, because that damn cpu_active() check doesn't look strictly necessary ;)
Or I misunderstood.

> It also provides us with a
> known spot after which we're sure no new tasks will come onto our dying
> CPU.

You mean that ttwu(task) can't migrate this task to the dying CPU
after synchronize_rcu() and before stop_machine(), this is true.

OK, lets keep this check if you dislike the idea to remove it. But to me
the very fact that select_task_rq() and select_fallback_rq() use different
checks looks just wrong. Even if everything happens to work.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/