Re: [PATCH 2/4] sched: Document Program-Order guarantees

From: Paul Turner
Date: Mon Nov 02 2015 - 15:27:42 EST


On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> These are some notes on the scheduler locking and how it provides
> program order guarantees on SMP systems.
>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Will Deacon <will.deacon@xxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: David Howells <dhowells@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/sched/core.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 142 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1905,6 +1905,148 @@ static void ttwu_queue(struct task_struc
> raw_spin_unlock(&rq->lock);
> }
>
> +/*
> + * Notes on Program-Order guarantees on SMP systems.
> + *
> + *
> + * PREEMPTION/MIGRATION
> + *
> + * Regular preemption/migration is safe because, as long as the task is runnable,
> + * migrations involve both rq locks, albeit not (necessarily) at the same time.
> + *
> + * So we get (we allow 3 CPU migrations):
> + *
> + *   CPU0            CPU1            CPU2
> + *
> + *   LOCK rq(0)->lock
> + *   sched-out X
> + *   sched-in Y
> + *   UNLOCK rq(0)->lock
> + *
> + *                                   LOCK rq(0)->lock // MB against CPU0
> + *                                   dequeue X
> + *                                   UNLOCK rq(0)->lock
> + *
> + *                                   LOCK rq(1)->lock
> + *                                   enqueue X
> + *                                   UNLOCK rq(1)->lock
> + *
> + *                   LOCK rq(1)->lock // MB against CPU2
> + *                   sched-out Z
> + *                   sched-in X
> + *                   UNLOCK rq(1)->lock
> + *
> + * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
> + * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
> + * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
> + * observe the state it left behind on CPU0.
> + *

I suspect this part might be more explicitly expressed by specifying
the requirements that migration satisfies, then providing an example.
This makes it easier for others to reason about the locks and saves
worrying about whether the examples hit our 3 million sub-cases.

I'd also propose just dropping preemption from this part; we only need
memory order to be correct on migration, whether it's scheduled or not
[it also invites confusion with the wake-up case].

Something like:
When any task 't' migrates, all activity on its prior cpu [c1] is
guaranteed to happen-before any subsequent execution on its new
cpu [c2]. There are 3 components to enforcing this.

[c1] 1) Sched-out of t requires rq(c1)->lock
[any cpu] 2) Any migration of t, by any cpu, is required to synchronize
on *both* rq(c1)->lock and rq(c2)->lock
[c2] 3) Sched-in of t requires rq(c2)->lock

Transitivity guarantees that (2) orders after (1) and (3) after (2).
Note that in some cases (e.g. active, or idle cpu) the balancing cpu
in (2) may be c1 or c2.

[Follow example]
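
To make (2) concrete, here's roughly what any migration path has to look
like under that scheme. This is a sketch, not the actual kernel code (the
function name is made up), but double_rq_lock()/double_rq_unlock(),
deactivate_task(), set_task_cpu() and activate_task() are the primitives
such a path would be built from:

  /*
   * Sketch of step (2): whoever migrates t holds both runqueue locks,
   * so it is ordered after the sched-out on c1 (1) and before the
   * sched-in on c2 (3).
   */
  static void migrate_task_sketch(struct task_struct *t,
                                  struct rq *src, struct rq *dst)
  {
          double_rq_lock(src, dst);       /* both rq(c1)->lock and rq(c2)->lock */
          deactivate_task(src, t, 0);     /* dequeue from rq(c1) */
          set_task_cpu(t, cpu_of(dst));
          activate_task(dst, t, 0);       /* enqueue on rq(c2) */
          double_rq_unlock(src, dst);
  }

(The real paths don't necessarily hold both locks at the same time, per the
comment above, but they all end up synchronizing on both.)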


> + *
> + * BLOCKING -- aka. SLEEP + WAKEUP
> + *
> + * For blocking, things are a little more interesting, because when we dequeue
> + * the task, we don't need to acquire the old rq lock in order to migrate it.
> + *
> + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
> + * to CPU2 (the most complex example):
> + *
> + *   CPU0 (schedule)    CPU1 (try_to_wake_up)   CPU2 (sched_ttwu_pending)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + *     break
> + *                      cond = true
> + *
> + *   LOCK rq(0)->lock   LOCK X->pi_lock
> + *
> + *   dequeue X
> + *                      while (X->on_cpu)
> + *                        cpu_relax()
> + *   sched-out X
> + *   RELEASE
> + *   X->on_cpu = 0
> + *                      RMB
> + *                      X->state = WAKING
> + *                      set_task_cpu(X,2)
> + *                      WMB
> + *                      ti(X)->cpu = 2
> + *
> + *                      llist_add(X, rq(2)) // MB
> + *                                              llist_del_all() // MB
> + *
> + *                                              LOCK rq(2)->lock
> + *                                              enqueue X
> + *                                              X->state = RUNNING
> + *                                              UNLOCK rq(2)->lock
> + *
> + *                                              LOCK rq(2)->lock
> + *                                              sched-out Z
> + *                                              sched-in X
> + *                                              UNLOCK rq(2)->lock
> + *
> + *                      if (cond) // _TRUE_
> + *                      UNLOCK X->pi_lock
> + *   UNLOCK rq(0)->lock
> + *
> + * So in this case the scheduler does not provide an obvious full barrier; but
> + * the smp_store_release() in finish_lock_switch(), paired with the control-dep
> + * and smp_rmb() in try_to_wake_up(), forms a release-acquire pair and fully
> + * orders things between CPU0 and CPU1.
> + *
> + * The llist primitives order things between CPU1 and CPU2 -- the alternative
> + * is CPU1 doing the remote enqueue (the alternative path in ttwu_queue()) in
> + * which case the rq(2)->lock release/acquire will order things between them.

I'd vote for similar treatment here. We can make the dependency on
on_cpu much more obvious, something that is a reasonably subtle detail
today.
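
Spelling the pairing out might help; roughly (paraphrasing the code, not a
verbatim copy):

  /* finish_lock_switch(), on the cpu X is leaving: */
  smp_store_release(&prev->on_cpu, 0);    /* RELEASE: publishes the sched-out state */

  /* try_to_wake_up(), on the waking cpu: */
  while (p->on_cpu)                       /* control dependency ...          */
          cpu_relax();
  smp_rmb();                              /* ... plus rmb = the ACQUIRE side */

  p->state = TASK_WAKING;
  set_task_cpu(p, cpu);                   /* now safe without the old rq lock */

It's only because of this release/acquire on ->on_cpu that the wake-up side
can migrate the task without ever touching rq(0)->lock.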

> + *
> + * Which again leads to the guarantee that by the time X gets to run on CPU2
> + * it must observe the state it left behind on CPU0.
> + *
> + * However, for blocking there is a second guarantee we must provide, namely we
> + * must observe the state that led to our wakeup. That is, not only must X
> + * observe its own prior state, it must also observe the @cond store.
> + *
> + * This too is achieved in the above: since CPU1 does the waking, we only need
> + * the ordering between CPU1 and CPU2, which is the same as the above.
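
For completeness, the sleeper/waker pattern being described is schematically
(the open-coded equivalent of the wait_event()/wake_up() pair in the example;
not the actual wait-queue implementation):

  /* sleeper: X, initially on CPU0 */
  for (;;) {
          set_current_state(TASK_UNINTERRUPTIBLE);  /* store + implied smp_mb() */
          if (cond)
                  break;
          schedule();
  }
  __set_current_state(TASK_RUNNING);

  /* waker: CPU1 */
  cond = true;
  wake_up_process(X);

The second guarantee is exactly that once X returns from schedule(), on
whichever cpu it ends up running, its load of cond must see CPU1's store.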
> + *
> + *
> + * There is however a much more interesting case for this guarantee, where X
> + * never makes it off CPU0:
> + *
> + *   CPU0 (schedule)    CPU1 (try_to_wake_up)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + *     break
> + *                      cond = true
> + *
> + *                      WMB (aka smp_mb__before_spinlock)
> + *                      LOCK X->pi_lock
> + *
> + *                      if (X->on_rq)
> + *                        LOCK rq(0)->lock
> + *                        X->state = RUNNING
> + *                        UNLOCK rq(0)->lock
> + *
> + *   LOCK rq(0)->lock // MB against CPU1
> + *   UNLOCK rq(0)->lock
> + *
> + *   if (cond) // _TRUE_
> + *
> + *                      UNLOCK X->pi_lock
> + *
> + * Here our task X never quite leaves CPU0; the wakeup happens before we can
> + * dequeue and schedule someone else. In this case we must still observe cond
> + * after our call to schedule() completes.
> + *
> + * This is achieved by the smp_mb__before_spinlock() WMB, which ensures the store
> + * cannot leak inside the LOCK, and LOCK rq(0)->lock on CPU0 provides full order
> + * against the UNLOCK rq(0)->lock from CPU1. Furthermore, our load of cond cannot
> + * happen before this same LOCK.
> + *
> + * Therefore, again, we're good.
> + */
> +
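
For this last example it might also be worth quoting the waker side the WMB
refers to; roughly (again paraphrasing try_to_wake_up(), not a verbatim copy):

  /*
   * try_to_wake_up(), on CPU1: the barrier keeps the caller's
   * cond = true store from being reordered with the p->state
   * check below.
   */
  smp_mb__before_spinlock();
  raw_spin_lock_irqsave(&p->pi_lock, flags);
  if (!(p->state & state))
          goto out;                           /* not sleeping, nothing to do */

  if (p->on_rq && ttwu_remote(p, wake_flags)) /* takes rq(0)->lock,       */
          goto stat;                          /* sets p->state = RUNNING  */

That rq(0)->lock acquire/release in ttwu_remote() is the UNLOCK that CPU0's
subsequent LOCK rq(0)->lock in schedule() fully orders against.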
> /**
> * try_to_wake_up - wake up a thread
> * @p: the thread to be awakened
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/