Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid

From: Mathieu Desnoyers
Date: Fri Apr 07 2023 - 21:14:45 EST


On 2023-04-07 19:50, Mathieu Desnoyers wrote:
[...]

Let's introduce task (Y) which has mm == src_task->mm and task (N) which has
mm != src_task->mm for the rest of the discussion.

Let's consider the scheduler state transitions relevant here. There are two
transitions on context switch we care about:

(TSA) Store to rq->curr with transition from (N) to (Y)

(TSB) Store to rq->curr with transition from (Y) to (N)

On the migrate-from side, there is one transition we care about:

(TMA) cmpxchg to *pcpu_cid to set the LAZY flag

There is also a transition to UNSET state which can be performed from all
sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
which guarantees that only a single thread will succeed:

(TMB) cmpxchg to *pcpu_cid to mark UNSET

Just to be clear (at the risk of repeating myself), what we do _not_ want
to happen is a transition to UNSET when a thread is actively using the cid
(property (1)). And ideally we do not want to leak a cid after migrating the
last task from a cpu (property (2)).
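
To make the state machine above concrete, here is a minimal user-space sketch of
the per-cpu cid encoding and of the two cmpxchg transitions (TMA) and (TMB). The
constants and helper names are illustrative only, not the ones used in the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative encoding; the kernel's actual constants may differ. */
#define CID_LAZY_PUT	(1U << 31)	/* LAZY flag or'd onto a valid cid */
#define CID_UNSET	UINT32_MAX	/* no cid owned by this cpu */

/* (TMA) migrate-from: flag the per-cpu cid as LAZY. */
static bool cid_set_lazy_put(_Atomic uint32_t *pcpu_cid, uint32_t cid)
{
	/*
	 * atomic_compare_exchange_strong() defaults to seq_cst ordering;
	 * in the kernel, a successful cmpxchg() implies a full barrier.
	 */
	return atomic_compare_exchange_strong(pcpu_cid, &cid, cid | CID_LAZY_PUT);
}

/* (TMB) any side: LAZY -> UNSET; the cmpxchg lets only one racer win. */
static bool cid_unset(_Atomic uint32_t *pcpu_cid, uint32_t lazy_cid)
{
	return atomic_compare_exchange_strong(pcpu_cid, &lazy_cid, CID_UNSET);
}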

Let's look at the relevant combinations of the TSA/TSB and TMA transitions.

Scenario A) (TSA)+(TMA) (from next task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
  - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_cid to LAZY (TMA)
     *** missing barrier ?? ***                        (implied barrier after cmpxchg)
  - prepare_task_switch()
    - switch_mm_cid()
      - mm_cid_get (next)
        - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If the store of task (Y) is observed by rcu_dereference(), it means that there is still
an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
fulfills property (1). That observed task will itself eventually need a migrate-from
to move it away from that cpu, which fulfills property (2).

If task (Y) is not observed, but the LAZY flag is observed by READ_ONCE(), the task on
CPU0 will move the state to UNSET, which clears the percpu cid, perhaps uselessly (which
is not an issue for correctness). Because task (Y) is not observed, CPU1 can also move
ahead and set the state to UNSET. Because moving the state to UNSET is done with a
cmpxchg expecting that the old state has the LAZY flag set, only one thread will
successfully UNSET.

If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
(Y) and do nothing more, which is fine.

What we are effectively preventing with this Dekker is a scenario where neither the LAZY
flag nor the store of (Y) is observed, which would fail property (1) because it would
UNSET a cid that is actively used.
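
The same Dekker can be written as a small C11 litmus-style sketch (rq_curr and
pcpu_cid stand in for rq->curr and the per-mm/cpu cid; names and the flag value
are illustrative, and the first fence marks the barrier this email argues is
missing):

#include <stdatomic.h>
#include <stdint.h>

#define CID_LAZY_PUT (1U << 31)			/* illustrative flag value */

static _Atomic uintptr_t rq_curr;		/* stands in for rq->curr */
static _Atomic uint32_t pcpu_cid;		/* stands in for mm->pcpu_cid[cpu] */

/* CPU0: context switch (N)->(Y) (TSA), then mm_cid_get() on next. */
static uint32_t cpu0_context_switch(uintptr_t task_y)
{
	atomic_store_explicit(&rq_curr, task_y, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the "missing barrier" */
	return atomic_load_explicit(&pcpu_cid, memory_order_relaxed);
}

/* CPU1: migrate-from sets LAZY (TMA), then checks the remote rq->curr. */
static uintptr_t cpu1_migrate_from(uint32_t old_cid)
{
	uint32_t expected = old_cid;

	atomic_compare_exchange_strong(&pcpu_cid, &expected,
				       old_cid | CID_LAZY_PUT);
	/* In the kernel this barrier is implied by the successful cmpxchg. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&rq_curr, memory_order_relaxed);
}

/*
 * Forbidden outcome: cpu0_context_switch() returns a value without
 * CID_LAZY_PUT set *and* cpu1_migrate_from() does not return task (Y).
 * With store->fence->load on CPU0 and cmpxchg->fence->load on CPU1,
 * at least one side must observe the other's store.
 */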


Scenario B) (TSB)+(TMA) (from prev task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
  - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_cid to LAZY (TMA)
    *** missing barrier ?? ***                         (implied barrier after cmpxchg)
  - prepare_task_switch()
    - switch_mm_cid()
      - cid_put_lazy() (prev)
        - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (N) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If rcu_dereference() observes (N) but LAZY is not observed, migrate-from will take care of
advancing the state to UNSET, thus fulfilling property (2). Because (Y) is not running
anymore, property (1) is fulfilled.

If rcu_dereference() does not observe (N), but LAZY is observed, migrate-from does not
advance to UNSET because it observes (Y); the LAZY flag, however, will make the task on
CPU0 take care of advancing the state to UNSET, thus fulfilling property (2).

If both (N) and LAZY are observed, both migrate-from and CPU0 will try to advance the
state to UNSET, but only one will succeed its cmpxchg.

What we are effectively preventing with this Dekker is a scenario where neither the LAZY
flag nor the store of (N) is observed, which would fail property (2) because it would leak
a cid on a cpu that has no task left using the mm.
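
For the prev side, the put-lazy path only needs to advance LAZY -> UNSET when it
actually observes the flag after the barrier. A hedged sketch, reusing the
illustrative CID_LAZY_PUT/cid_unset() helpers from the first sketch above (not
the patch's mm_cid_put_lazy()):

/*
 * Scenario B, prev-task side: after the (Y)->(N) store to rq->curr and the
 * barrier, re-read the per-cpu cid.  If migrate-from already flagged it
 * LAZY, race it to UNSET; whichever cmpxchg wins fulfills property (2).
 */
static void cid_put_lazy_sketch(_Atomic uint32_t *pcpu_cid)
{
	uint32_t cid = atomic_load_explicit(pcpu_cid, memory_order_relaxed);

	if (cid & CID_LAZY_PUT)
		(void)cid_unset(pcpu_cid, cid);	/* only one racer succeeds */
}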

So based on my analysis, we are indeed missing a barrier between the store to rq->curr
and the load of the per-mm/cpu cid within context_switch().

Am I missing something? How can we provide this barrier with minimal overhead?

Here is the barrier placement I have in mind to minimize the impact:

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..f20fc0600fcc 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
 	atomic_inc(&mm->mm_count);
 }
 
+static inline void smp_mb__after_mmgrab(void)
+{
+	smp_mb__after_atomic();
+}
+
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e0fa4193499..8d410c0dcb39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
-	switch_mm_cid(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 *
 	 * kernel -> user   switch + mmdrop() active
 	 *   user -> user   switch
+	 *
+	 * switch_mm_cid() needs to be updated if the barriers provided
+	 * by context_switch() are modified.
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
@@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		}
 	}
 
+	/* switch_mm_cid() requires the memory barriers above. */
+	switch_mm_cid(prev, next);
+
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
 	prepare_lock_switch(rq, next, rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 {
+	/*
+	 * Provide a memory barrier between rq->curr store and load of
+	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+	 *
+	 * Should be adapted if context_switch() is modified.
+	 */
+	if (!next->mm) {                                // to kernel
+		/*
+		 * user -> kernel transition does not guarantee a barrier, but
+		 * we can use the fact that it performs an atomic operation in
+		 * mmgrab().
+		 */
+		if (prev->mm)                           // from user
+			smp_mb__after_mmgrab();
+		/*
+		 * kernel -> kernel transition does not change rq->curr->mm
+		 * state. It stays NULL.
+		 */
+	} else {                                        // to user
+		/*
+		 * kernel -> user transition does not provide a barrier
+		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+		 * Provide it here.
+		 */
+		if (!prev->mm)                          // from kernel
+			smp_mb();
+		/*
+		 * user -> user transition guarantees a memory barrier through
+		 * switch_mm().
+		 */
+	}
 	if (prev->mm_cid_active) {
 		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com