Re: [RFC PATCH] sched: Fix performance regression introduced by mm_cid (v2)

From: Mathieu Desnoyers
Date: Wed Apr 05 2023 - 10:22:39 EST


On 2023-04-05 08:57, Peter Zijlstra wrote:
On Wed, Apr 05, 2023 at 08:15:35AM -0400, Mathieu Desnoyers wrote:
+/*
+ * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
+ * that the rq lock is held.
+ */
+void sched_mm_cid_migrate_from(struct task_struct *t)
+{
+	int src_cid, *src_pcpu_cid, last_mm_cid;
+	struct mm_struct *mm = t->mm;
+	struct rq *src_rq;
+	struct task_struct *src_task;
+
+	if (!mm)
+		return;
+
+	last_mm_cid = t->last_mm_cid;
+	/*
+	 * If the migrated task has no last cid, or if the current
+	 * task on src rq uses the cid, it means the destination cpu
+	 * does not have to reallocate its cid to keep the cid allocation
+	 * compact.
+	 */
+	if (last_mm_cid == -1)
+		return;
+
+	src_rq = task_rq(t);
+	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
+	src_cid = READ_ONCE(*src_pcpu_cid);
+
+	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
+		return;
+
+	/*
+	 * If we observe an active task using the mm on this rq, it means we
+	 * are not the last task to be migrated from this cpu for this mm, so
+	 * there is no need to clear the src_cid.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);

Continuing our discussion from IRC; your concern was whether we need a
barrier near RCU_INIT_POINTER() in __schedule(). Now, typically such a
site would use rcu_assign_pointer() and be a store-release, which is
omitted in this case.

Specifically as commit 5311a98fef7d argues, there's at least one barrier
in between most fields being set and this assignment.

On top of that, the below only has the ->mm dependent load, and task->mm
is fairly constant. The obvious exception being kthread_use_mm().

+	if (src_task->mm_cid_active && src_task->mm == mm) {
+		rcu_read_unlock();
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();

So if we get here, then rq->curr->mm was observed to not match t->mm.
However, nothing stops the rq from switching to a task that does match
right here.

+
+	/*
+	 * If the source cpu cid is set, and matches the last cid of the
+	 * migrated task, clear the source cpu cid to keep cid allocation
+	 * compact to cover the case where this task is the last task using
+	 * this mm on the source cpu. If there happens to be other tasks left
+	 * on the source cpu using this mm, the next task using this mm will
+	 * reallocate its cid on context switch.
+	 *
+	 * We cannot keep ownership of concurrency ID without runqueue
+	 * lock held when it is not used by a current task, because it
+	 * would lead to allocation of more concurrency ids than there
+	 * are possible cpus in the system. The last_mm_cid is used as
+	 * a hint to conditionally unset the dst cpu cid, keeping
+	 * allocated concurrency ids compact.
+	 */
+	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
+		return;

So we set LAZY, and because that switch above will not observe this
flag, we must check again:

And if there has indeed been a switch, that CPU will have gone through
at least one smp_mb() (there's one at the start of __schedule()), so
either way, it will see the LAZY flag or we will see the updated rq->curr, or both.

+
+	/*
+	 * If we observe an active task using the mm on this rq after setting
+	 * the lazy-put flag, this task will be responsible for transitioning
+	 * from lazy-put flag set to MM_CID_UNSET.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);
+	if (src_task->mm_cid_active && src_task->mm == mm) {
+		rcu_read_unlock();
+		/*
+		 * We observed an active task for this mm, so clearing the
+		 * destination cpu mm_cid is not relevant for compactness.
+		 */
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();

It is still unused, so wipe it.

+
+	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
+		return;
+	__mm_cid_put(mm, src_cid);
+}

Did I miss any races?

I think your analysis is correct. The full barrier I thought was missing between
store to rq->curr and load of per-mm/cpu cid is indeed in context_switch() in the
next->mm != NULL case:

	membarrier_switch_mm(rq, prev->active_mm, next->mm);
	/*
	 * sys_membarrier() requires an smp_mb() between setting
	 * rq->curr / membarrier_switch_mm() and returning to userspace.
	 *
	 * The below provides this either through switch_mm(), or in
	 * case 'prev->active_mm == next->mm' through
	 * finish_task_switch()'s mmdrop().
	 */
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
	lru_gen_use_mm(next->mm);

	if (!prev->mm) {                        // from kernel
		/* will mmdrop() in finish_task_switch(). */
		rq->prev_mm = prev->active_mm;
		prev->active_mm = NULL;
	}

And in case you happen to wonder who wrote that big comment about
the barrier, it happens to be me, back in 2018. :-)

So that barrier takes care of ordering the store to rq->curr before the
load of the per-mm/cpu cid.

And as you said, the smp_mb() at the beginning of __schedule() takes care
of ordering the load of the per-mm/cpu cid done by the previous context
switch with respect to the next context switch's store to rq->curr.
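
Spelling out the pairing as I picture it (rough sketch only, not code from
the patch; each side stores to one location and then loads the other, with
a full barrier in between):

	migrate-from:
		cmpxchg() sets lazy-put flag in *src_pcpu_cid	/* store */
		  (cmpxchg() implies a full barrier)
		src_task = rcu_dereference(src_rq->curr)	/* load */

	context switch on the src cpu:
		rq->curr = next					/* store */
		  smp_mb() via switch_mm_irqs_off() or
		  finish_task_switch()'s mmdrop()
		load/update of *src_pcpu_cid on behalf of next	/* load */

So either migrate-from observes the new rq->curr (and leaves the lazy-put
to that task), or the task on the src cpu observes the lazy-put flag (and
is responsible for transitioning it to MM_CID_UNSET), or both; the two
loads cannot both miss the other side's store.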

The only other thing I am concerned about is an update to t->mm_cid_active
of the src rq's current task racing with sched_mm_cid_migrate_from()
migrating a task away from that src rq. mm_cid_active is set
to 0 on the current task by sched_mm_cid_exit_signals() and
sched_mm_cid_before_execve(), and set back to 1 by sched_mm_cid_after_execve().

In both paths that set it to 0, there is also a mm_cid_put() which
releases the cid (same effect as what we aim to do in migrate-from). But
I think we need additional barriers here: e.g. clearing t->mm_cid_active
should be ordered before calling mm_cid_put(t), so a concurrent migrate-from
does not set a lazy-put flag that will never be handled. We should also add
a barrier between mm_cid_get() and setting t->mm_cid_active to 1 in
sched_mm_cid_after_execve().
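
Something like the following, perhaps (completely untested sketch, just to
show where I think the ordering is needed; I am hand-waving the exact
mm_cid_put()/mm_cid_get() arguments):

	/* sched_mm_cid_exit_signals() / sched_mm_cid_before_execve() */
	t->mm_cid_active = 0;
	/*
	 * Order the mm_cid_active = 0 store before the cid release, so a
	 * concurrent migrate-from does not set a lazy-put flag that will
	 * never be handled.
	 */
	smp_mb();
	mm_cid_put(t);
	t->last_mm_cid = t->mm_cid = -1;

	/* sched_mm_cid_after_execve() */
	t->last_mm_cid = t->mm_cid = mm_cid_get(mm);
	/* Order the cid allocation before the mm_cid_active = 1 store. */
	smp_mb();
	t->mm_cid_active = 1;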

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com