Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid

From: Peter Zijlstra
Date: Thu Apr 06 2023 - 05:53:52 EST


On Wed, Apr 05, 2023 at 04:37:42PM -0400, Mathieu Desnoyers wrote:
> On 2023-04-05 12:26, Mathieu Desnoyers wrote:
> [...]
> > #ifdef CONFIG_SCHED_MM_CID
> > +/*
> > + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
> > + * that the rq lock is held.
> > + */
> > +void sched_mm_cid_migrate_from(struct task_struct *t)
> > +{
> > + int src_cid, *src_pcpu_cid, last_mm_cid;
> > + struct mm_struct *mm = t->mm;
> > + struct rq *src_rq;
> > + struct task_struct *src_task;
> > +
> > + if (!mm)
> > + return;
> > +
> > + last_mm_cid = t->last_mm_cid;
> > + /*
> > + * If the migrated task has no last cid, or if the current
> > + * task on src rq uses the cid, it means the destination cpu
> > + * does not have to reallocate its cid to keep the cid allocation
> > + * compact.
> > + */
> > + if (last_mm_cid == -1)
> > + return;
> > +
> > + src_rq = task_rq(t);
> > + src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> > + src_cid = READ_ONCE(*src_pcpu_cid);
> > +
> > + if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
> > + return;
> > +
> > + /*
> > + * If we observe an active task using the mm on this rq, it means we
> > + * are not the last task to be migrated from this cpu for this mm, so
> > + * there is no need to clear the src_cid.
> > + */
> > + rcu_read_lock();
> > + src_task = rcu_dereference(src_rq->curr);
> > + if (src_task->mm_cid_active && src_task->mm == mm) {
> > + rcu_read_unlock();
> > + t->last_mm_cid = -1;
> > + return;
> > + }
> > + rcu_read_unlock();
> > +
> > + /*
> > + * If the source cpu cid is set, and matches the last cid of the
> > + * migrated task, clear the source cpu cid to keep cid allocation
> > + * compact to cover the case where this task is the last task using
> > + * this mm on the source cpu. If there happens to be other tasks left
> > + * on the source cpu using this mm, the next task using this mm will
> > + * reallocate its cid on context switch.
> > + *
> > + * We cannot keep ownership of concurrency ID without runqueue
> > + * lock held when it is not used by a current task, because it
> > + * would lead to allocation of more concurrency ids than there
> > + * are possible cpus in the system. The last_mm_cid is used as
> > + * a hint to conditionally unset the dst cpu cid, keeping
> > + * allocated concurrency ids compact.
> > + */
> > + if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> > + return;
> > +
> > + /*
> > + * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> > + * rq->curr->mm matches the scheduler barrier in context_switch()
> > + * between store to rq->curr and load of prev and next task's
> > + * per-mm/cpu cid.
>
> Thinking about it further, I suspect the transition:
>
> * user -> kernel lazy + mmgrab() active
>
> in context_switch() lacks a memory barrier we need here (it's only an
> atomic_inc with mmgrab()).
>
> The scenario is a transition from user to kernel thread happening
> concurrently with migrate-from.
>
> Because there is no memory barrier between set rq->curr to a value that
> has a NULL mm and loading per-mm/cpu cid value in mm_cid_put_lazy() for the
> prev task, nothing guarantees that the following src_rq->curr rcu
> dereference here will observe the store.
>
> Or am I missing something ?

OK, so last night I thought it was me needing sleep (which might still
be the case), but now I'm still not quite getting it. Let me draw a
picture, perhaps that'll help...

It reads to me like you're treating a barrier as 'makes visible', which
it never will.


CPU0 will run a user task A and switch to a kernel thread K;
CPU1 will concurrently run sched_mm_cid_migrate_from().


CPU0 CPU1

.. runs A ..

if (src_task->mm_cid_active && src_task->mm == mm)
// false, proceed

cmpxchg(); // set LAZY

__schedule()
(A)
rq->curr = K;
(B)
context_switch()
prepare_task_switch()
switch_mm_cid()
cid_put_lazy(prev)
cid_get(next)
mmgrab(); // user->kernel active_mm
switch_to();



.. runs K ..


And it is this second compare that can either observe A or K, right?
That is the locations A or B above. Later doesn't make sense because
then switch_mm_cid() will have taken care of things.

If A, still false, and we continue to mark UNSET and put.
If B, we see K, which will not have mm_cid_active set so still false and
same as above.

What am I missing?