Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Peter Zijlstra
Date: Thu Apr 13 2023 - 07:11:13 EST


On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> >
> > > I *guess* you might be able to see some contention with hackbench on
> > > that HSW-EX system with v4.
> >
> > Indeed! Notably it seems to be the wakeup from idle that trips it
> > hardest:
>
> Could it be because, for idle cpus, the per-cpu/mm cid is no longer valid
> (for the sake of compaction), and when a task wakes there, it will have to
> re-allocate a new cid through mm_cid_get(), which needs to acquire
> mm->cid_lock?

Yup. And I'm thinking it is futile (and counter-productive) to strive
for compactness in this (nr_threads >= nr_cpus) case.
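
Roughly the path in question (a paraphrased sketch of the v4 mm_cid_get(),
not the exact code -- it omits the lazy-put handling):

	static inline int mm_cid_get(struct mm_struct *mm)
	{
		int *pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
		int cid = READ_ONCE(*pcpu_cid);

		/* Fast path: the per-cpu cid is still valid, no lock taken. */
		if (mm_cid_is_valid(cid))
			return cid;

		/*
		 * Slow path: compaction invalidated this cpu's cid while it
		 * sat idle, so every wakeup lands here and serializes on
		 * mm->cid_lock.
		 */
		raw_spin_lock(&mm->cid_lock);
		cid = __mm_cid_get_locked(mm);
		raw_spin_unlock(&mm->cid_lock);
		WRITE_ONCE(*pcpu_cid, cid);
		return cid;
	}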

The below on v4 solves the contention I see with hackbench (which runs
400 threads, well above the 144 cpu count on that machine).

This obviously leaves a problem for the nr_threads = nr_cpus - 1 case,
but I'm thinking we can add some fuzz (saturate at nr_cpu_ids -
ilog2(nr_cpu_ids+1) perhaps; rough sketch below). Also, I would think
that's not something that typically happens anyway.
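
Something like this on top, purely to illustrate the fuzz (ilog2() from
linux/log2.h; untested):

	/*
	 * Saturate a little before the last cid is handed out so that
	 * nr_threads = nr_cpus - 1 (and thereabouts) also stops the
	 * compaction work.
	 */
	if (cid >= nr_cpu_ids - ilog2(nr_cpu_ids + 1))
		mm->cid_saturated = 1;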

Mathieu, WDYT? -- other than that the patch is an obvious hack :-)

---
 include/linux/mm_types.h | 8 ++++++++
 kernel/fork.c            | 4 +++-
 kernel/sched/core.c      | 9 +++++++++
 kernel/sched/sched.h     | 2 ++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4160ff5c6ebd..598d1b657afa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,7 @@ struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		int cid_saturated;
 		/**
 		 * @pcpu_cid: Per-cpu current cid.
 		 *
@@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
 	return cid & ~MM_CID_LAZY_PUT;
 }
 
+static inline void mm_cid_desaturate(struct mm_struct *mm)
+{
+	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
+		mm->cid_saturated = 0;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
 	int i;
 
 	raw_spin_lock_init(&mm->cid_lock);
+	mm->cid_saturated = 0;
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
 	cpumask_clear(mm_cidmask(mm));
diff --git a/kernel/fork.c b/kernel/fork.c
index 3832bea713c4..a5233e450435 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users))
-		__mmput(mm);
+		return __mmput(mm);
+
+	mm_cid_desaturate(mm);
 }
 EXPORT_SYMBOL_GPL(mmput);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425766cc1300..d5004d179531 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
 	if (last_mm_cid == -1)
 		return;
 
+	/*
+	 * When nr_threads > nr_cpus, there is no point in moving anything
+	 * around to keep it compact.
+	 */
+	if (mm->cid_saturated) {
+		t->last_mm_cid = -1;
+		return;
+	}
+
 	src_rq = task_rq(t);
 	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
 	src_cid = READ_ONCE(*src_pcpu_cid);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..6c4af2992e79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
 	}
 	raw_spin_lock(&mm->cid_lock);
 	cid = __mm_cid_get_locked(mm);
+	if (cid == nr_cpu_ids - 1)
+		mm->cid_saturated = 1;
 	raw_spin_unlock(&mm->cid_lock);
 	WRITE_ONCE(*pcpu_cid, cid);
 	return cid;