On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:
Here's where I'm in doubt, is a compact map more desirable than
reusing
the same mm_cids for cache locality?
This is a tradeoff between:
A) Preserving cache locality after a transition from many threads to
few
threads, or after reducing the hamming weight of the allowed cpu
mask.
B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask
easy
to document and understand.
C) Allowing applications to eventually react to mm_cid compaction
after
reduction of the nr threads or allowed cpu mask, making the
tracking
of mm_cid compaction easier by shrinking it back towards 0 or
not.
D) Making sure applications that periodically reduce and then
increase
again the nr threads or allowed cpu mask still benefit from good
cache locality with mm_cid.
If not, should we perhaps ignore the recent_cid if it's larger than
the
map weight?
It seems the only way the recent_cid is unset is with migrations,
but
I'm not sure if forcing one would make the test vain as the cid
could
be dropped outside of task_mm_cid_work.
What do you think?
Can you try this patch ? (compile-tested only)
commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Wed Dec 11 11:53:01 2024 -0500
sched: shrink mm_cid allocation with nr thread/affinity
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct
task_struct *t, struct mm_struct *mm)
{
struct cpumask *cidmask = mm_cidmask(mm);
struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
- int cid = __this_cpu_read(pcpu_cid->recent_cid);
+ int cid, max_nr_cid, allowed_max_nr_cid;
+ /*
+ * After shrinking the number of threads or reducing the number
+ * of allowed cpus, reduce the value of max_nr_cid so expansion
+ * of cid allocation will preserve cache locality if the number
+ * of threads or allowed cpus increase again.
+ */
+ max_nr_cid = atomic_read(&mm->max_nr_cid);
+ while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm-
nr_cpus_allowed), atomic_read(&mm->mm_users))),+ max_nr_cid > allowed_max_nr_cid) {
+ if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid,
allowed_max_nr_cid))
+ break;
+ }
/* Try to re-use recent cid. This improves cache locality. */
- if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid,
cidmask))
+ cid = __this_cpu_read(pcpu_cid->recent_cid);
+ if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+ !cpumask_test_and_set_cpu(cid, cidmask))
return cid;
/*
* Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct
task_struct *t, struct mm_struct *mm)
* and number of threads. Expanding cid allocation as much as
* possible improves cache locality.
*/
- cid = atomic_read(&mm->max_nr_cid);
- while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid <
atomic_read(&mm->mm_users)) {
- if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+ while (max_nr_cid < allowed_max_nr_cid) {
+ if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid +
1))
continue;
- if (!cpumask_test_and_set_cpu(cid, cidmask))
- return cid;
+ if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+ return max_nr_cid;
}
/*
* Find the first available concurrency id.
Thanks for the patch, it seems much more robust than my simple
condition on the weight. It passes the test (both versions) we
previously discussed and doesn't seem to interfere with the general
rseq functionality as checked by the other selftests.
I'm not sure if I should run more tests on this one.
I will come up with a V2 shortly and attach some performance
evaluations.
Do you want to keep your patch separate or do I submit it together with
V2?
Thanks,
Gabriele