Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Mathieu Desnoyers
Date: Tue Sep 03 2024 - 19:23:07 EST


On 2024-09-03 15:59, Yury Norov wrote:
On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
[...]
+
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
+{
+ struct cpumask *mm_allowed = mm_cpus_allowed(mm);
+ int cpu, nr_set = 0;
+
+ if (!mm)
+ return;
+ /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
+ for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
+ if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
+ nr_set++;
+ }

You can do the same nicer:

for_each_cpu(cpu, cpumask)
nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);

This should be faster and a bit simpler, to me.

In this scenario, I expect the following per-thread cpumask properties for a given process (typically): those will be typically the same bits
set repeated over all threads belonging to a process. There are of
course scenarios where specific threads will override the mask, but
I don't expect this to be the most frequent case.

So we typically have an operation which initially copies the initial
thread's allowed cpus mask to the mm allowed cpus mask, and then when
additional affinity changes are done, we want to augment the mm allowed
cpus masks with any additional cpu that may show up. But again, I expect
the initial thread to typically have the complete mask and other
operations won't typically change the mm allowed cpumask bits.

I also expect the cpumask to be often quite dense (often all bits
are set).

Now if we look at the operations for your proposal here:

- for_each_cpu loads cpumask word-by-word and for each set bit, it
issues cpumask_test_and_set_cpu on mm_allowed, which is really a
test_and_set_bit, a fully ordered atomic operation, on each _bit_
set. That's O(nr_cpus) fully ordered atomic operations, and thus
expensive exclusive cache line accesses.

My approach does:

- The equivalent of a for_each_cpu_andnot (actually I should use
exactly that! I just noticed it exists in the API.), which loads
both thread and mm CPUs allowed masks in parallel, word-by-word,
and only issues a cpumask_test_and_set_cpu for CPUs which are set
in the per-thread mask, but not in the mm mask. In the typical cases
discussed above, we pretty much never need to issue the atomic
test-and-set. So all we need to do for the common case is to read
both cpu masks in parallel, no stores/atomic ops needed.

What concerns me is that you call atomic function in a loop, which makes
the whole procedure non-atomic. If it's OK, can you put a comment why a
series of atomic ops is OK here? If not - I believe external locking
would be needed.

Good point, so the whole mm CPUs allowed masks tracks the allowed set,
and based on the result of the test_and_set it accumulates the number of
bits set and atomically add the total to nr_allowed_cpus. The mm_cid
algorithms only care about mm nr_allowed_cpus, so those don't even need
to look at the mm CPUs allowed mask, therefore there is no need to
provide any ordering guarantees across the two data structures.

If we'd have a cpumask_test_and_set_cpu_relaxed() it would be sufficient
here.

As you say, I should explain this in a comment.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com