Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Mathieu Desnoyers
Date: Wed Sep 04 2024 - 11:53:04 EST

Next message: Alexander Lobakin: "[PATCH iwl-next v2 0/6] idpf: XDP chapter II: convert Tx completion to libeth"
Previous message: Maxime Chevallier: "Re: [PATCH net-next v2 3/7] net: ethernet: fs_enet: drop the .adjust_link custom fs_ops"
In reply to: Yury Norov: "Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2024-09-04 11:24, Yury Norov wrote:

On Tue, Sep 03, 2024 at 07:22:37PM -0400, Mathieu Desnoyers wrote:

On 2024-09-03 15:59, Yury Norov wrote:

On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:

[...]

+
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
+{
+ struct cpumask *mm_allowed = mm_cpus_allowed(mm);
+ int cpu, nr_set = 0;
+
+ if (!mm)
+ return;
+ /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
+ for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
+ if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
+ nr_set++;
+ }

You can do the same nicer:

for_each_cpu(cpu, cpumask)
nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);

This should be faster and a bit simpler, to me.

In this scenario, I expect the following per-thread cpumask properties for a
given process (typically): those will be typically the same bits
set repeated over all threads belonging to a process. There are of
course scenarios where specific threads will override the mask, but
I don't expect this to be the most frequent case.

So we typically have an operation which initially copies the initial
thread's allowed cpus mask to the mm allowed cpus mask, and then when
additional affinity changes are done, we want to augment the mm allowed
cpus masks with any additional cpu that may show up. But again, I expect
the initial thread to typically have the complete mask and other
operations won't typically change the mm allowed cpumask bits.

I also expect the cpumask to be often quite dense (often all bits
are set).

Now if we look at the operations for your proposal here:

- for_each_cpu loads cpumask word-by-word and for each set bit, it
issues cpumask_test_and_set_cpu on mm_allowed, which is really a
test_and_set_bit, a fully ordered atomic operation, on each _bit_
set. That's O(nr_cpus) fully ordered atomic operations, and thus
expensive exclusive cache line accesses.

Both versions are O(N).

Yes, those are both theoretically O(N), but the cost of
loading/comparing two words compared to the cost of an atomic
test-and-set of each bit within those words is far from being
the same.

My approach does:

- The equivalent of a for_each_cpu_andnot (actually I should use
exactly that! I just noticed it exists in the API.), which loads

Yes, you should.

both thread and mm CPUs allowed masks in parallel, word-by-word,
and only issues a cpumask_test_and_set_cpu for CPUs which are set
in the per-thread mask, but not in the mm mask. In the typical cases
discussed above, we pretty much never need to issue the atomic
test-and-set. So all we need to do for the common case is to read
both cpu masks in parallel, no stores/atomic ops needed.

This all doesn't look like a hot path. And anyways, speculating around
performance without numbers on hands sounds cheap.

This is done whenever userspace invokes sched_setaffinity, or changes
its cgroup cpuset. It may not be the most important fast-path in the
world, but I expect some workloads to issue sched_setaffinity whenever
they create a thread, so it's not a purely slow-path either.

In my experience, iterators with a very lightweight payload are ~100
times slower comparing to dedicated bitmap ops. Check this for example:
3cea8d4753277.

If you're really cared about performance here, I'd suggest you to
compare your iterators approach with something like this:

cpumask_or(mm_allowed, mm_allowed, cpumask);
atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed);

AFAIU this approach would not work, or then we'd need some kind of
locking to make sure we don't have two concurrent cpumask updates
which happen to do a sequence of atomic_set which move the
nr_cpus_allowed value towards a smaller value due to ordering.

Also, AFAIU cpumask_or is not safe against concurrent updates, so
we'd need locking for that as well.

I'm fine providing performance numbers, but we need to make sure
the alternative we compare against actually works.

A "dedicated" bitmap op that would do what I need would do the
following:

- For each bit set in bitmap A, which are not set in bitmap B,
atomically test-and-set those bits in bitmap B, and return the
number of bits that transitioned from 0 to 1 for weighting
purposes.

In my original attempts I tried using cpumask_weight after altering
the bitmap, but then noticed that it would not work without additional
locking.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Next message: Alexander Lobakin: "[PATCH iwl-next v2 0/6] idpf: XDP chapter II: convert Tx completion to libeth"
Previous message: Maxime Chevallier: "Re: [PATCH net-next v2 3/7] net: ethernet: fs_enet: drop the .adjust_link custom fs_ops"
In reply to: Yury Norov: "Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]