On Tue, Sep 03, 2024 at 07:22:37PM -0400, Mathieu Desnoyers wrote:
> On 2024-09-03 15:59, Yury Norov wrote:
> > On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
> > [...]
> > > +
> > > +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> > > +{
> > > +	struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> > > +	int cpu, nr_set = 0;
> > > +
> > > +	if (!mm)
> > > +		return;
> > > +	/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
> > > +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
> > > +		if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
> > > +			nr_set++;
> > > +	}
> > You can do the same nicer:
> >
> > 	for_each_cpu(cpu, cpumask)
> > 		nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
> >
> > This should be faster and a bit simpler, to me.
> In this scenario, I expect the following property of the per-thread
> cpumasks for a given process: typically, the same bits will be set
> across all threads belonging to the process. There are of course
> scenarios where specific threads override the mask, but I don't expect
> that to be the most frequent case.
>
> So we typically have an operation which initially copies the initial
> thread's allowed CPUs mask to the mm allowed CPUs mask, and then, when
> additional affinity changes are made, we want to augment the mm allowed
> CPUs mask with any additional CPU that may show up. But again, I expect
> the initial thread to typically have the complete mask, so other
> operations won't typically change the mm allowed cpumask bits.
>
> I also expect the cpumask to often be quite dense (often all bits
> are set).
> Now if we look at the operations for your proposal here:
>
> - for_each_cpu loads the cpumask word by word and, for each set bit,
>   issues cpumask_test_and_set_cpu on mm_allowed, which is really a
>   test_and_set_bit, a fully ordered atomic operation, on each _bit_
>   set. That's O(nr_cpus) fully ordered atomic operations, and thus
>   expensive exclusive cache line accesses.
Both versions are O(N).
> My approach does:
>
> - The equivalent of a for_each_cpu_andnot (actually I should use
>   exactly that! I just noticed it exists in the API.), which loads
Yes, you should.
>   both the thread and mm CPUs allowed masks in parallel, word by word,
>   and only issues a cpumask_test_and_set_cpu for CPUs which are set
>   in the per-thread mask but not in the mm mask. In the typical cases
>   discussed above, we pretty much never need to issue the atomic
>   test-and-set. So all we need to do for the common case is to read
>   both cpumasks in parallel; no stores or atomic ops are needed.
This all doesn't look like a hot path. And anyway, speculating about
performance without numbers in hand sounds cheap.

In my experience, iterators with a very lightweight payload are ~100
times slower compared to dedicated bitmap ops. Check this for example:
3cea8d4753277.

If you really care about performance here, I'd suggest you compare
your iterators approach with something like this:
	cpumask_or(mm_allowed, mm_allowed, cpumask);
	atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));