[PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

From: Ankit Jain
Date: Wed Oct 11 2023 - 03:20:42 EST


commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
introduced the logic to distribute tasks at initial wakeup across cpus
where load balancing works poorly or is disabled altogether (isolated cpus).

However, there are cases in which tasks spawned onto isolcpus are not
distributed properly. In production deployments, the initial wakeup of
tasks spawned from housekeeping cpus onto isolcpus [nohz_full cpus]
happens on the first cpu in the isolcpus range instead of being
distributed across the isolcpus.

Use of distribute_cpu_mask_prev by one group of processes clobbers the
value stored by other groups, and vice versa.

When tasks on housekeeping cpus spawn multiple child tasks that are to
wake up on isolcpus [nohz_full cpus], via cpuset.cpus/sched_setaffinity(),
distribution is currently based on the per-cpu
distribute_cpu_mask_prev variable.
At the same time, the housekeeping cpus run per-cpu bound timer
interrupts, RCU threads and other system/user tasks whose affinity is
the housekeeping cpus. In a real-life environment, housekeeping cpus
are few and heavily loaded.
So the distribute_cpu_mask_prev value left behind by these tasks
determines the starting offset for the tasks being spawned onto isolcpus,
and thus most of those tasks end up waking up on the first cpu within
the isolcpus set.

Steps to reproduce:
Kernel cmdline parameters:
isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1

* pid=$(echo $$)
* taskset -pc 0 $pid
* cat loop-normal.c
int main(void)
{
	while (1)
		;
	return 0;
}
* gcc -o loop-normal loop-normal.c
* for i in {1..50}; do ./loop-normal & done
* pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
* for i in $pids; do taskset -pc 2-5 $i ; done

Expected output:
* All 50 "loop-normal" tasks should wake up on cpu2-5,
equally distributed.
* ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"

Actual output:
* All 50 "loop-normal" tasks were woken up on cpu2 only

Analysis:
Per-cpu bound timer interrupt and RCU thread activity occurs every few
microseconds on the housekeeping cpus, exercising
find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute().
So the per-cpu variable distribute_cpu_mask_prev on the housekeeping cpus
keeps getting set to a housekeeping cpu. Bash/docker processes
share the same per-cpu variable because they run on the housekeeping cpus.
Thus the search that starts from the clobbered distribute_cpu_mask_prev
in the new mask (isolcpus) always returns the first cpu within that mask,
in accordance with the logic introduced by the commits above.
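
The clobbering described above can be illustrated with a small userspace
model. This is only a sketch, not the kernel code: NR_CPUS, prev_cpu,
next_bit_wrap(), any_distribute() and simulate() are hypothetical names,
plain bitmasks stand in for struct cpumask, and one global stands in for
the distribute_cpu_mask_prev of a single housekeeping cpu.

```c
#define NR_CPUS 8	/* assumed CPU count for this model */

#define HK_MASK  0x03u	/* cpus 0-1, housekeeping (assumed layout) */
#define ISO_MASK 0x3cu	/* cpus 2-5, isolcpus (as in the reproducer) */

/* Stand-in for distribute_cpu_mask_prev of one housekeeping cpu. */
static int prev_cpu;

/* First set bit in mask at or after start, wrapping once; NR_CPUS if none. */
static int next_bit_wrap(unsigned int mask, int start)
{
	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (start + i) % NR_CPUS;

		if (mask & (1u << cpu))
			return cpu;
	}
	return NR_CPUS;
}

/* Model of the old behaviour: continue the search after the shared prev. */
int any_distribute(unsigned int mask)
{
	int next = next_bit_wrap(mask, prev_cpu + 1);

	if (next < NR_CPUS)
		prev_cpu = next;
	return next;
}

/*
 * Record n picks for the isolcpus mask into out[]. When interleave is
 * non-zero, a housekeeping pick runs before every isolcpus pick, as the
 * timer/RCU activity would on a busy housekeeping cpu.
 */
void simulate(int interleave, int *out, int n)
{
	prev_cpu = 0;
	for (int i = 0; i < n; i++) {
		if (interleave)
			any_distribute(HK_MASK);
		out[i] = any_distribute(ISO_MASK);
	}
}
```

In this model, simulate(1, out, 4) records cpu2 for every pick, while
simulate(0, out, 4) walks cpu2, cpu3, cpu4, cpu5 as originally intended.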

Fix the issue by using random cores out of the applicable CPU set
instead of relying on distribute_cpu_mask_prev.
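
The selection scheme can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the patched kernel code:
NR_CPUS, mask_weight(), nth_set_bit() and any_random() are hypothetical
names, a plain bitmask stands in for struct cpumask, and rand() stands in
for get_random_u32_below().

```c
#include <stdlib.h>

#define NR_CPUS 8	/* assumed CPU count for this sketch */

/* Count the set bits in mask (stand-in for cpumask_weight()). */
static unsigned int mask_weight(unsigned int mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Return the n-th (0-based) set bit of mask, or NR_CPUS if there is none. */
unsigned int nth_set_bit(unsigned int mask, unsigned int n)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if ((mask & (1u << cpu)) && n-- == 0)
			return cpu;
	}
	return NR_CPUS;
}

/* Pick a random cpu from mask; NR_CPUS if the mask is empty. */
unsigned int any_random(unsigned int mask)
{
	unsigned int n_cpus = mask_weight(mask);

	if (n_cpus == 0)
		return NR_CPUS;
	/* rand() % n has a slight modulo bias; acceptable for a sketch. */
	return nth_set_bit(mask, (unsigned int)rand() % n_cpus);
}
```

Because no state is kept between calls, a pick for the isolcpus mask
(0x3c here) cannot be influenced by earlier picks made for the
housekeeping cpus.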

Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")

Signed-off-by: Ankit Jain <ankitja@xxxxxxxxxx>
---
lib/cpumask.c | 40 +++++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index a7fd02b5ae26..95a7c1b40e95 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
EXPORT_SYMBOL(cpumask_local_spread);

-static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
-
/**
* cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
* @src1p: first &cpumask for intersection
* @src2p: second &cpumask for intersection
*
- * Iterated calls using the same srcp1 and srcp2 will be distributed within
- * their intersection.
+ * Iterated calls using the same src1p and src2p will be randomly distributed
+ * within their intersection.
*
* Returns >= nr_cpu_ids if the intersection is empty.
*/
unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
const struct cpumask *src2p)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;

- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
+ n_cpus = cpumask_weight_and(src1p, src2p);
+ if (n_cpus == 0)
+ return nr_cpu_ids;

- next = find_next_and_bit_wrap(cpumask_bits(src1p), cpumask_bits(src2p),
- nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ nth_cpu = get_random_u32_below(n_cpus);

- return next;
+ return find_nth_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_and_distribute);

+/**
+ * cpumask_any_distribute - Return an arbitrary cpu within srcp.
+ *
+ * Iterated calls using the same srcp will be randomly distributed.
+ */
unsigned int cpumask_any_distribute(const struct cpumask *srcp)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;

- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
- next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ n_cpus = cpumask_weight(srcp);
+ if (n_cpus == 0)
+ return nr_cpu_ids;

- return next;
+ nth_cpu = get_random_u32_below(n_cpus);
+
+ return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_distribute);
--
2.23.1