Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask
From: Yury Norov
Date: Wed Oct 11 2023 - 20:18:37 EST
On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced logic to distribute tasks at initial wakeup on cpus
> where load balancing works poorly or is disabled entirely (isolated cpus).
>
> There are cases in which tasks spawned onto isolcpus
> are not distributed properly.
> In production deployments, the initial wakeup of tasks spawned from
> housekeeping cpus onto isolcpus[nohz_full cpu] happens on the first cpu
> within the isolcpus range instead of being distributed across isolcpus.
>
> Use of distribute_cpu_mask_prev by one process group
> clobbers the value left by other groups, and vice versa.
>
> When housekeeping cpus spawn multiple child tasks to wake up on
> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
> distribution is currently based on the per-cpu
> distribute_cpu_mask_prev counter.
> At the same time, the housekeeping cpus also run per-cpu bound
> timer interrupts, rcu threads and other system/user tasks whose
> affinity is the housekeeping cpus. In a real-life environment,
> housekeeping cpus are far fewer and heavily loaded.
> So the distribute_cpu_mask_prev value left behind by these tasks
> affects the offset used for tasks spawned to wake up on isolcpus,
> and thus most of those tasks end up waking up on the first cpu
> within the isolcpus set.
>
> Steps to reproduce:
> Kernel cmdline parameters:
> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>
> * pid=$(echo $$)
> * taskset -pc 0 $pid
> * cat loop-normal.c
> int main(void)
> {
> 	while (1)
> 		;
> 	return 0;
> }
> * gcc -o loop-normal loop-normal.c
> * for i in {1..50}; do ./loop-normal & done
> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
> * for i in $pids; do taskset -pc 2-5 $i ; done
>
> Expected output:
> * All 50 “loop-normal” tasks should wake up on cpu2-5
> equally distributed.
> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>
> Actual output:
> * All 50 “loop-normal” tasks got woken up on cpu2 only
>
> Analysis:
> Per-cpu bound timer interrupts and rcu threads run every few
> microseconds on the housekeeping cpus, exercising
> find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute().
> So the per-cpu variable distribute_cpu_mask_prev on the housekeeping cpus
> keeps getting set to a housekeeping cpu. Bash/docker processes
> share the same per-cpu variable because they run on housekeeping cpus.
> Thus combining the clobbered distribute_cpu_mask_prev with the
> new mask (isolcpus) always returns the first cpu within the new mask,
> in accordance with the logic of the commits mentioned above.
>
> Fix the issue by using random cores out of the applicable CPU set
> instead of relying on distribute_cpu_mask_prev.
>
> Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>
> Signed-off-by: Ankit Jain <ankitja@xxxxxxxxxx>
> ---
> lib/cpumask.c | 40 +++++++++++++++++++++-------------------
> 1 file changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index a7fd02b5ae26..95a7c1b40e95 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> -static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> -
> /**
> * cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
> * @src1p: first &cpumask for intersection
> * @src2p: second &cpumask for intersection
> *
> - * Iterated calls using the same srcp1 and srcp2 will be distributed within
> - * their intersection.
> + * Iterated calls using the same srcp1 and srcp2 will be randomly distributed
> + * within their intersection.
> *
> * Returns >= nr_cpu_ids if the intersection is empty.
> */
This was discussed a while ago, and the bottom line is that 'any'
is not the same as 'random'. In practice, it means that getting 'any'
cpu is cheaper than getting a randomized one.

I'm not that deep in the context of the problem you're trying to solve, but
if you need a randomized CPU, can you just add a new function for it?
Something like cpumask_get_random().

The approach with find_nth_bit() itself looks good to me.
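
For illustration only, here is a userspace sketch of what such a separate helper might look like. It is not kernel code. mask_get_random() and nth_set_bit() are hypothetical stand-ins for cpumask_get_random() and find_nth_bit(), the mask is assumed to fit in a single 64-bit word, and rand() stands in for get_random_u32_below().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 64	/* assumption: the whole mask fits in one 64-bit word */

/* Number of set bits in mask (portable stand-in for cpumask_weight()). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int weight = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit each pass */
		weight++;
	return weight;
}

/* Index of the n-th (0-based) set bit in mask, or NR_CPUS if there is none. */
static unsigned int nth_set_bit(uint64_t mask, unsigned int n)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mask & (UINT64_C(1) << cpu)))
			continue;
		if (n == 0)
			return cpu;
		n--;
	}
	return NR_CPUS;
}

/*
 * Hypothetical helper: return a pseudo-random set bit of mask, or NR_CPUS
 * if the mask is empty. rand() % weight is slightly biased; it is used
 * here only to keep the sketch short.
 */
static unsigned int mask_get_random(uint64_t mask)
{
	unsigned int weight = mask_weight(mask);
	unsigned int cpu;

	if (weight == 0)
		return NR_CPUS;

	cpu = nth_set_bit(mask, (unsigned int)rand() % weight);
	assert(cpu < NR_CPUS);	/* weight > 0 guarantees a hit */
	return cpu;
}
```

With a mask of 0x3c (cpus 2-5), mask_get_random() returns one of 2, 3, 4 or 5 on each call, while the existing 'any' helpers stay untouched and keep their cheaper behaviour.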
Thanks,
Yury
> unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> const struct cpumask *src2p)
> {
> - unsigned int next, prev;
> + unsigned int n_cpus, nth_cpu;
>
> - /* NOTE: our first selection will skip 0. */
> - prev = __this_cpu_read(distribute_cpu_mask_prev);
> + n_cpus = cpumask_weight_and(src1p, src2p);
> + if (n_cpus == 0)
> + return nr_cpu_ids;
>
> - next = find_next_and_bit_wrap(cpumask_bits(src1p), cpumask_bits(src2p),
> - nr_cpumask_bits, prev + 1);
> - if (next < nr_cpu_ids)
> - __this_cpu_write(distribute_cpu_mask_prev, next);
> + nth_cpu = get_random_u32_below(n_cpus);
>
> - return next;
> + return find_nth_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> + nr_cpumask_bits, nth_cpu);
> }
> EXPORT_SYMBOL(cpumask_any_and_distribute);
>
> +/**
> + * Returns an arbitrary cpu within srcp.
> + *
> + * Iterated calls using the same srcp will be randomly distributed
> + */
> unsigned int cpumask_any_distribute(const struct cpumask *srcp)
> {
> - unsigned int next, prev;
> + unsigned int n_cpus, nth_cpu;
>
> - /* NOTE: our first selection will skip 0. */
> - prev = __this_cpu_read(distribute_cpu_mask_prev);
> - next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
> - if (next < nr_cpu_ids)
> - __this_cpu_write(distribute_cpu_mask_prev, next);
> + n_cpus = cpumask_weight(srcp);
> + if (n_cpus == 0)
> + return nr_cpu_ids;
>
> - return next;
> + nth_cpu = get_random_u32_below(n_cpus);
> +
> + return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, nth_cpu);
> }
> EXPORT_SYMBOL(cpumask_any_distribute);
> --
> 2.23.1
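
The clobbering described in the Analysis section of the quoted patch can be modelled in userspace. The sketch below is a toy model under stated assumptions, not the kernel implementation: an 8-cpu system, a single prev_cpu variable standing in for one cpu's distribute_cpu_mask_prev, housekeeping cpus 0-1 and isolated cpus 2-5. any_distribute() mimics the pre-patch search for the next set bit after prev, with wrap-around. hits_on_first_isolated() is a hypothetical helper that counts how often the isolated mask resolves to cpu 2.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* assumption: small toy system, cpus 0-7 */

/* Stand-in for one cpu's distribute_cpu_mask_prev. */
static unsigned int prev_cpu;

/*
 * Toy model of the pre-patch cpumask_any_distribute(): return the first
 * set bit after prev_cpu, wrapping around, and remember it.
 */
static unsigned int any_distribute(uint8_t mask)
{
	assert(prev_cpu < NR_CPUS);

	for (unsigned int i = 1; i <= NR_CPUS; i++) {
		unsigned int cpu = (prev_cpu + i) % NR_CPUS;

		if (mask & (1u << cpu)) {
			prev_cpu = cpu;
			return cpu;
		}
	}
	return NR_CPUS;	/* empty mask */
}

/*
 * Pick a cpu from the isolated mask 'rounds' times and count how often
 * cpu 2 is chosen. With interleave != 0, a housekeeping selection runs
 * before every isolated selection and overwrites prev_cpu.
 */
static unsigned int hits_on_first_isolated(int interleave, unsigned int rounds)
{
	const uint8_t housekeeping = 0x03;	/* cpus 0-1 */
	const uint8_t isolated = 0x3c;		/* cpus 2-5 */
	unsigned int hits = 0;

	prev_cpu = 0;
	for (unsigned int i = 0; i < rounds; i++) {
		if (interleave)
			any_distribute(housekeeping);
		if (any_distribute(isolated) == 2)
			hits++;
	}
	return hits;
}
```

In this model, hits_on_first_isolated(0, 8) cycles through cpus 2, 3, 4, 5 twice, so cpu 2 is chosen 2 times out of 8. With housekeeping activity interleaved, hits_on_first_isolated(1, 50) lands on cpu 2 all 50 times, which matches the "Actual output" reported above.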