Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

From: Peter Zijlstra
Date: Wed Oct 26 2022 - 04:55:44 EST


On Tue, Oct 25, 2022 at 08:57:24PM -0700, Ricardo Neri wrote:

> Do you want me to add your Signed-off-by and Co-developed-by tags?

Nah; who cares ;-)


> > @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> > if (!busy_cpus)
> > return;
> >
> > - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > - dst_cpu);
> > + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
> >
> > - /*
> > - * The simpest case. The single busy CPU in the current group will
> > - * become idle after pulling its current task. The destination CPU is
> > - * idle.
> > - */
> > - if (busy_cpus == 1) {
> > - sgs->task_class_score_before = class_sgs->sum_score;
> > - sgs->task_class_score_after = score_on_dst_cpu;
> > - return;
> > - }
> > + before = sgcs->sum_score
> > + after = before - sgcs->min_score + score_on_dst_cpu;
>
> This works when the sched group being evaluated has only one busy CPU
> because it will become idle if the destination CPU (which was idle) pulls
> the current task.
>
> >
> > - /*
> > - * Now compute the group score with and without the task with the
> > - * lowest score. We assume that the tasks that remain in the group share
> > - * the CPU resources equally.
> > - */
> > - group_score = class_sgs->sum_score / busy_cpus;
> > -
> > - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > - (busy_cpus - 1);
> > + if (busy_cpus > 1) {
> > + before /= busy_cpus;
> > + after /= busy_cpus;
>
>
> However, I don't think this works when the sched group has more than one
> busy CPU. 'before' and 'after' reflect the total throughput score of both
> the sched group *and* the destination CPU.
>
> One of the CPUs in the sched group will become idle after the balance.
>
> Also, at this point we have already added score_on_dst_cpu. We are incorrectly
> scaling it by the number of busy CPUs in the sched group.
>
> We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.

So none of that makes sense.

'x/n + y' != '(x+y)/(n+1)'

IOW:

> > + }
> >
> > - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > - sgs->task_class_score_before = group_score;
> > + sgs->task_class_score_before = before;
> > + sgs->task_class_score_after = after;

your task_class_score_after is a sum value for 2 cpus worth, not a value
for a single cpu, while your task_class_score_before is a single cpu
average. You can't compare these numbers and have a sensible outcome.

If you have a number of values: x_1....x_n, their average is
Sum(x_1...x_n) / n, which is a single value again.

If you want to update one of the x's, say x_i->x'_i, then the average
changes like:

Sum(x_1...x_n-x_i+x'_i) / n

If you want to remove one of the x's, then you get:

Sum(x_1...x_n-x_i) / (n-1) ; 0<i<n

if you want to add an x:

Sum(x_1...x_n+i_i) / (n+1) ; i>n

Nowhere would you ever get:

Sum(x_1...x_n) / n + x_i

That's just straight up nonsense.

So I might buy an argument for:

if (busy_cpus > 1) {
before /= (busy_cpus-1);
after /= (busy_cpus+1);
}

Or something along those lines (where you remove an entry from before
and add it to after), but not this.