Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

From: Ricardo Neri
Date: Wed Oct 26 2022 - 23:23:39 EST


On Wed, Oct 26, 2022 at 10:55:11AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 25, 2022 at 08:57:24PM -0700, Ricardo Neri wrote:
>
> > Do you want me to add your Signed-off-by and Co-developed-by tags?
>
> Nah; who cares ;-)
>
>
> > > @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> > > if (!busy_cpus)
> > > return;
> > >
> > > - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > > - dst_cpu);
> > > + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
> > >
> > > - /*
> > > - * The simpest case. The single busy CPU in the current group will
> > > - * become idle after pulling its current task. The destination CPU is
> > > - * idle.
> > > - */
> > > - if (busy_cpus == 1) {
> > > - sgs->task_class_score_before = class_sgs->sum_score;
> > > - sgs->task_class_score_after = score_on_dst_cpu;
> > > - return;
> > > - }
> > > + before = sgcs->sum_score
> > > + after = before - sgcs->min_score + score_on_dst_cpu;
> >
> > This works when the sched group being evaluated has only one busy CPU
> > because it will become idle if the destination CPU (which was idle) pulls
> > the current task.
> >
> > >
> > > - /*
> > > - * Now compute the group score with and without the task with the
> > > - * lowest score. We assume that the tasks that remain in the group share
> > > - * the CPU resources equally.
> > > - */
> > > - group_score = class_sgs->sum_score / busy_cpus;
> > > -
> > > - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > > - (busy_cpus - 1);
> > > + if (busy_cpus > 1) {
> > > + before /= busy_cpus;
> > > + after /= busy_cpus;
> >
> >
> > However, I don't think this works when the sched group has more than one
> > busy CPU. 'before' and 'after' reflect the total throughput score of both
> > the sched group *and* the destination CPU.
> >
> > One of the CPUs in the sched group will become idle after the balance.
> >
> > Also, at this point we have already added score_on_dst_cpu. We are incorrectly
> > scaling it by the number of busy CPUs in the sched group.
> >
> > We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.
>
> So none of that makes sense.
>
> 'x/n + y' != '(x+y)/(n+1)'
>
> IOW:
>
> > > + }
> > >
> > > - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > > - sgs->task_class_score_before = group_score;
> > > + sgs->task_class_score_before = before;
> > > + sgs->task_class_score_after = after;
>
> your task_class_score_after is a sum value for 2 cpus worth not a value
> for a single cpu,

Agreed.

> while your task_class_score_before is a single cpu
> average.

Agreed. You can also regard task_class_score_before as a value for 2 CPUs
worth, only that the contribution to throughput of dst_cpu is 0.

> You can't compare these numbers and have a sensible outcome.
>
> If you have a number of values: x_1....x_n, their average is
> Sum(x_1...x_n) / n, which is a single value again.
>
> If you want to update one of the x's, say x_i->x'_i, then the average
> changes like:
>
> Sum(x_1...x_n-x_i+x'_i) / n
>
> If you want to remove one of the x's, then you get:
>
> Sum(x_1...x_n-x_i) / (n-1) ; 0<i<n
>
> if you want to add an x:
>
> Sum(x_1...x_n+i_i) / (n+1) ; i>n
>
> Nowhere would you ever get:
>
> Sum(x_1...x_n) / n + x_i
>
> That's just straight up nonsense.

But we are not computing the average throughput. We are computing the
*total* throughput of two CPUs. Hence, what we need is the sum of the
throughput score of both CPUs.

We may be here because the sched group of sgcs is composed of SMT
siblings. Hence, we divide by busy_cpus assuming that all busy siblings
share the core resources evenly. (For non-SMT sched groups, busy_cpus is 1
at most).

>
> So I might buy an argument for:
>
> if (busy_cpus > 1) {
> before /= (busy_cpus-1);
> after /= (busy_cpus+1);
> }
>
> Or something along those lines (where you remove an entry from before
> and add it to after), but not this.

The entry that we remove from before will go to after. The entry will be
placed in the local group. This group has a different busy_cpus: all of its
SMT siblings, if any, idle (otherwise, we would not be here). It will
become 1 after the balance.