Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Mon Jan 29 2024 - 17:53:57 EST


On 01/28/24 23:50, Qais Yousef wrote:
> On 01/26/24 15:08, Vincent Guittot wrote:
>
> > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > balance_interval and this reproducer, which is not the same as the original
> > > one and might be exposing another problem, and I didn't think twice about it.
> >
> > I checked the behavior more deeply and I confirm that I don't see an
> > improvement for the use case described above. I would say that it's
> > even worse, as I can see some runs where the task stays on the little
> > CPU whereas a big core has been added to the affinity. Keeping in mind
> > that my system is pretty idle, which means that there is almost no
> > other reason to trigger an ilb than the misfit task, the change in
> > check_misfit_status() is probably the reason an ilb is never kicked
> > for such a case.
>
> It seems I reproduced another problem while trying to reproduce the original
> issue, eh.
>
> I did dig more, and from what I see the issue is that rd->overload is not
> being set correctly, which I believe is what causes the delays (see the
> attached picture showing how rd.overloaded is 0 with some spikes). Only when
> a newidle_balance() on CPU7 coincides with rd->overload being 1 does the
> migration happen. With the below hack I can see that rd->overload is 1 all
> the time (even after the move, as we still trigger a misfit on the big CPU).
> With my patch, rd->overload is set to 1 (because of this task) only for a
> short period after we change affinity.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df348aa55d3c..86069fe527f9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			continue;
>  		}
>
> -		if (local_group)
> -			continue;
> +		/* if (local_group) */
> +		/* 	continue; */
>
>  		if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
>  			/* Check for a misfit task on the cpu */
>
> I am not sure what the right fix is, but it seems this condition is only
> required for the 2nd leg of the if condition, where we compare with load?
> I don't think we should skip the misfit check.

I'm still not sure I get the original intent of why we skip the local group.
We need to set sg_status, which operates at the root domain level, to enable
a CPU to pull a misfit task.
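
To spell out the connection: sg_status only makes it to the root domain at the
end of update_sd_lb_stats(), something like this (quoting from memory, so
modulo the exact details):

	if (!env->sd->parent) {
		struct root_domain *rd = env->dst_rq->rd;

		/* update overload indicator if we are at root domain */
		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
		...
	}

So anything update_sg_lb_stats() refuses to account in sg_status never makes
it into rd->overload.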

AFAICS newidle_balance() will return without doing anything if rd->overload is
not set. So making sure we always update this flag, and for both legs, is
necessary IIUC.
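
For reference, this is the bail-out I mean at the top of newidle_balance()
(again from memory, trimmed):

	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

		if (sd)
			update_next_balance(sd, &next_balance);

		goto out;
	}

Note that on this path we also feed next_balance, which is the second problem
I describe below.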

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..bd2f402eac41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9707,9 +9707,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 		}
 
-		if (local_group)
-			continue;
-
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
 			/* Check for a misfit task on the cpu */
 			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
@@ -9719,8 +9716,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		} else if ((env->idle != CPU_NOT_IDLE) &&
 			   sched_reduced_capacity(rq, env->sd)) {
 			/* Check for a task running on a CPU with reduced capacity */
-			if (sgs->group_misfit_task_load < load)
+			if (sgs->group_misfit_task_load < load) {
 				sgs->group_misfit_task_load = load;
+				*sg_status |= SG_OVERLOAD;
+			}
 		}
 	}
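
With this, IIUC, rd->overload should now stay set for as long as the misfit
task is stuck on the wrong CPU (matching what I see with the hack above), so
the next newidle_balance() on a big CPU actually gets to attempt the pull.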

I was wondering why we never pull at TICK/rebalance_domains(), where no such
check is made. But when newidle_balance() returns early, it calls
update_next_balance(), which adds balance_interval - already long by then. So
we end up delaying things further, thinking we've 'attempted' a load balance
and it wasn't necessary - but in reality we failed to see the misfit task, and
we don't allow rebalance_domains() to see it either, by continuing to push the
next balance point forward.
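
For completeness, the helper is just this (from memory):

static void
update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
{
	unsigned long interval, next;

	/* used by idle balance, so cpu_busy = 0 */
	interval = get_sd_balance_interval(sd, 0);
	next = sd->last_balance + interval;

	if (time_after(*next_balance, next))
		*next_balance = next;
}

So the earliest we'll look again is last_balance + balance_interval, and
trigger_load_balance() won't raise the softirq before this_rq->next_balance is
due, which means rebalance_domains() never gets a chance to spot the misfit
task in the meantime.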