Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort

From: Vincent Guittot
Date: Thu Oct 28 2021 - 08:15:37 EST


On Wednesday 27 Oct 2021 at 13:53:32 (-0700), Tim Chen wrote:
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> >
> > > Looking at the profile on update_blocked_averages a bit more,
> > > the majority of the calls to update_blocked_averages
> > > happen in run_rebalance_domains. And we are not
> > > including that cost of update_blocked_averages for
> > > run_rebalance_domains in our current patch set. I think
> > > the patch set should account for that too.
> >
> > nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> > trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> > It would probably be better to take the cost of
> > update_blocked_averages into account instead.
> >
>
> For the case where
>
> this_rq->avg_idle < sysctl_sched_migration_cost
>
> in newidle_balance(), we skip to the out: label
>
> out:
> 	/* Move the next balance forward */
> 	if (time_after(this_rq->next_balance, next_balance))
> 		this_rq->next_balance = next_balance;
>
> 	if (pulled_task)
> 		this_rq->idle_stamp = 0;
> 	else
> 		nohz_newidle_balance(this_rq);
>
> and we call nohz_newidle_balance as we don't have a pulled_task.
>
> It seems to make sense to skip the call
> to nohz_newidle_balance() for this case?

nohz_newidle_balance() also tests this condition:
(this_rq->avg_idle < sysctl_sched_migration_cost)
and doesn't set NOHZ_NEWILB_KICK in that case.

But this patch now uses the condition:
this_rq->avg_idle < sd->max_newidle_lb_cost
and sd->max_newidle_lb_cost can be higher than sysctl_sched_migration_cost

which means that we can set NOHZ_NEWILB_KICK:
- although we decided to skip the newidle loop
- or when we abort because this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost

This is even more likely when sysctl_sched_migration_cost is lowered, which is your case IIRC.
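
To make the gap between the two thresholds concrete, here is a small userspace model (not kernel code). The struct and function names are hypothetical and exist only for illustration; the model ignores the other conditions checked in newidle_balance() and nohz_newidle_balance(), and keeps only the two avg_idle comparisons discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not kernel code. All values are in nanoseconds. */
struct idle_model {
	uint64_t avg_idle;            /* models this_rq->avg_idle */
	uint64_t migration_cost;      /* models sysctl_sched_migration_cost */
	uint64_t max_newidle_lb_cost; /* models sd->max_newidle_lb_cost */
};

/* Comparison that makes newidle_balance() skip the newidle loop. */
static bool newidle_skipped(const struct idle_model *m)
{
	return m->avg_idle < m->max_newidle_lb_cost;
}

/* Comparison that makes nohz_newidle_balance() return without kicking. */
static bool nohz_check_bails_out(const struct idle_model *m)
{
	return m->avg_idle < m->migration_cost;
}

/* True when the balance was skipped but the kick could still be set. */
static bool kick_despite_skip(const struct idle_model *m)
{
	return newidle_skipped(m) && !nohz_check_bails_out(m);
}
```

With illustrative values such as migration_cost = 100000 and max_newidle_lb_cost = 500000, any avg_idle between the two (for example 300000) makes kick_despite_skip() return true. Lowering migration_cost widens that window.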

The patch below ensures that we don't set NOHZ_NEWILB_KICK in such cases:

---
kernel/sched/fair.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c19f4bb3df1a..36ddae208959 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10779,7 +10779,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	int pulled_task = 0, early_stop = 0;

 	update_misfit_status(NULL, this_rq);

@@ -10816,8 +10816,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

-		if (sd)
+		if (sd) {
 			update_next_balance(sd, &next_balance);
+
+			/*
+			 * We skip new idle LB because there is not enough
+			 * time before next wake up. Make sure that we will
+			 * not kick NOHZ_NEWILB_KICK
+			 */
+			early_stop = 1;
+		}
 		rcu_read_unlock();

 		goto out;
@@ -10836,8 +10844,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

 		update_next_balance(sd, &next_balance);

-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+			early_stop = 1;
 			break;
+		}

 		if (sd->flags & SD_BALANCE_NEWIDLE) {

@@ -10887,7 +10897,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

 	if (pulled_task)
 		this_rq->idle_stamp = 0;
-	else
+	else if (!early_stop)
 		nohz_newidle_balance(this_rq);

 	rq_repin_lock(this_rq, rf);
--
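
For readers who want to trace the patched control flow outside the kernel, the following is a simplified userspace sketch of it. It is an assumption-laden model, not the real function: the enum, struct domain_model, and model_newidle_balance() are hypothetical names, the domain list is a plain array, locking and RCU are omitted, and "measured_cost" and "pulls_task" stand in for what load_balance() would actually do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the exit path; not a kernel type. */
enum exit_action {
	CLEAR_IDLE_STAMP,          /* a task was pulled */
	CALL_NOHZ_NEWIDLE_BALANCE, /* nohz_newidle_balance() would run */
	SKIP_NOHZ_NEWIDLE_BALANCE  /* early stop: no NOHZ_NEWILB_KICK */
};

/* Hypothetical stand-in for one sched_domain level. */
struct domain_model {
	uint64_t max_newidle_lb_cost; /* models sd->max_newidle_lb_cost */
	uint64_t measured_cost;       /* cost added to curr_cost at this level */
	bool pulls_task;              /* whether balancing here pulls a task */
};

static enum exit_action model_newidle_balance(uint64_t avg_idle, bool overload,
					      const struct domain_model *sd,
					      size_t nr_domains)
{
	uint64_t curr_cost = 0;
	bool pulled_task = false, early_stop = false;
	size_t i;

	/* Up-front skip: as in the patch, early_stop is set whenever sd exists. */
	if (!overload ||
	    (nr_domains > 0 && avg_idle < sd[0].max_newidle_lb_cost)) {
		if (nr_domains > 0)
			early_stop = true;
		goto out;
	}

	for (i = 0; i < nr_domains; i++) {
		/* Abort: not enough idle time left for this level. */
		if (avg_idle < curr_cost + sd[i].max_newidle_lb_cost) {
			early_stop = true;
			break;
		}
		curr_cost += sd[i].measured_cost;
		if (sd[i].pulls_task) {
			pulled_task = true;
			break;
		}
	}

out:
	if (pulled_task)
		return CLEAR_IDLE_STAMP;
	if (!early_stop)
		return CALL_NOHZ_NEWIDLE_BALANCE;
	return SKIP_NOHZ_NEWIDLE_BALANCE;
}
```

In this model, nohz_newidle_balance() is only reached when every domain level was visited without pulling a task and without running out of idle time, which is the behaviour the patch aims for.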

> We expect a very short idle and a task to wake shortly.
> So we do not have to pull a task
> to this idle cpu and incur the migration cost.
>
> Tim