Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

From: Vincent Guittot
Date: Fri Nov 08 2019 - 04:18:42 EST


Hi Doug,

On Wednesday 30 Oct 2019 at 16:27:08 (+0100), Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 15:04, Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> >
> > On 2019.10.29 09:02 Vincent Guittot wrote:
> >
> > > Could you try the patch below? It ensures that at least the root cfs_rq stays
> > > in the list, so each time update_blocked_averages() is called, update_cfs_rq_load_avg()
> > > will be called for the root cfs_rq even if everything has already reached zero.
> > > This ensures that cfs_rq_util_change() is called even if nothing has
> > > changed.
> > >
> > > ---
> > > kernel/sched/fair.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 151c0b7..ac0a549 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7552,6 +7552,8 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
> > >
> > > static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > {
> > > + struct rq *rq = rq_of(cfs_rq);
> > > +
> > > if (cfs_rq->load.weight)
> > > return false;
> > >
> > > @@ -7564,6 +7566,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > if (cfs_rq->avg.runnable_load_sum)
> > > return false;
> > >
> > > + if (cfs_rq == &rq->cfs)
> > > + return false;
> > > +
> > > return true;
> > > }
> > >
> > > --
> > > 2.7.4
> >
> > Yes, this patch works and solves the issue of the long time
> > between calls to the intel_pstate CPU frequency scaling driver.
> > I see you sent a formal patch a few hours ago.
> > I'll try it and report back.
>
> The patch that I sent a few hours ago doesn't solve your problem. It
> solves item 1 of my previous email.
>
> The fact that this hack fixes your problem means that intel_pstate needs
> to be called periodically even if the cfs PELT signals are null, and
> this is probably linked to the internal state machine of the driver.
> The current behavior makes perfect sense for CFS: the cfs signal is
> already null, so we don't need to update the frequency on its account.
> That leaves the rt, dl and irq signals, which might not be null yet
> but which don't trigger a call to cpufreq_update_util(), whereas it
> could be worth calling it.
>
> I have to prepare a patch for this part, which is item 2.
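
To make the gap concrete, here is a toy model (plain userspace C with
made-up names and numbers, not kernel code): once the CFS signal has fully
decayed, nothing notifies cpufreq any more, even though the rt/dl/irq
signals may still be non-zero and decaying.

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: names and numbers are made up, not kernel code. */
struct toy_rq {
        unsigned int cfs_util;  /* already fully decayed */
        unsigned int rt_util;   /* still non-zero, still decaying */
};

/* Halve a signal; return true if it actually changed. */
static bool toy_decay(unsigned int *sig)
{
        if (*sig == 0)
                return false;
        *sig /= 2;
        return true;
}

/* Old behaviour: only a CFS decay ends up notifying cpufreq. */
static void toy_update_blocked(struct toy_rq *rq)
{
        toy_decay(&rq->rt_util);                        /* decays, but ... */
        if (toy_decay(&rq->cfs_util))                   /* ... only CFS decay */
                printf("cpufreq_update_util()\n");      /* reaches the driver */
}

int main(void)
{
        struct toy_rq rq = { .cfs_util = 0, .rt_util = 100 };

        toy_update_blocked(&rq);  /* rt_util drops to 50, driver is never told */
        return 0;
}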

I have finally been able to prepare the patch for item 2. Could you check
whether it also fixes your problem?

Subject: [PATCH] sched/freq: move call to cpufreq_update_util

update_cfs_rq_load_avg() calls cfs_rq_util_change() every time the PELT
signal decays, which can be inefficient when the cpufreq driver has a rate
limit.

When a task is attached to a CPU, we have the following call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change()  --> triggers a frequency update
  attach_entity_load_avg()
    cfs_rq_util_change()  --> triggers a frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task, and the 2nd one might be discarded because of the rate
limit of the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg(), so we can move the call to
{cfs_rq,cpufreq}_util_change() into these two functions. It is also
worth noting that update_load_avg() already calls cfs_rq_util_change()
directly in the !SMP case.

This change also ensures that cpufreq_update_util() is called even
when there is no CFS rq left to update in the leaf_cfs_rq_list and only
the irq, rt or dl PELT signals remain.

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---

We can still get some spurious calls to cpufreq_update_util() in
update_blocked_averages() with this patch, but at least the value will be
up to date in both calls, which was not the case before. If this fixes
Doug's problem, I can prepare an additional patch to remove the spurious
calls, but I wanted to make sure that this one fixes the problem first.
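
For illustration, here is a minimal sketch of the rate-limit behaviour
mentioned above (made-up window and names, not the actual driver code): a
driver that enforces a minimum interval between frequency updates ignores
the 2nd call when it arrives inside that window, so only the 1st, stale,
utilization value is acted upon.

#include <stdbool.h>
#include <stdint.h>

#define TOY_RATE_LIMIT_NS 10000000ULL   /* assumed 10 ms window */

static uint64_t toy_last_update_ns;

/* Return true if enough time has passed to accept a new update. */
static bool toy_should_update_freq(uint64_t now_ns)
{
        if (now_ns - toy_last_update_ns < TOY_RATE_LIMIT_NS)
                return false;   /* 2nd call in the window is discarded */
        toy_last_update_ns = now_ns;
        return true;
}

int main(void)
{
        /* Two updates 1 ms apart: only the first one gets through. */
        bool first  = toy_should_update_freq(20000000ULL);  /* accepted */
        bool second = toy_should_update_freq(21000000ULL);  /* discarded */

        return (first && !second) ? 0 : 1;
}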


kernel/sched/fair.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adc923f..4fd324e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3509,9 +3509,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif

- if (decayed)
- cfs_rq_util_change(cfs_rq, 0);
-
return decayed;
}

@@ -3621,8 +3618,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
update_tg_load_avg(cfs_rq, 0);

- } else if (decayed && (flags & UPDATE_TG))
- update_tg_load_avg(cfs_rq, 0);
+ } else if (decayed) {
+ cfs_rq_util_change(cfs_rq, 0);
+
+ if (flags & UPDATE_TG)
+ update_tg_load_avg(cfs_rq, 0);
+ }
}

#ifndef CONFIG_64BIT
@@ -7441,6 +7442,7 @@ static void update_blocked_averages(int cpu)
const struct sched_class *curr_class;
struct rq_flags rf;
bool done = true;
+ int decayed = 0;

rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
@@ -7450,9 +7452,9 @@ static void update_blocked_averages(int cpu)
* that RT, DL and IRQ signals have been updated before updating CFS.
*/
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
- update_irq_load_avg(rq, 0);
+ decayed |= update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+ decayed |= update_irq_load_avg(rq, 0);

/* Don't need periodic decay once load/util_avg are null */
if (others_have_blocked(rq))
@@ -7486,6 +7488,9 @@ static void update_blocked_averages(int cpu)
}

update_blocked_load_status(rq, !done);
+
+ if (decayed)
+ cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
}

@@ -7542,6 +7547,7 @@ static inline void update_blocked_averages(int cpu)
struct cfs_rq *cfs_rq = &rq->cfs;
const struct sched_class *curr_class;
struct rq_flags rf;
+ int decayed = 0;

rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
@@ -7551,13 +7557,16 @@ static inline void update_blocked_averages(int cpu)
* that RT, DL and IRQ signals have been updated before updating CFS.
*/
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
- update_irq_load_avg(rq, 0);
+ decayed |= update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+ decayed |= update_irq_load_avg(rq, 0);

- update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+ decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);

update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+
+ if (decayed)
+ cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
}

--
2.7.4


>
> Regards,
> Vincent
>
>
> >
> > ... Doug
> >
> >