Re: [PATCH 3/3] sched/fair: schedutil: explicit update only when required
From: Joel Fernandes
Date: Sun May 13 2018 - 02:25:48 EST
On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > cfs_rq's utilization is updated via cfs_rq_util_change(), which is
> > currently called by update_cfs_rq_load_avg() when the utilization of a
> > cfs_rq has changed, and by {attach,detach}_entity_load_avg().
> >
> > This design is based on the idea that "we should callback schedutil
> > frequently enough" to properly update the CPU frequency at every
> > utilization change. However, such an integration strategy also has
> > some downsides:
>
> Hi Patrick,
>
> I agree that making the call explicit would make the schedutil integration
> easier, so that's really awesome. However, I also fear that if some path in
> the fair class later changes the utilization but forgets to update schedutil
> explicitly (because it doesn't call the public API), then the schedutil
> update wouldn't go through. In that case, the previous design of doing the
> schedutil update in the wrapper was a nice safety net to have.
>
> Just thinking out loud, but is there a way you could still make the implicit
> call in case the explicit call wasn't issued for some reason? That's probably
> hard to do correctly, though.
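
To make that thought a bit more concrete, here is a rough and completely
untested sketch of the kind of catch-all I was imagining; the util_changed
field and the note_util_changed()/sugov_catchall() helpers below are made up
for illustration, only cpufreq_update_util() is the existing API:

/*
 * Purely illustrative: paths that update a root cfs_rq's utilization
 * but do not issue an explicit schedutil update could just mark the
 * rq as dirty...
 */
static inline void note_util_changed(struct rq *rq)
{
	rq->util_changed = true;	/* hypothetical per-rq flag */
}

/*
 * ... and a periodic catch-all (e.g. run from task_tick_fair()) would
 * flush any pending update that was never made explicit.
 */
static inline void sugov_catchall(struct rq *rq)
{
	if (rq->util_changed) {
		cpufreq_update_util(rq, 0);
		rq->util_changed = false;
	}
}

Of course this only papers over a missed explicit call rather than fixing it,
so it might be more useful as a debug warning than as a real fallback.
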
>
> Some more comments below:
>
> >
> > - schedutil updates are triggered by RQ's load updates, which makes
> > sense in general, but it does not make it possible to know exactly which
> > other RQ-related information has been updated.
> > Recently, for example, we had issues due to schedutil dependencies on
> > cfs_rq->h_nr_running and estimated utilization updates.
> >
> > - cfs_rq_util_change() is mainly a wrapper function for an already
> > existing "public API", cpufreq_update_util(), which is required
> > just to ensure we actually update schedutil only when we are updating
> > a root cfs_rq.
> > Thus, especially when task groups are in use, most of the calls to
> > this wrapper function are not required.
> >
> > - the usage of a wrapper function is not completely consistent across
> > fair.c, since we could still need additional explicit calls to
> > cpufreq_update_util().
> > For example, this already happens when reporting the IOWAIT boost flag
> > in the wakeup path.
> >
> > - it makes it hard to integrate new features, since it could require
> > changing other function prototypes just to pass in an additional flag,
> > as happened for example in commit:
> >
> > ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
> >
> > All the above considered, let's make schedutil updates more explicit in
> > fair.c by removing the cfs_rq_util_change() wrapper function in favour
> > of the existing cpufreq_update_util() public API.
> > This can be done by calling cpufreq_update_util() explicitly in the few
> > call sites where it really makes sense and where all the (potentially)
> > required cfs_rq information has been updated.
> >
> > This patch mainly removes code and adds explicit schedutil updates
> > only when we:
> > - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> > - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> > - task_tick_fair() to update the utilization of the root cfs_rq
> >
> > All the other code paths, currently _indirectly_ covered by a call to
> > update_load_avg(), are still covered. Indeed, some paths already imply
> > enqueue/dequeue calls:
> > - switch_{to,from}_fair()
> > - sched_move_task()
> > while others are followed by enqueue/dequeue calls:
> > - cpu_cgroup_fork() and
> > post_init_entity_util_avg():
> > are used at wakeup_new_task() time and thus already followed by an
> > enqueue_task_fair()
> > - migrate_task_rq_fair():
> > updates the removed utilization but not the actual cfs_rq
> > utilization, which is updated by a following sched event
> >
> > This new proposal also allows us to better aggregate the schedutil-related
> > flags, which are required only at enqueue_task_fair() time.
> > The IOWAIT and MIGRATION flags are now requested only when a task is
> > actually visible at the root cfs_rq level.
> >
> > Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > Cc: Viresh Kumar <viresh.kumar@xxxxxxxxxx>
> > Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
> > Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Cc: linux-pm@xxxxxxxxxxxxxxx
> >
> > ---
> >
> > NOTE: this patch changes the behavior of the IOWAIT flag: in case of a
> > task waking up on a throttled RQ we do not assert the flag to schedutil
> > anymore. However, this seems to make sense since the task will not be
> > running anyway.
> > ---
> > kernel/sched/fair.c | 81 ++++++++++++++++++++++++-----------------------------
> > 1 file changed, 36 insertions(+), 45 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 01dfc47541e6..87f092151a6e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> > * For !fair tasks do:
> > *
> > update_cfs_rq_load_avg(now, cfs_rq);
> > - attach_entity_load_avg(cfs_rq, se, 0);
> > + attach_entity_load_avg(cfs_rq, se);
> > switched_from_fair(rq, p);
> > *
> > * such that the next switched_to_fair() has the
> > @@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> > }
> > #endif /* CONFIG_FAIR_GROUP_SCHED */
> >
> > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> > -{
> > - struct rq *rq = rq_of(cfs_rq);
> > -
> > - if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
> > - /*
> > - * There are a few boundary cases this might miss but it should
> > - * get called often enough that that should (hopefully) not be
> > - * a real problem.
> > - *
> > - * It will not get called when we go idle, because the idle
> > - * thread is a different class (!fair), nor will the utilization
> > - * number include things like RT tasks.
> > - *
> > - * As is, the util number is not freq-invariant (we'd have to
> > - * implement arch_scale_freq_capacity() for that).
> > - *
> > - * See cpu_util().
> > - */
> > - cpufreq_update_util(rq, flags);
> > - }
> > -}
> > -
> > #ifdef CONFIG_SMP
> > /*
> > * Approximate:
> > @@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > cfs_rq->load_last_update_time_copy = sa->last_update_time;
> > #endif
> >
> > - if (decayed)
> > - cfs_rq_util_change(cfs_rq, 0);
> > -
> > return decayed;
> > }
> >
> > @@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > * Must call update_cfs_rq_load_avg() before this, since we rely on
> > * cfs_rq->avg.last_update_time being current.
> > */
> > -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
> >
> > @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, flags);
> > }
> >
> > /**
> > @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, 0);
> > }
> >
> > /*
> > @@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > *
> > * IOW we're enqueueing a task on a new CPU.
> > */
> > - attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> > + attach_entity_load_avg(cfs_rq, se);
> > update_tg_load_avg(cfs_rq, 0);
> >
> > } else if (decayed && (flags & UPDATE_TG))
> > @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> > {
> > - cfs_rq_util_change(cfs_rq, 0);
>
> How about killing that extra line by doing:
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq,
>                                    struct sched_entity *se, int not_used1) {}
>
> > }
> >
> > static inline void remove_entity_load_avg(struct sched_entity *se) {}
> >
> > static inline void
> > -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
> > +attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> > static inline void
> > detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> >
> > @@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> > dequeue = 0;
> > }
> >
> > - if (!se)
> > + /* The tasks are no more visible from the root cfs_rq */
> > + if (!se) {
> > sub_nr_running(rq, task_delta);
> > + cpufreq_update_util(rq, 0);
> > + }
> >
> > cfs_rq->throttled = 1;
> > cfs_rq->throttled_clock = rq_clock(rq);
> > @@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > break;
> > }
> >
> > - if (!se)
> > + /* The tasks are now visible from the root cfs_rq */
> > + if (!se) {
> > add_nr_running(rq, task_delta);
> > + cpufreq_update_util(rq, 0);
> > + }
> >
> > /* Determine whether we need to wake up potentially idle CPU: */
> > if (rq->curr == rq->idle && rq->cfs.nr_running)
> > @@ -5359,14 +5336,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > /* Estimated utilization must be updated before schedutil */
> > util_est_enqueue(&rq->cfs, p);
> >
> > - /*
> > - * If in_iowait is set, the code below may not trigger any cpufreq
> > - * utilization updates, so do it here explicitly with the IOWAIT flag
> > - * passed.
> > - */
> > - if (p->in_iowait)
> > - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> > for_each_sched_entity(se) {
> > if (se->on_rq)
> > break;
> > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > - if (!se)
> > + /* The task is visible from the root cfs_rq */
> > + if (!se) {
> > + unsigned int flags = 0;
> > +
> > add_nr_running(rq, 1);
> >
> > + if (p->in_iowait)
> > + flags |= SCHED_CPUFREQ_IOWAIT;
> > +
> > + /*
> > + * !last_update_time means we've passed through
> > + * migrate_task_rq_fair() indicating we migrated.
> > + *
> > + * IOW we're enqueueing a task on a new CPU.
> > + */
> > + if (!p->se.avg.last_update_time)
> > + flags |= SCHED_CPUFREQ_MIGRATION;
> > +
> > + cpufreq_update_util(rq, flags);
> > + }
> > +
> > hrtick_update(rq);
> > }
> >
> > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > + /* The task is no more visible from the root cfs_rq */
> > if (!se)
> > sub_nr_running(rq, 1);
> >
> > util_est_dequeue(&rq->cfs, p, task_sleep);
> > + cpufreq_update_util(rq, 0);
>
> One question about this change. In enqueue, throttle and unthrottle you are
> calling cpufreq_update_util conditionally, depending on whether the task is
> visible/not visible in the hierarchy.
>
> But in dequeue you're calling it unconditionally, which seems a bit
> inconsistent. Is this because of util_est or something? Could you add a
> comment here explaining why this is so?
The big question I have is: if se != NULL, then the task is still visible at
the root RQ level. In that case, should we still call util_est_dequeue() and
cpufreq_update_util()?
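
Just to show what I mean by keeping it symmetric with the enqueue side, this
is roughly what I would have expected in dequeue_task_fair() (untested, only
meant to illustrate the question, mirroring the enqueue/throttle hunks above):

	/* The task is no longer visible from the root cfs_rq */
	if (!se) {
		sub_nr_running(rq, 1);
		cpufreq_update_util(rq, 0);
	}

	util_est_dequeue(&rq->cfs, p, task_sleep);
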
Sorry if I missed something obvious.
thanks!
- Joel