Re: [PATCH 3/3] sched/fair: schedutil: explicit update only when required

From: Vincent Guittot
Date: Tue May 15 2018 - 06:20:01 EST

On 14 May 2018 at 18:32, Patrick Bellasi <patrick.bellasi@xxxxxxx> wrote:
> On 12-May 23:25, Joel Fernandes wrote:
>> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
>> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
>> > > Schedutil updates for FAIR tasks are triggered implicitly each time a
>> > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
>> > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
>> > > changed, and {attach,detach}_entity_load_avg().
>> > >
>> > > This design is based on the idea that "we should callback schedutil
>> > > frequently enough" to properly update the CPU frequency at every
>> > > utilization change. However, such an integration strategy has also
>> > > some downsides:
>> >
>> > Hi Patrick,
> Hi Joel,


>> > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > >  		update_cfs_group(se);
>> > >  	}
>> > >
>> > > +	/* The task is no more visible from the root cfs_rq */
>> > >  	if (!se)
>> > >  		sub_nr_running(rq, 1);
>> > >
>> > >  	util_est_dequeue(&rq->cfs, p, task_sleep);
>> > > +	cpufreq_update_util(rq, 0);
>> >
>> > One question about this change. In enqueue, throttle and unthrottle - you are
>> > conditionally calling cpufreq_update_util in case the task was
>> > visible/not-visible in the hierarchy.
>> >
>> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
>> > Is this because of util_est or something? Could you add a comment here
>> > explaining why this is so?
>> The big question I have is in case se != NULL, then it's still visible at the
>> root RQ level.
> My understanding is that you get !se at dequeue time when we are
> dequeuing a task from a throttled RQ. Isn't it?

Yes, se becomes NULL only when you reach the root domain

> Thus, this means you are dequeuing a throttled task, I guess for
> example because of a migration.
> However, the point is that a task dequeue from a throttled RQ _is
> already_ not visible from the root RQ, because of the sub_nr_running()
> done by throttle_cfs_rq().
>> In that case should we still call the util_est_dequeue and the
>> cpufreq_update_util?
> I had a better look at the different code paths and I've possibly come
> up with some interesting observations. Let me try to summarize them here.
> First of all, we need to distinguish from estimated utilization
> updates and schedutil updates, since they respond to two very
> different goals.
> .:: Estimated utilization updates
> =================================
> Goal: account for the amount of utilization we are going to
> expect on a CPU
> At {en,de}queue time, util_est_{en,de}queue() is always
> unconditionally called because it tracks the utilization which is
> estimated to be generated by all the RUNNABLE tasks.
> We do not care about throttled/un-throttled RQ here because the effect
> of throttling is already folded into the estimated utilization.
> For example, a 100% task which is placed into a 50% bandwidth
> limited TG will generate a 50% (estimated) utilization. Thus, when the
> task is enqueued we can account immediately for that utilization
> although the RQ can be currently throttled.
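A minimal sketch of the accounting rule described above, written as a toy user-space model and not as kernel code. The types toy_rq and toy_task and both helpers are hypothetical stand-ins for the root cfs_rq, the task and util_est_{en,de}queue(); the only point is that the throttled state is never consulted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the root cfs_rq; not the kernel's struct cfs_rq. */
struct toy_rq {
    unsigned int util_est;  /* sum of the estimates of all RUNNABLE tasks */
    bool throttled;         /* present only to show that it is ignored */
};

/* Toy stand-in for a task; util_est is on the usual 0..1024 scale. */
struct toy_task {
    unsigned int util_est;
};

/* Account the task's estimate unconditionally, throttled RQ or not. */
static void toy_util_est_enqueue(struct toy_rq *rq, const struct toy_task *p)
{
    rq->util_est += p->util_est;
}

/* Remove the task's estimate unconditionally. */
static void toy_util_est_dequeue(struct toy_rq *rq, const struct toy_task *p)
{
    assert(rq->util_est >= p->util_est);
    rq->util_est -= p->util_est;
}
```

In this model a 100% task confined to a 50% bandwidth group would carry an estimate of about 512, and that value is added at enqueue even while the RQ is throttled.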
> .:: Schedutil updates
> =====================
> Goal: select a better frequency, if and _when_ required
> At enqueue time, if the task is visible at the root RQ then it's
> expected to run within a scheduler latency period. Thus, it makes
> sense to call schedutil immediately to account for its estimated
> utilization and possibly increase the OPP.
> If instead the task is enqueued into a throttled RQ, then I'm
> skipping the update since the task will not run until the RQ is
> actually un-throttled.
> HOWEVER, I would say that in general we could skip this last
> optimization and always unconditionally update schedutil at enqueue
> time considering the fact that the effects of a throttled RQ are
> always reflected into the (estimated) utilization of a task.

I think so too

> At dequeue time instead, since we certainly removed some estimated
> utilization, then I unconditionally updated schedutil.
> HOWEVER, I was not considering these two things:
> 1. for a task going to sleep, we still have its blocked utilization
> accounted in the cfs_rq utilization.

It might be still interesting to reduce the frequency because the
blocked utilization can be lower than its estimated utilization.

> 2. for a task being migrated, at dequeue time we still have not
> removed the task's utilization from the cfs_rq's utilization.
> This usually happens later, for example we can have:
>   move_queued_task()
>     dequeue_task()                  --> CFS task dequeued
>     set_task_cpu()                  --> schedutil updated
>       migrate_task_rq_fair()
>         detach_entity_cfs_rq()
>           detach_entity_load_avg() --> CFS util removal
>     enqueue_task()
> Moreover, the "CFS util removal" actually affects the cfs_rq only if
> we hold the RQ lock, otherwise we know that it's just back annotated
> as "removed" utilization and the actual cfs_rq utilization is fixed up
> at the next chance we have the RQ lock.
> Thus, I would say that in both cases, at dequeue time it does not make
> sense to update schedutil since we always see the task's utilization
> in the cfs_rq and thus we will not reduce the frequency.

Yes, only attach/detach make sense from a utilization pov and that's
where we should check for a frequency update for utilization

> NOTE, this is true independently from the refactoring I'm proposing.
> At dequeue time, although we call update_load_avg() on the root RQ,
> it does not make sense to update schedutil since we still see either
> the blocked utilization of a sleeping task or the not yet removed
> utilization of a migrating task. In both cases the risk is to ask for
> a higher OPP right when a CPU is going to be IDLE.

We have to take care of not mixing the opportunity to update the
frequency when we are updating the utilization with the policy that we
want to apply regarding (what we think that is) the best time to
update the frequency. Like saying that we should wait a bit more to
make sure that the current utilization is sustainable because a
frequency change is expensive on the platform (or not)

It's not because a task is dequeued that we should not update and
increase the frequency; Or even that we should not decrease it because
we have just taken into account some removed utilization of a previous
The same happens when a task migrates, we don't know if the utilization
that is about to be migrated, will be higher or lower than the normal
update of the utilization (since the last update) and can not generate
a frequency change

I see your explanation above like a kind of policy where you want to
balance the cost of a frequency change with the probability that we
will not have to re-update the frequency soon.

I agree that some scheduling events give higher chances of a
sustainable utilization level and we should favor these events when
the frequency change is costly but I'm not sure that we should remove
all other opportunities to adjust the frequency to the current
utilization level when the cost is low or negligible.

Can't we classify the utilization events into some kind of major and
minor changes ?
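One way to picture such a classification is the sketch below. It is purely illustrative: the enum, the structure and the threshold are hypothetical, and which scheduling events belong to which class is left open in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event classes; the mapping of real events is not decided. */
enum toy_util_event {
    TOY_EVENT_MINOR,  /* lower chance of a sustainable utilization level */
    TOY_EVENT_MAJOR,  /* higher chance of a sustainable utilization level */
};

/* Hypothetical per-platform description of the frequency change cost. */
struct toy_freq_policy {
    unsigned int transition_cost_us;  /* cost of one frequency change */
    unsigned int cheap_threshold_us;  /* at or below this, the cost is negligible */
};

/*
 * Cheap platforms follow every utilization event; costly platforms
 * react to major events only.
 */
static bool toy_should_update_freq(const struct toy_freq_policy *pol,
                                   enum toy_util_event ev)
{
    assert(pol);
    if (pol->transition_cost_us <= pol->cheap_threshold_us)
        return true;
    return ev == TOY_EVENT_MAJOR;
}
```

The point of the split is that the scheduler keeps reporting every opportunity, while the decision to act on it depends on a platform cost that the scheduler does not need to know.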

> Moreover, it seems that in general we prefer a "conservative" approach
> in frequency reduction.
> For example it could be harmful to trigger a frequency reduction when
> a task is migrating off a CPU, if right after another task should be
> instead migrated into the same CPU.
> .:: Conclusions
> ===============
> All that considered, I think I've convinced myself that we really need
> to notify schedutil only in these cases:
> 1. enqueue time
> because of the changes in estimated utilization and the
> possibility to jump straight to a better OPP
> 2. task tick time
> because of the possible ramp-up of the utilization
> Another case is related to remote CPUs blocked utilization update,
> after Vincent's recent patches. Currently indeed:
>   update_blocked_averages()
>     update_load_avg()
>       --> update schedutil
> and thus, potentially we wake up an IDLE cluster just to reduce its
> OPP. If the cluster is in a deep idle state, I'm not entirely sure
> this is good from an energy saving standpoint.
> However, with the patch I'm proposing we are missing that support,
> meaning that an IDLE cluster will get its utilization decayed but we
> don't wake it up just to drop its frequency.

So more than deciding in the scheduler if we should wake it up or not,
we should give a chance to cpufreq to decide if it wants to update the
frequency or not as this decision is somehow platform specific: cost
of frequency change, clock topology and shared clock, voltage topology

> Perhaps we should better pass in this information to schedutil via a
> flag (e.g. SCHED_FREQ_REMOTE_UPDATE) and implement there a policy to
> decide if and when it makes sense to drop the OPP. Or otherwise find a
> way for the special DL tasks to always run on the lower capacity_orig
> CPUs.
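A rough sketch of how a flag like the suggested SCHED_FREQ_REMOTE_UPDATE could let the governor side decide. Everything here is an assumption for illustration: the TOY_FREQ_REMOTE_UPDATE value, the toy_cpu structure and the "defer a reduction while in deep idle" rule are hypothetical, and a real policy would be platform specific.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag modelling the SCHED_FREQ_REMOTE_UPDATE idea. */
#define TOY_FREQ_REMOTE_UPDATE (1U << 0)

/* Toy view of what the governor knows about the target CPU. */
struct toy_cpu {
    bool deep_idle;             /* CPU/cluster is in a deep idle state */
    unsigned int cur_freq_khz;  /* frequency currently programmed */
};

/*
 * Apply a frequency request and return the resulting frequency.
 * A remote request that would only lower the frequency of a deeply
 * idle CPU is deferred instead of waking the CPU up.
 */
static unsigned int toy_governor_update(struct toy_cpu *cpu,
                                        unsigned int flags,
                                        unsigned int wanted_khz)
{
    assert(cpu);
    bool lowering = wanted_khz < cpu->cur_freq_khz;

    if ((flags & TOY_FREQ_REMOTE_UPDATE) && cpu->deep_idle && lowering)
        return cpu->cur_freq_khz;

    cpu->cur_freq_khz = wanted_khz;
    return cpu->cur_freq_khz;
}
```

In this model the scheduler always reports the update, and the choice to skip it lives in the governor, where the cost of a frequency change and the clock and voltage topology are known.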
>> Sorry if I missed something obvious.
> Thanks for the question, it has actually triggered a better analysis of
> what we have and what we need.
> Looking forward to some feedback about the above before posting a new
> version of this last patch.
>> thanks!
>> - Joel
> --
> #include <best/regards.h>
> Patrick Bellasi