[PATCH] sched/fair: schedutil: update only with all info available

From: Patrick Bellasi
Date: Fri Apr 06 2018 - 13:28:51 EST

Schedutil is not properly updated when the first FAIR task wakes up on a
CPU and when a RQ is (un)throttled. This is mainly due to the current
integration strategy, which relies on updates being triggered implicitly
each time a cfs_rq's utilization is updated.

Those updates are currently provided (mainly) via
which is used in:
- update_cfs_rq_load_avg()
when the utilization of a cfs_rq is updated
- {attach,detach}_entity_load_avg()
This is done based on the idea that "we should callback schedutil
frequently enough" to properly update the CPU frequency at every
utilization change.

Since this recent schedutil update:

commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")

we use additional RQ information to properly account for FAIR tasks
utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
in sugov_aggregate_util() to sum up the cfs_rq's utilization.

However, cfs_rq::h_nr_running is usually updated as:

cfs_rq_util_change ==> trigger schedutil update
cfs_rq->h_nr_running += number_of_tasks

both in enqueue_task_fair() as well as in unthrottle_cfs_rq().
A similar pattern is used also in dequeue_task_fair() and
throttle_cfs_rq() to remove tasks.

This means that we are likely to see a zero cfs_rq utilization when we
enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when
instead, for example, we are throttling all the FAIR tasks of a CPU.

While the second issue is less important, since we are less likely to
reduce frequency when CPU utilization decreases, the first issue can
instead impact performance. Indeed, we potentially introduce a not desired
latency between a task enqueue on a CPU and its frequency increase.

Another possible unwanted side effect is the iowait boosting of a CPU
when we enqueue a task into a throttled cfs_rq.

Moreover, the current schedutil integration has these other downsides:

- schedutil updates are triggered by RQ's load updates, which makes
sense in general but it does not allow to know exactly which other RQ
related information has been updated (e.g. h_nr_running).

- increasing the chances to update schedutil does not always correspond
to provide the most accurate information for a proper frequency
selection, thus we can skip some updates.

- we don't know exactly at which point a schedutil update is triggered,
and thus potentially a frequency change started, and that's because
the update is a side effect of cfs_rq_util_changed instead of an
explicit call from the most suitable call path.

- cfs_rq_util_change() is mainly a wrapper function for an already
existing "public API", cpufreq_update_util(), to ensure we actually
update schedutil only when we are updating a root RQ. Thus, especially
when task groups are in use, most of the calls to this wrapper
function are really not required.

- the usage of a wrapper function is not completely consistent across
fair.c, since we still need sometimes additional explicit calls to
cpufreq_update_util(), for example to support the IOWAIT boot flag in
the wakeup path

- it makes it hard to integrate new features since it could require to
change other function prototypes just to pass in an additional flag,
as it happened for example here:

commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")

All the above considered, let's try to make schedutil updates more
explicit in fair.c by:

- removing the cfs_rq_util_change() wrapper function to use the
cpufreq_update_util() public API only when root cfs_rq is updated

- remove indirect and side-effect (sometimes not required) schedutil
updates when the cfs_rq utilization is updated

- call cpufreq_update_util() explicitly in the few call sites where it
really makes sense and all the required information has been updated

By doing so, this patch mainly removes code and adds explicit calls to
schedutil only when we:
- {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
- (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
- task_tick_fair() to update the utilization of the root cfs_rq

All the other code paths, currently _indirectly_ covered by a call to
update_load_avg(), are also covered by the above three calls.
Some already imply enqueue/dequeue calls:
- switch_{to,from}_fair()
- sched_move_task()
or are followed by enqueue/dequeue calls:
- cpu_cgroup_fork() and
are used at wakeup_new_task() time and thus already followed by an
- migrate_task_rq_fair():
updates the removed utilization but not the actual cfs_rq
utilization, which is updated by a following sched event

This new proposal allows also to better aggregate schedutil related
flags, which are required only at enqueue_task_fair() time.
Indeed, IOWAIT and MIGRATION flags are now requested only when a task is
actually visible at the root cfs_rq level.

Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Cc: Viresh Kumar <viresh.kumar@xxxxxxxxxx>
Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: linux-pm@xxxxxxxxxxxxxxx


The SCHED_CPUFREQ_MIGRATION flags, recently introduced by:

ea14b57e8a18 sched/cpufreq: Provide migration hint

is maintained although there are not actual usages so far in mainline
for this hint... do we really need it?
kernel/sched/fair.c | 84 ++++++++++++++++++++++++-----------------------------
1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c58d2f..e726f91f0089 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
* For !fair tasks do:
update_cfs_rq_load_avg(now, cfs_rq);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
switched_from_fair(rq, p);
* such that the next switched_to_fair() has the
@@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)

-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
- struct rq *rq = rq_of(cfs_rq);
- if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
- /*
- * There are a few boundary cases this might miss but it should
- * get called often enough that that should (hopefully) not be
- * a real problem.
- *
- * It will not get called when we go idle, because the idle
- * thread is a different class (!fair), nor will the utilization
- * number include things like RT tasks.
- *
- * As is, the util number is not freq-invariant (we'd have to
- * implement arch_scale_freq_capacity() for that).
- *
- * See cpu_util().
- */
- cpufreq_update_util(rq, flags);
- }
* Approximate:
@@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->load_last_update_time_copy = sa->last_update_time;

- if (decayed)
- cfs_rq_util_change(cfs_rq, 0);
return decayed;

@@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
* Must call update_cfs_rq_load_avg() before this, since we rely on
* cfs_rq->avg.last_update_time being current.
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;

@@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);

- cfs_rq_util_change(cfs_rq, flags);

@@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);

- cfs_rq_util_change(cfs_rq, 0);

@@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* IOW we're enqueueing a task on a new CPU.
- attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, 0);

} else if (decayed && (flags & UPDATE_TG))
@@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
- cfs_rq_util_change(cfs_rq, 0);

static inline void remove_entity_load_avg(struct sched_entity *se) {}

static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
+attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}

@@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue = 0;

- if (!se)
+ /* The tasks are no more visible from the root cfs_rq */
+ if (!se) {
sub_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }

cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
@@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

- if (!se)
+ /* The tasks are now visible from the root cfs_rq */
+ if (!se) {
add_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5356,14 +5333,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

- /*
- * If in_iowait is set, the code below may not trigger any cpufreq
- * utilization updates, so do it here explicitly with the IOWAIT flag
- * passed.
- */
- if (p->in_iowait)
- cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) {
if (se->on_rq)
@@ -5394,9 +5363,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

- if (!se)
+ /* The task is visible from the root cfs_rq */
+ if (!se) {
+ unsigned int flags = 0;
add_nr_running(rq, 1);

+ if (p->in_iowait)
+ /*
+ * !last_update_time means we've passed through
+ * migrate_task_rq_fair() indicating we migrated.
+ *
+ * IOW we're enqueueing a task on a new CPU.
+ */
+ if (!p->se.avg.last_update_time)
+ cpufreq_update_util(rq, flags);
+ }
util_est_enqueue(&rq->cfs, p);
@@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

- if (!se)
+ /* The task is no more visible from the root cfs_rq */
+ if (!se) {
sub_nr_running(rq, 1);
+ cpufreq_update_util(rq, 0);
+ }

util_est_dequeue(&rq->cfs, p, task_sleep);
@@ -9950,6 +9940,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+ cpufreq_update_util(rq, 0);

@@ -10087,7 +10079,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)

/* Synchronize entity with its cfs_rq */
update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, false);