Re: [RFC PATCH 6/7] sched/cpufreq: Improve sugov_cpu_is_busy accuracy

From: Douglas Raillard
Date: Wed Jun 19 2019 - 12:24:58 EST

Hi Patrick,

On 5/16/19 1:55 PM, Patrick Bellasi wrote:
On 08-May 18:43, douglas.raillard@xxxxxxx wrote:
From: Douglas RAILLARD <douglas.raillard@xxxxxxx>

Avoid assuming a CPU is busy when it has begun being idle before
get_next_freq() is called. This is achieved by making sure the CPU will
not be detected as busy by other CPUs whenever its utilization is

If I understand it correctly, what you are after here is a "metric"
which tells you (in a shared performance domain) if a CPU has been
busy for a certain amount of time.
You do that by reworking the way idle_calls are accounted for the
sugov_update_single() case.

That approach could work but it looks a bit convoluted in the way it's
coded and it's also difficult to exclude there could be corner cases
with weird behaviors.
Isn't that why you "fix" the saved_idle_calls counter after all?

What about a different approach where we:

1. annotate the timestamp at which a CPU wakes up from idle (last_wakeup_time)

2. use that annotated last_wakeup_time and the rq->nr_running to
define the "cpu is busy" heuristic.

Looking at a sibling CPU, I think we can end up with two main cases:

1. CPU's nr_running is == 0
then we don't consider that CPU busy

2. CPU's nr_running is > 0
then the CPU is busy iff
(current_time - last_wakeup_time) >= busy_threshold

Notice that, when a CPU is active, its rq clock is periodically
updated, at least once per tick. Thus, provided a tick period is not
too long an interval for declaring a CPU busy, the above logic should work.

Perhaps the busy_threshold can also be defined considering the PELT
dynamics, starting from an expected utilization increase converted
to time.
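
For reference, the two cases above can be sketched in plain C as follows.
This is only an illustration of the proposed heuristic, not kernel code:
struct cpu_state, cpu_is_busy() and the way busy_threshold is passed in
are hypothetical names made up for the example, and nr_running merely
stands in for rq->nr_running.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; these names do not exist in the kernel. */
struct cpu_state {
	unsigned int nr_running;    /* stand-in for rq->nr_running */
	uint64_t last_wakeup_time;  /* ns timestamp of the last exit from idle */
};

/*
 * Case 1: nothing is runnable, so the CPU is not busy.
 * Case 2: something is runnable, so the CPU is busy iff it has been
 *         awake for at least busy_threshold nanoseconds.
 */
static bool cpu_is_busy(const struct cpu_state *cpu, uint64_t now,
			uint64_t busy_threshold)
{
	if (cpu->nr_running == 0)
		return false;

	return (now - cpu->last_wakeup_time) >= busy_threshold;
}
```

With an arbitrary 4 ms threshold and now = 10 ms, a CPU that woke up at
9 ms with one runnable task is not reported busy yet, while a CPU that
woke up at 2 ms and still has runnable tasks is.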

After experimenting with quite a few combinations, I managed to get a heuristic
based on util_avg and util_est_enqueued that seems to work better in my case.
Key differences are:
* this new heuristic only really takes into account CFS signals
(the current one, based on idle calls, takes into account everything that
executes on a given CPU).
* it will mark a CPU as busy less often, since it should only trigger when
there is a change in the utilization of a currently enqueued task.
Util changes due to enqueue/dequeue will not trigger it, which IMHO
is desirable, since we only want to bias frequency selection
when we know that we don't have precise utilization values for the
enqueued tasks (because the task has changed its behavior).

That change will be part of v2 posting of this series.
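
To make the intent of such a heuristic concrete, here is a rough sketch of
one way it could look. It is an assumption made for illustration, not the
code that will be in v2: struct cfs_sample and cfs_util_is_ramping() are
hypothetical names, and the exact conditions are guessed from the two
bullet points above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the CFS signals of one CPU. */
struct cfs_sample {
	unsigned long util_avg;           /* PELT utilization */
	unsigned long util_est_enqueued;  /* estimate for the enqueued tasks */
};

/*
 * Assumed rule: report "busy" only when the enqueued tasks are using
 * more than what was estimated for them.
 */
static bool cfs_util_is_ramping(const struct cfs_sample *prev,
				const struct cfs_sample *cur)
{
	/* The estimate changed: a task was enqueued or dequeued, ignore. */
	if (cur->util_est_enqueued != prev->util_est_enqueued)
		return false;

	/* util_avg is decaying, so nothing is ramping up. */
	if (cur->util_avg < prev->util_avg)
		return false;

	/* util_avg has overtaken the estimate: the behavior has changed. */
	return cur->util_avg > cur->util_est_enqueued;
}
```

The first check is what keeps enqueue/dequeue from triggering the
heuristic, since both change util_est_enqueued between two samples.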
Could something like the above work or am I missing something?

Signed-off-by: Douglas RAILLARD <douglas.raillard@xxxxxxx>
kernel/sched/cpufreq_schedutil.c | 42 ++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a12b7e5bc028..ce4b90cafbb5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -62,6 +62,7 @@ struct sugov_cpu {
/* The field below is for single-CPU policies only: */
unsigned long saved_idle_calls;
+ unsigned long previous_util;
@@ -181,14 +182,35 @@ static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
return ret;
-static void sugov_cpu_is_busy_update(struct sugov_cpu *sg_cpu)
+static void sugov_cpu_is_busy_update(struct sugov_cpu *sg_cpu,
+ unsigned long util)
unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
sg_cpu->saved_idle_calls = idle_calls;
+ /*
+ * Make sure that this CPU will not be immediately considered as busy in
+ * cases where the CPU has already entered an idle state. In that case,
+ * the number of idle_calls will not vary anymore until it exits idle,
+ * which would lead sugov_cpu_is_busy() to say that this CPU is busy,
+ * because it has not (re)entered idle since the last time we looked at
+ * it.
+ * Assuming cpu0 and cpu1 are in the same policy, that will make sure
+ * this sequence of events leads to the right cpu0 busy status when
+ * seen from get_next_freq(cpu=1)
+ * cpu0: [enter idle] -> [get_next_freq] -> [doing nothing] -> [wakeup]
+ * cpu1: ... -> [get_next_freq] -> ...
+ */
+ if (util <= sg_cpu->previous_util)
+ sg_cpu->saved_idle_calls--;
+ sg_cpu->previous_util = util;
static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
-static void sugov_cpu_is_busy_update(struct sugov_cpu *sg_cpu) {}
+static void sugov_cpu_is_busy_update(struct sugov_cpu *sg_cpu,
+ unsigned long util) {}
#endif /* CONFIG_NO_HZ_COMMON */
@@ -507,10 +529,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (!sugov_should_update_freq(sg_policy, time))
- busy = sugov_cpu_is_busy(sg_cpu);
- sugov_cpu_is_busy_update(sg_cpu);
util = sugov_get_util(sg_cpu);
+ busy = sugov_cpu_is_busy(sg_cpu);
+ sugov_cpu_is_busy_update(sg_cpu, util);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
next_f = get_next_freq(sg_policy, util, max);
@@ -545,12 +566,15 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util = 0, max = 1;
unsigned int j;
+ unsigned long sg_cpu_util = 0;
for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long j_util, j_max;
j_util = sugov_get_util(j_sg_cpu);
+ if (j_sg_cpu == sg_cpu)
+ sg_cpu_util = j_util;
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
@@ -560,6 +584,14 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
+ /*
+ * Only update the busy status if we are looking at the CPU for
+ * which a utilization change triggered a call to get_next_freq(). This
+ * way, we don't affect the "busy" status of CPUs that don't have any
+ * change in utilization.
+ */
+ sugov_cpu_is_busy_update(sg_cpu, sg_cpu_util);
return get_next_freq(sg_policy, util, max);
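
The effect of the saved_idle_calls adjustment in this patch can be
reproduced outside the kernel with a small model. This is a sketch under
assumptions: struct sim_cpu and the sim_* helpers are hypothetical, the
idle_calls field stands in for tick_nohz_get_idle_calls_cpu(), and the
busy check is assumed to be a plain comparison of the two counters.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the per-CPU bookkeeping. */
struct sim_cpu {
	unsigned long idle_calls;        /* bumped each time the CPU enters idle */
	unsigned long saved_idle_calls;  /* snapshot taken at the last update */
	unsigned long previous_util;     /* utilization seen at the last update */
};

/* Busy means: no idle entry since the snapshot was taken. */
static bool sim_cpu_is_busy(const struct sim_cpu *c)
{
	return c->idle_calls == c->saved_idle_calls;
}

/*
 * Mirrors the logic added by the patch: if utilization did not increase,
 * make the snapshot stale on purpose so that the next check cannot
 * report the CPU as busy until a later update says otherwise.
 */
static void sim_cpu_is_busy_update(struct sim_cpu *c, unsigned long util)
{
	c->saved_idle_calls = c->idle_calls;
	if (util <= c->previous_util)
		c->saved_idle_calls--;
	c->previous_util = util;
}
```

For the cpu0 sequence described in the comment: cpu0 enters idle
(idle_calls becomes 5), then an update runs with a utilization that did
not increase, so saved_idle_calls ends up at 4 and a later check from cpu1
sees cpu0 as not busy. Without the decrement, both counters would be 5 and
the idle cpu0 would be reported as busy.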