Re: [PATCH v2 1/2] sched/fair: Take thermal pressure into account while estimating energy

From: Vincent Guittot
Date: Thu Jun 10 2021 - 04:01:41 EST


On Fri, 4 Jun 2021 at 10:10, Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
>
> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
> requests made by the SchedUtil governor to properly estimate energy used
> in the future. It has to take into account CPU utilization and forecast the
> Performance Domain (PD) frequency. There is a corner case when the max
> allowed frequency might be reduced due to thermal. SchedUtil is aware of
> that reduced frequency, so it should be taken into account also in EAS
> estimations.
>
> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
> to 'policy::max'. SchedUtil is responsible to respect that upper limit
> while setting the frequency through CPUFreq drivers. This effective
> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
> to predict that value.
>
> In the existing code the raw value of arch_scale_cpu_capacity() is used
> for clamping the returned CPU utilization from effective_cpu_util().
> This patch fixes the issue of a too-large single-CPU utilization by introducing
> clamping to the allowed CPU capacity. The allowed CPU capacity is the CPU
> capacity reduced by the thermal pressure signal. We rely on this load avg
> geometric series in a similar way as other mechanisms in the scheduler.
>
> Thanks to knowledge about allowed CPU capacity, we don't get too big a value
> for a single CPU utilization, which is then added to the util sum. The
> util sum is used as a source of information for estimating whole PD energy.
> To avoid wrong energy estimation in EAS (due to capped frequency), make
> sure that the calculation of util sum is aware of allowed CPU capacity.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@xxxxxxx>
> ---
> kernel/sched/fair.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 161b92aa1c79..1aeddecabc20 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6527,6 +6527,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  	struct cpumask *pd_mask = perf_domain_span(pd);
>  	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>  	unsigned long max_util = 0, sum_util = 0;
> +	unsigned long _cpu_cap = cpu_cap;
>  	int cpu;
>
>  	/*
> @@ -6558,14 +6559,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  				cpu_util_next(cpu, p, -1) + task_util_est(p);
>  		}
>
> +		/*
> +		 * Take the thermal pressure from non-idle CPUs. They have
> +		 * most up-to-date information. For idle CPUs thermal pressure
> +		 * signal is not updated so often.

What do you mean by "not updated so often"? Do you have a value?

Thermal pressure is updated at the same rate as the other PELT values of
an idle CPU. Why is it a problem there?

> +		 */
> +		if (!idle_cpu(cpu))
> +			_cpu_cap = cpu_cap - thermal_load_avg(cpu_rq(cpu));
> +
>  		/*
>  		 * Busy time computation: utilization clamping is not
>  		 * required since the ratio (sum_util / cpu_capacity)
>  		 * is already enough to scale the EM reported power
>  		 * consumption at the (eventually clamped) cpu_capacity.
>  		 */
> -		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
> -					       ENERGY_UTIL, NULL);
> +		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> +					      ENERGY_UTIL, NULL);
> +
> +		sum_util += min(cpu_util, _cpu_cap);
>
>  		/*
>  		 * Performance domain frequency: utilization clamping
> @@ -6576,7 +6587,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  		 */
>  		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
>  					      FREQUENCY_UTIL, tsk);
> -		max_util = max(max_util, cpu_util);
> +		max_util = max(max_util, min(cpu_util, _cpu_cap));
>  	}
>
>  	return em_cpu_energy(pd->em_pd, max_util, sum_util);
> --
> 2.17.1
>
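
For reference, the clamping discussed above can be sketched in userspace as
below. This is only an illustration under stated assumptions, not the kernel
code: struct cpu_sample, struct pd_util, min_ul() and pd_clamped_util() are
hypothetical names, a single utilization value per CPU stands in for both the
ENERGY_UTIL and FREQUENCY_UTIL results of effective_cpu_util(), and the
underflow guard on the subtraction is an addition that the patch does not
have. As in the patch, the allowed capacity is only refreshed for non-idle
CPUs, so an idle CPU inherits the value from the previous non-idle one.

```c
/* Userspace sketch; none of these names are kernel APIs. */

struct cpu_sample {
	unsigned long util;             /* estimated utilization of the CPU */
	unsigned long thermal_pressure; /* capacity lost to thermal capping */
	int idle;                       /* non-zero if the CPU is idle */
};

struct pd_util {
	unsigned long max_util;         /* would drive the PD frequency estimate */
	unsigned long sum_util;         /* would drive the PD busy-time estimate */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Walk the CPUs of one performance domain and clamp each utilization to the
 * capacity that thermal capping still allows, mirroring the role of _cpu_cap.
 */
static struct pd_util pd_clamped_util(const struct cpu_sample *cpus,
				      int nr_cpus, unsigned long cpu_cap)
{
	struct pd_util out = { 0, 0 };
	unsigned long allowed_cap = cpu_cap;
	int i;

	for (i = 0; i < nr_cpus; i++) {
		unsigned long util;

		/* Only non-idle CPUs refresh the allowed capacity. */
		if (!cpus[i].idle)
			allowed_cap = cpu_cap -
				min_ul(cpus[i].thermal_pressure, cpu_cap);

		util = min_ul(cpus[i].util, allowed_cap);
		out.sum_util += util;
		if (util > out.max_util)
			out.max_util = util;
	}

	return out;
}
```

With cpu_cap = 1024, a busy CPU at util 900 under a thermal pressure of 200
is clamped to 824, and an idle CPU at util 100 that follows it is left at 100,
giving sum_util = 924 and max_util = 824.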