Re: [PATCH v2 1/2] sched/fair: Take thermal pressure into account while estimating energy

From: Vincent Guittot
Date: Thu Jun 10 2021 - 05:12:38 EST


On Thu, 10 Jun 2021 at 10:42, Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
>
>
>
> On 6/10/21 8:59 AM, Vincent Guittot wrote:
> > On Fri, 4 Jun 2021 at 10:10, Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
> >>
> >> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
> >> requests made by the SchedUtil governor to properly estimate energy used
> >> in the future. It has to take into account CPUs utilization and forecast
> >> Performance Domain (PD) frequency. There is a corner case when the max
> >> allowed frequency might be reduced due to thermal. SchedUtil is aware of
> >> that reduced frequency, so it should be taken into account also in EAS
> >> estimations.
> >>
> >> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
> >> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
> >> to 'policy::max'. SchedUtil is responsible to respect that upper limit
> >> while setting the frequency through CPUFreq drivers. This effective
> >> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
> >> to predict that value.
> >>
> >> In the existing code the raw value of arch_scale_cpu_capacity() is used
> >> for clamping the returned CPU utilization from effective_cpu_util().
> >> This patch fixes the issue of a single CPU utilization being too big, by
> >> introducing clamping to the allowed CPU capacity. The allowed CPU capacity
> >> is the CPU capacity reduced by the thermal pressure signal. We rely on this
> >> load avg geometric series in a similar way to other mechanisms in the scheduler.
> >>
> >> Thanks to knowledge about allowed CPU capacity, we don't get too big value
> >> for a single CPU utilization, which is then added to the util sum. The
> >> util sum is used as a source of information for estimating whole PD energy.
> >> To avoid wrong energy estimation in EAS (due to capped frequency), make
> >> sure that the calculation of util sum is aware of allowed CPU capacity.
> >>
> >> Signed-off-by: Lukasz Luba <lukasz.luba@xxxxxxx>
> >> ---
> >> kernel/sched/fair.c | 17 ++++++++++++++---
> >> 1 file changed, 14 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 161b92aa1c79..1aeddecabc20 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6527,6 +6527,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> >> struct cpumask *pd_mask = perf_domain_span(pd);
> >> unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> >> unsigned long max_util = 0, sum_util = 0;
> >> + unsigned long _cpu_cap = cpu_cap;
> >> int cpu;
> >>
> >> /*
> >> @@ -6558,14 +6559,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> >> cpu_util_next(cpu, p, -1) + task_util_est(p);
> >> }
> >>
> >> + /*
> >> + * Take the thermal pressure from non-idle CPUs. They have
> >> + * most up-to-date information. For idle CPUs thermal pressure
> >> + * signal is not updated so often.
> >
> > What do you mean by "not updated so often"? Do you have a value?
> >
> > Thermal pressure is updated at the same rate as other PELT values of
> > an idle CPU. Why is it a problem there?
> >
>
>
> For idle CPU the value is updated 'remotely' by some other CPU
> running nohz_idle_balance(). That goes into
> update_blocked_averages() if the flags and checks are OK inside
> update_nohz_stats(). Sometimes this is not called
> because other_have_blocked() returned false. It can happen for a long

So I missed that you were in a loop, that the below is called for each
CPU, and that _cpu_cap is overwritten each time:

+ if (!idle_cpu(cpu))
+ _cpu_cap = cpu_cap - thermal_load_avg(cpu_rq(cpu));

But that also means that if the first CPUs of the PD are idle, they will
use the original capacity, whereas the later ones will have the thermal
pressure removed. Isn't this a problem? You don't use the same capacity
for all CPUs in the performance domain with regard to thermal pressure?
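
To make the concern concrete, here is a minimal user-space sketch, not
kernel code. The struct, the helper names and the numbers are all
hypothetical. It also assumes that each CPU's util is clamped with min()
against _cpu_cap before being added to the sum, which the hunk quoted
above does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU state; not a kernel structure. */
struct fake_cpu {
	bool idle;                      /* stands in for idle_cpu(cpu) */
	unsigned long thermal_pressure; /* stands in for thermal_load_avg() */
	unsigned long util;             /* stands in for the CPU's util */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Loop shape as I read the patch: _cpu_cap is only refreshed when a
 * non-idle CPU is visited, so idle CPUs visited earlier are still
 * clamped to the raw capacity.
 */
static unsigned long sum_util_patch_style(const struct fake_cpu *cpus, int n,
					  unsigned long cpu_cap)
{
	unsigned long _cpu_cap = cpu_cap, sum_util = 0;
	int i;

	for (i = 0; i < n; i++) {
		assert(cpus[i].thermal_pressure <= cpu_cap);
		if (!cpus[i].idle)
			_cpu_cap = cpu_cap - cpus[i].thermal_pressure;
		sum_util += min_ul(cpus[i].util, _cpu_cap);
	}
	return sum_util;
}

/*
 * Assumed alternative: pick the capacity once (from the first non-idle
 * CPU, if any) and apply it to every CPU of the PD. This needs a second
 * loop.
 */
static unsigned long sum_util_uniform(const struct fake_cpu *cpus, int n,
				      unsigned long cpu_cap)
{
	unsigned long pd_cap = cpu_cap, sum_util = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (!cpus[i].idle) {
			assert(cpus[i].thermal_pressure <= cpu_cap);
			pd_cap = cpu_cap - cpus[i].thermal_pressure;
			break;
		}
	}
	for (i = 0; i < n; i++)
		sum_util += min_ul(cpus[i].util, pd_cap);
	return sum_util;
}
```

With a capacity of 1024, four CPUs each with util 800 and thermal
pressure 300, and only the third CPU non-idle, the first function clamps
the first two CPUs to 1024 and the last two to 724, while the second
clamps all four to 724. The result of the first depends on where the
first non-idle CPU sits in the PD mask.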

> idle CPU, for which all signals in that function are 0 [1].
>
> As a result, we don't check the new value stored by the thermal
> cpufreq_cooling code for the thermal pressure [2]. We should feed
> that value into the 'signal' machinery inside
> __update_blocked_others() [3]. Unfortunately, in a corner case there's
> a flag (rq->has_blocked_load) which blocks the check of the
> raw thermal value and prevents feeding it into the thermal pressure signal
> (since it's a long-idle CPU, there is no load) [4].
>
> This has an implication for this patch, because I cannot, e.g., take the
> first CPU from the PD mask and blindly check its thermal pressure,
> because it may have been idle for a long time. I don't want to have two
> loops, with the first one just for taking the latest thermal pressure for
> the PD. Thus, I want to re-use the existing loop to take the latest
> information from a non-idle CPU and use it.
>
> Regards,
> Lukasz
>
>
> [1] https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c#L7909
> [2]
> https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpufreq_cooling.c#L494
> [3] https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c#L7958
> [4] https://elixir.bootlin.com/linux/latest/source/kernel/sched/fair.c#L8433