Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling up with FIE present

From: Christian Loehle
Date: Thu Dec 12 2024 - 07:06:37 EST

Next message: Ojaswin Mujoo: "Re: [RFC 3/3] xfs_io: add extsize command support"
Previous message: Quentin Perret: "Re: [PATCH v2 17/18] KVM: arm64: Introduce the EL1 pKVM MMU"
In reply to: Sultan Alsawaf (unemployed): "Re: [PATCH 1/2] cpufreq: schedutil: Fix superfluous updates caused by need_freq_update"
Next in thread: Sultan Alsawaf (unemployed): "Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling up with FIE present"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/12/24 01:57, Sultan Alsawaf wrote:
> From: "Sultan Alsawaf (unemployed)" <sultan@xxxxxxxxxxxxxxx>

Hi Sultan,
(Adding further CCs that might be interested in this)
Good to see some input here again, if it becomes a regular thing,
which I would welcome, you might want to look into git send-email
or b4 relay, at least in my inbox this series looks strange.
Also no cover letter.

>
> When schedutil disregards a frequency transition due to the transition rate
> limit, there is no guaranteed deadline as to when the frequency transition
> will actually occur after the rate limit expires. For instance, depending
> on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
> frequency transition may be delayed indefinitely, until said CPU reaches
> the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.

Realistically this will be the tick at worst for the systems you care about,
I assume.

>
> For frequency transitions _down_, this only poses a theoretical loss of
> energy savings since a CPU may remain at a higher frequency than necessary
> for an indefinite period beyond the rate limit expiry.

theoretical?

>
> For frequency transitions _up_, however, this poses a significant hit to
> performance when a CPU is stuck at an insufficient frequency for an
> indefinitely long time. In latency-sensitive and bursty workloads
> especially, a missed frequency transition up can result in a significant
> performance loss due to a CPU operating at an insufficient frequency for
> too long.

The term significant implies you have some numbers, please go ahead and
share them then.

I'm not sure if you're aware of Qais' series that also intends to ignore the
rate-limit (in certain cases).
https://lore.kernel.org/lkml/20240728184551.42133-1-qyousef@xxxxxxxxxxx/
I would mostly agree with your below argument regarding FIE and did lean
towards dropping it altogether. The main issue I had with Qais' series
was !fast_switch platforms and the resulting preemption by the DL task
(Often on a remote, possibly idle CPU of the same perf-domain).
Unlimited frequency updates can really hurt both power and throughput there.

Fortunately given the low-pass filter nature of PELT, without some special
workloads, most calls to cpufreq_update_util() are dropped because there
is no resulting frequency change anyway.

You keeping the rate-limit when reducing the frequency might be enough to
work around the issue though. I will run some experiments like I did for
Qais' series.

I'll also go and check for unintended changes in iowait boost (that
interacts with the rate-limit too).

>
> When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
> rate limit is always required for the scheduler to compute CPU utilization
> with some semblance of accuracy: any frequency transition that occurs
> before the previous transition latches would result in the scheduler not
> knowing the frequency a CPU is actually operating at, thereby trashing the
> computed CPU utilization.
>
> However, when FIE support _is_ present, there's no technical requirement to
> rate limit all frequency transitions to a cpufreq driver's reported
> transition latency. With FIE, the scheduler's CPU utilization tracking is
> unaffected by any frequency transitions that occur before the previous
> frequency is latched.
>
> Therefore, ignore the frequency transition rate limit when scaling up on
> systems where FIE is present. This guarantees that transitions to a higher
> frequency cannot be indefinitely delayed, since they simply cannot be
> delayed at all.
>
> Signed-off-by: Sultan Alsawaf (unemployed) <sultan@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/cpufreq_schedutil.c | 35 +++++++++++++++++++++++++++-----
> 1 file changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e51d5ce730be..563baab89a24 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -59,10 +59,15 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
> /************************ Governor internals ***********************/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +static bool sugov_should_rate_limit(struct sugov_policy *sg_policy, u64 time)
> {
> - s64 delta_ns;
> + s64 delta_ns = time - sg_policy->last_freq_update_time;
> +
> + return delta_ns < sg_policy->freq_update_delay_ns;
> +}
>
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +{
> /*
> * Since cpufreq_update_util() is called with rq->lock held for
> * the @target_cpu, our per-CPU data is fully serialized.
> @@ -87,17 +92,37 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> return true;
> }
>
> - delta_ns = time - sg_policy->last_freq_update_time;
> + /*
> + * When frequency-invariant utilization tracking is present, there's no
> + * rate limit when increasing frequency. Therefore, the next frequency
> + * must be determined before a decision can be made to rate limit the
> + * frequency change, hence the rate limit check is bypassed here.
> + */
> + if (arch_scale_freq_invariant())
> + return true;
>
> - return delta_ns >= sg_policy->freq_update_delay_ns;
> + return !sugov_should_rate_limit(sg_policy, time);
> }
>
> static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> unsigned int next_freq)
> {
> + /*
> + * When a frequency update isn't mandatory (!need_freq_update), the rate
> + * limit is checked again upon frequency reduction because systems with
> + * frequency-invariant utilization bypass the rate limit check entirely
> + * in sugov_should_update_freq(). This is done so that the rate limit
> + * can be applied only for frequency reduction when frequency-invariant
> + * utilization is present. Now that the next frequency is known, the
> + * rate limit can be selectively applied to frequency reduction on such
> + * systems. A check for arch_scale_freq_invariant() is omitted here
> + * because unconditionally rechecking the rate limit is cheaper.
> + */
> if (sg_policy->need_freq_update)
> sg_policy->need_freq_update = false;
> - else if (sg_policy->next_freq == next_freq)
> + else if (next_freq == sg_policy->next_freq ||
> + (next_freq < sg_policy->next_freq &&
> + sugov_should_rate_limit(sg_policy, time)))
> return false;
>
> sg_policy->next_freq = next_freq;

Code seems to match your description FWIW.
Maybe the comments could be trimmed somewhat.

Regards,
Christian

Next message: Ojaswin Mujoo: "Re: [RFC 3/3] xfs_io: add extsize command support"
Previous message: Quentin Perret: "Re: [PATCH v2 17/18] KVM: arm64: Introduce the EL1 pKVM MMU"
In reply to: Sultan Alsawaf (unemployed): "Re: [PATCH 1/2] cpufreq: schedutil: Fix superfluous updates caused by need_freq_update"
Next in thread: Sultan Alsawaf (unemployed): "Re: [PATCH 2/2] cpufreq: schedutil: Ignore rate limit when scaling up with FIE present"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]