Re: [PATCH v5] sched: Consolidate cpufreq updates

From: Christian Loehle
Date: Wed Jun 05 2024 - 09:09:27 EST


On 5/30/24 11:46, Qais Yousef wrote:
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
>
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
>
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
>
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
>
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
>
> 1. context_switch()
> 2. task_tick_fair()
> 3. update_blocked_averages()
> 4. on syscall that changes policy or uclamp values
>
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible.

Do we care at all about PREEMPT_NONE (and PREEMPT_VOLUNTARY) here? I assume not.
Anyway, here is one scenario that should regress when we don't update at RT
enqueue. (Essentially, the util of the higher-prio task dominates over the
lower-prio one if the higher-prio task is enqueued first.)
System:
OPP0: cap 102, 100MHz; OPP1: cap 1024, 1000MHz
RT task A: prio=0, runtime@OPP1=1ms, uclamp_min=0
RT task B: prio=1, runtime@OPP1=1ms, uclamp_min=1024
rate_limit_us = freq transition delay = 1 (assume a basically instant switch)
Let's say CONFIG_HZ=100 so the tick doesn't get in the way; it doesn't really matter.

Before:
t+0: Enqueue task A, switch to OPP0
Running A at OPP0
t+2us: Enqueue task B, switch to OPP1
t+1000us: Task A done, switch to task B
t+2000us: Task B done

Now:
t+0: Enqueue task A, switch to OPP0
Running A at OPP0
t+2us: Enqueue task B (no frequency update)
t+10000us: Task A done, switch to task B and OPP1
t+11000us: Task B done
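
For what it's worth, here is a rough sketch of the arithmetic behind the two
timelines above. It is not kernel code: it assumes execution speed scales
linearly with frequency (100MHz vs 1000MHz, so 10x), that the frequency switch
is instantaneous, and that the enqueue-time request for B is honored at t+2us.
The function names (finish_time, before_patch, after_patch) are made up for
illustration.

```python
# Frequencies from the example system, in MHz.
OPP0_MHZ = 100
OPP1_MHZ = 1000

TASK_WORK_US = 1000   # runtime of each task when running at OPP1
B_ENQUEUE_US = 2      # task B is enqueued 2us after task A


def finish_time(start_us, work_us, freq_mhz, switch_at_us=None, new_freq_mhz=None):
    """Completion time of work_us (runtime at OPP1) starting at start_us.

    Assumption: speed scales linearly with frequency, and at most one
    instantaneous frequency switch happens while the task runs.
    """
    speed = freq_mhz / OPP1_MHZ
    if switch_at_us is not None and switch_at_us > start_us:
        done = (switch_at_us - start_us) * speed
        if done < work_us:
            new_speed = new_freq_mhz / OPP1_MHZ
            return switch_at_us + (work_us - done) / new_speed
    return start_us + work_us / speed


def before_patch():
    # Update at RT enqueue: B's uclamp_min=1024 raises the CPU to OPP1 at t+2us.
    a_done = finish_time(0, TASK_WORK_US, OPP0_MHZ, B_ENQUEUE_US, OPP1_MHZ)
    b_done = finish_time(a_done, TASK_WORK_US, OPP1_MHZ)
    return a_done, b_done


def after_patch():
    # Update only at context switch: A runs to completion at OPP0 first.
    a_done = finish_time(0, TASK_WORK_US, OPP0_MHZ)
    b_done = finish_time(a_done, TASK_WORK_US, OPP1_MHZ)
    return a_done, b_done


if __name__ == "__main__":
    print("before:", before_patch())
    print("after: ", after_patch())
```

With these assumptions, B finishes at roughly t+2000us with the enqueue update
and at roughly t+11000us without it, which matches the rounded numbers above.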

Or am I missing something?

Kind Regards,
Christian

> [snip]