Re: [PATCH v7] sched: Consolidate cpufreq updates

From: Christian Loehle
Date: Fri Oct 11 2024 - 05:34:41 EST


On 7/28/24 19:45, Qais Yousef wrote:
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
>
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
>
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
>
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
>
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
>
> 1. context_switch()
> 2. task_tick_fair()
> 3. sched_balance_update_blocked_averages()
> 4. on sched_setscheduler() syscall that changes policy or uclamp values
> 5. on check_preempt_wakeup_fair() if wakeup preemption failed
> 6. on __add_running_bw() to guarantee DL bandwidth requirements.
>
> The update at context switch should help guarantee that RT get the right
> frequency straightaway when they're RUNNING. As mentioned though the
> update will happen slightly after enqueue_task(); though in an ideal
> world these tasks should be RUNNING ASAP and this additional delay
> should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
>
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.
>
> Similarly for iowait boost, we ignore rate limits. We also handle a case
> of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> to reduce the boost after 1ms which seems iowait boost mechanism relied
> on rate_limit_us and cfs_rq.decayed preventing any updates to happen
> soon after iowait boost.
>
> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> time stamps otherwise we can end up delaying updates for normal
> requests.
>
> As a simple optimization, we avoid sending cpufreq updates when
> switching from RT to another RT as RT tasks run at max freq by default.
> If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
> uclamp_min is different to avoid unnecessary cpufreq update as most RT
> tasks are likely to be running at the same performance level, so we can
> avoid unnecessary overhead of forced updates when there's nothing to do.
>
> We also ensure to ignore cpufreq udpates for sugov workers at context
> switch if it was prev task.
>
> The update at task_tick_fair() will guarantee that the governor will
> follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> to the rq. Since DL and RT always run at constant frequencies and have
> no load tracking, this is only required for fair tasks.
>
> The update at update_blocked_averages() will ensure we decay frequency
> as the CPU becomes idle for long enough.
>
> If the currently running task changes its policy or uclamp values, we
> ensure we follow up with cpufreq update to ensure we follow up with any
> potential new perf requirements based on the new change.
>
> To handle systems with long TICK where tasks could end up enqueued but
> no preemption happens until TICK, we add an update in
> check_preempt_wakeup_fair() if wake up preemption fails. This will send
> special SCHED_CPUFREQ_TASK_ENQUEUED cpufreq update to tell the governor
> that the state of the CPU has changed and it can consider an update if
> it deems worthwhile. In schedutil this will do an update if no update
> was done since 1ms which is how often util_avg changes roughly.
>
> To ensure DL tasks bandwidth are respected, we do the update on
> __add_running_bw() instead of context switch as the delay could result
> in missing a deadline when multiple DL tasks are RUNNING.
>
> Since now DL tasks always ignore rate limit, remove
> ignore_dl_rate_limit() function as it's no longer necessary.
>
> Also move updating sg_cpu->last_update inside sugov_iowait_boost() where
> this variable is associated and rename it to last_iowait_update to
> better reflect it is iowait boost specific.
>
> Note worthy that we still have the following race condition on systems
> that have shared policy:
>
> * CPUs with shared policy can end up sending simultaneous cpufreq
> updates requests where the 2nd one will be unlucky and get blocked by
> the rate_limit_us (schedutil).
>
> We can potentially address this limitation later, but it is out of the
> scope of this patch.
>
> Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
> ---
>
> Changes since v6:
>
> * Fix typos in commit message
> * Move DL to enqueue to address breaking bandwidth rules for DL
> * Do freq updates for SCHED_IDLE too
> * Ensure wakeup preemption will cause cpufreq updates even if
> cfs_rq.decayed was false as util_est could be high and cfs_rq.decayed
> wouldn't reflect that.
> * Ensure we send an update if we switch to fair from RT or DL as this
> is an opportunity to reduce freq even if cfs_rq.decayed is false.
> * If sched_setsched() syscall for a queued task requires cpufreq
> update, handle it like we do for wakeup_preemption_check()
> * Use 1ms instead of base_slice to send an update if wakeup preemption
> fails
> * Fix a bug in setting sg_cpu->last_update being updated too early
> causing some systems to always request 1024 io boost.
> * Change delta_ns <= NSEC_PER_MSEC to be strictly less than
> delta_ns < NSEC_PER_MSEC for iowait boost to match the condition for
> when a task was enqueued.
> * Moved the results of context switch test out of the commit messages
> as I am seeing some variations that I am not sure are due to binary
> differences causing weird caching effect or true overhead
>
> Results of
>
> taskset 1 perf record perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
>
> on AMD 3900X to verify any potential overhead because of the addition at
> context switch against sched-core-2024-07-16 tip/sched/core
>
> tip sched-core-2024-07-16 schedutil:
> ------------------------------------
>
> Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs):
>
> 39,296,424,438 cycles # 3.208 GHz ( +- 0.05% )
> 20,350,055,343 instructions # 0.52 insn per cycle ( +- 0.03% )
> 12,274.17 msec task-clock # 1.002 CPUs utilized ( +- 0.06% )
>
> 12.24917 +- 0.00783 seconds time elapsed ( +- 0.06% )
>
> tip sched-core-2024-07-16 performance:
> --------------------------------------
>
> Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs):
>
> 40,610,243,585 cycles # 3.268 GHz ( +- 0.15% )
> 21,252,175,791 instructions # 0.53 insn per cycle ( +- 0.05% )
> 12,443.34 msec task-clock # 1.001 CPUs utilized ( +- 0.06% )
>
> 12.42761 +- 0.00672 seconds time elapsed ( +- 0.05% )
>
> patch: tip sched-core-2024-07-16 schedutil:
> -------------------------------------------
>
> Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs):
>
> 40,706,113,323 cycles # 3.253 GHz ( +- 0.07% )
> 21,163,304,319 instructions # 0.52 insn per cycle ( +- 0.04% )
> 12,494.93 msec task-clock # 0.998 CPUs utilized ( +- 0.04% )
>
> 12.51557 +- 0.00486 seconds time elapsed ( +- 0.04% )
>
> patch: tip sched-core-2024-07-16 performance:
> ---------------------------------------------
>
> Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs):
>
> 39,654,998,545 cycles # 3.220 GHz ( +- 0.12% )
> 20,554,376,621 instructions # 0.52 insn per cycle ( +- 0.12% )
> 12,317.02 msec task-clock # 1.000 CPUs utilized ( +- 0.16% )
>
> 12.3166 +- 0.0193 seconds time elapsed ( +- 0.16% )
>
> We do better in performance governor than tip/sched/core. But schedutil looks
> worse. Looking at perf diff I can see update_load_avg() and
> sugov_update_single_freq() but not sure if this is due to this patch per se
> rather than strange binary difference creating unexpected effect. The hot
> instructions in update_load_avg() are not related to the new code added there.
> Similarly for check_preempt_wakeup_fair().
>
> For sugov_update_single_freq() this hasn't shown up in previous versions.
> Removing the new cpufreq update in check_preempt_wakeup_fair() didn't help.
>
> Note that in v6 same test showed that schedutil was on par but performance was
> slightly worse. Though the test was against 6.8.7 stable kernel then.
>
> perf diff schedutil:
> --------------------
>
> 10.56% -2.56% [kernel.kallsyms] [k] delay_halt_mwaitx
> 14.56% -1.46% [kernel.kallsyms] [k] native_read_msr
> 14.19% -1.40% [kernel.kallsyms] [k] native_write_msr
> 0.63% +0.54% [kernel.kallsyms] [k] restore_fpregs_from_fpstate
> 1.52% +0.52% [kernel.kallsyms] [k] update_load_avg
> 0.01% +0.47% [kernel.kallsyms] [k] sugov_update_single_freq
> 3.44% -0.35% [kernel.kallsyms] [k] amd_pmu_addr_offset
> 4.67% -0.31% [kernel.kallsyms] [k] x86_pmu_disable_all
> 0.35% +0.29% [kernel.kallsyms] [k] check_preempt_wakeup_fair
> 1.81% -0.28% [kernel.kallsyms] [k] amd_pmu_check_overflow
> 1.81% -0.27% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit
> 1.20% +0.26% [kernel.kallsyms] [k] pick_next_task_fair
> 0.01% +0.22% [kernel.kallsyms] [k] __get_user_8
> 1.41% +0.21% [kernel.kallsyms] [k] update_curr
> 1.18% -0.21% [kernel.kallsyms] [k] delay_halt
> 0.50% +0.21% [kernel.kallsyms] [k] pick_eevdf
> 3.13% +0.20% [kernel.kallsyms] [k] srso_safe_ret
> 0.00% +0.18% [kernel.kallsyms] [k] sugov_get_util
> 1.23% +0.17% [kernel.kallsyms] [k] __schedule
> 0.50% +0.16% [kernel.kallsyms] [k] enqueue_entity
> 0.57% +0.16% [kernel.kallsyms] [k] psi_task_change
> 0.57% +0.15% [kernel.kallsyms] [k] enqueue_task_fair
> 1.06% -0.15% [kernel.kallsyms] [k] apparmor_file_permission
> 0.80% +0.15% [kernel.kallsyms] [k] try_to_wake_up
> 1.07% +0.14% [kernel.kallsyms] [k] psi_task_switch
> 1.58% +0.14% [kernel.kallsyms] [k] pipe_write
> 0.86% +0.14% [kernel.kallsyms] [k] syscall_exit_to_user_mode
> 1.02% +0.13% [kernel.kallsyms] [k] native_sched_clock
> 0.46% +0.11% [kernel.kallsyms] [k] __update_load_avg_se
>
> perf diff performance:
> ----------------------
>
> 13.09% +3.06% [kernel.kallsyms] [k] native_read_msr
> 13.12% +2.84% [kernel.kallsyms] [k] native_write_msr
> 7.94% +2.34% [kernel.kallsyms] [k] delay_halt_mwaitx
> 2.15% -0.93% [kernel.kallsyms] [k] update_curr
> 4.42% +0.87% [kernel.kallsyms] [k] x86_pmu_disable_all
> 3.12% +0.74% [kernel.kallsyms] [k] amd_pmu_addr_offset
> 2.84% -0.59% [kernel.kallsyms] [k] psi_group_change
> 1.44% +0.53% [kernel.kallsyms] [k] amd_pmu_check_overflow
> 1.45% +0.50% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit
> 0.47% -0.47% [kernel.kallsyms] [k] __calc_delta.constprop.0
> 1.60% -0.40% [kernel.kallsyms] [k] pick_next_task_fair
> 1.97% -0.37% [kernel.kallsyms] [k] update_load_avg
> 0.57% -0.37% [kernel.kallsyms] [k] avg_vruntime
> 0.82% -0.37% [kernel.kallsyms] [k] enqueue_task_fair
> 1.54% -0.34% [kernel.kallsyms] [k] __schedule
> 0.79% -0.32% [kernel.kallsyms] [k] pick_eevdf
> 0.88% +0.32% [kernel.kallsyms] [k] delay_halt
> 0.59% -0.28% [kernel.kallsyms] [k] update_cfs_group
> 0.86% -0.25% [kernel.kallsyms] [k] try_to_wake_up
> 1.18% -0.25% [kernel.kallsyms] [k] native_sched_clock
> 0.45% -0.24% [kernel.kallsyms] [k] put_prev_entity
> 0.49% -0.24% [kernel.kallsyms] [k] ttwu_do_activate
> 0.64% -0.23% [kernel.kallsyms] [k] enqueue_entity
> 0.72% -0.22% [kernel.kallsyms] [k] __update_load_avg_cfs_rq
> 1.57% -0.22% [kernel.kallsyms] [k] pipe_write
> 0.50% -0.20% [kernel.kallsyms] [k] update_min_vruntime
> 3.31% -0.19% [kernel.kallsyms] [k] srso_safe_ret
> 1.31% -0.18% [kernel.kallsyms] [k] psi_task_switch
> 0.52% -0.18% [kernel.kallsyms] [k] check_preempt_wakeup_fair
> 0.32% -0.16% [kernel.kallsyms] [k] __enqueue_entity
> 0.87% -0.16% [kernel.kallsyms] [k] dequeue_task_fair
> 0.44% -0.14% [kernel.kallsyms] [k] pick_next_entity
> 0.63% -0.13% [kernel.kallsyms] [k] psi_task_change
> 0.62% -0.13% [kernel.kallsyms] [k] sched_clock_cpu
>
> Changes since v5:
>
> * Fix a bug where switching between RT and sugov tasks triggered an
> endless cycle of cpufreq updates.
> * Only do cpufreq updates at tick for fair after verifying
> rq->cfs.decayed
> * Remove optimization in update_load_avg() to avoid sending an update
> if util hasn't changed that caused a bug when switching from Idle
> * Handle systems with long ticks by adding extra update on
> check_preempt_wakeup_fair(). The idea is to rely on context switch
> but still consider an update if wakeup preemption failed and no
> update was sent since sysctl_sched_base_slice
> * Remove ignore_dl_rate_limit() as this function is now redundant
> * move sg_cpu->last_update = time inside sugov_iowait_boost()
> * Update commit message with new details and with perf diff output
>
> Changes since v4:
>
> * Fix updating freq when uclamp changes before the dequeue/enqueue
> dance. (Hongyan)
> * Rebased on top of tip/sched/core 6.10-rc1 and resolve some conflicts
> due to code shuffling to syscalls.c. Added new function
> update_cpufreq_current() to be used outside core.c when
> task_current() requires cpufreq update.
>
> Changes since v3:
>
> * Omit cpufreq updates at attach/detach_entity_load_avg(). They share
> the update path from enqueue/dequeue which is not intended to trigger
> an update. And task_change_group_fair() is not expected to cause the
> root cfs_rq util to change significantly to warrant an immediate
> update for enqueued tasks. Better defer for next context switch to
> sample the state of the cpu taking all changes into account before
> the next task is due to run.
> Dietmar also pointed out a bug where we could send more updates vs
> without the patch in this path as I wasn't sending the update for
> cfs_rq == &rq->cfs.
>
> Changes since v2:
>
> * Clean up update_cpufreq_ctx_switch() to reduce branches (Peter)
> * Fix issue with cpufreq updates missed on switching from idle (Vincent)
> * perf bench sched pipe regressed after fixing the switch from idle,
> detect when util_avg has changed when cfs_rq->decayed to fix it
> * Ensure to issue cpufreq updates when task_current() switches
> policy/uclamp values
>
> Changes since v1:
>
> * Use taskset and measure with performance governor as Ingo suggested
> * Remove the static key as I found out we always register a function
> for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
> trigger a lock debug warning.
> * Improve detection of sugov workers by using SCHED_FLAG_SUGOV
> * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
> reducing iowait boost as the latter was a NOP and like
> sugov_iowait_reset() like Christian pointed out.
>
> v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@xxxxxxxxxxx/
> v2 discussion: https://lore.kernel.org/lkml/20240505233103.168766-1-qyousef@xxxxxxxxxxx/
> v3 discussion: https://lore.kernel.org/lkml/20240512190018.531820-1-qyousef@xxxxxxxxxxx/
> v4 discussion: https://lore.kernel.org/lkml/20240516204802.846520-1-qyousef@xxxxxxxxxxx/
> v5 discussion: https://lore.kernel.org/lkml/20240530104653.1234004-1-qyousef@xxxxxxxxxxx/
> v6 discussion: https://lore.kernel.org/lkml/20240619201409.2071728-1-qyousef@xxxxxxxxxxx/
>
> include/linux/sched/cpufreq.h | 4 +-
> kernel/sched/core.c | 116 +++++++++++++++++++++++++++--
> kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++------------
> kernel/sched/deadline.c | 10 ++-
> kernel/sched/fair.c | 91 +++++++++++------------
> kernel/sched/rt.c | 8 +-
> kernel/sched/sched.h | 9 ++-
> kernel/sched/syscalls.c | 30 ++++++--
> 8 files changed, 271 insertions(+), 119 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..5409a9f79cc0 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,7 +8,9 @@
> * Interface between cpufreq drivers and the scheduler:
> */
>
> -#define SCHED_CPUFREQ_IOWAIT (1U << 0)
> +#define SCHED_CPUFREQ_IOWAIT (1U << 0)
> +#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */
> +#define SCHED_CPUFREQ_TASK_ENQUEUED (1U << 2) /* new fair task was enqueued */
>
> #ifdef CONFIG_CPU_FREQ
> struct cpufreq_policy;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..a31d91a224d0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -153,6 +153,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>
> __read_mostly int scheduler_running;
>
> +static __always_inline void
> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev);
> +
> #ifdef CONFIG_SCHED_CORE
>
> DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> @@ -2038,17 +2041,24 @@ inline int task_curr(const struct task_struct *p)
> * this means any call to check_class_changed() must be followed by a call to
> * balance_callback().
> */
> -void check_class_changed(struct rq *rq, struct task_struct *p,
> +bool check_class_changed(struct rq *rq, struct task_struct *p,
> const struct sched_class *prev_class,
> int oldprio)
> {
> + bool class_changed = false;
> +
> if (prev_class != p->sched_class) {
> if (prev_class->switched_from)
> prev_class->switched_from(rq, p);
>
> p->sched_class->switched_to(rq, p);
> - } else if (oldprio != p->prio || dl_task(p))
> +
> + class_changed = true;
> + } else if (oldprio != p->prio || dl_task(p)) {
> p->sched_class->prio_changed(rq, p, oldprio);
> + }
> +
> + return class_changed;
> }
>
> void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
> @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq)
>
> #endif
>
> +static __always_inline void
> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> +{
> +#ifdef CONFIG_CPU_FREQ
> + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
> + /* Sugov just did an update, don't be too aggressive */
> + return;
> + }
> +
> + /*
> + * RT and DL should always send a freq update. But we can do some
> + * simple checks to avoid it when we know it's not necessary.
> + *
> + * iowait_boost will always trigger a freq update too.
> + *
> + * Fair tasks will only trigger an update if the root cfs_rq has
> + * decayed.
> + *
> + * Everything else should do nothing.
> + */
> + switch (current->policy) {
> + case SCHED_NORMAL:
> + case SCHED_BATCH:
> + case SCHED_IDLE:
> + if (unlikely(current->in_iowait)) {
> + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
> + return;
> + }
> +
> +#ifdef CONFIG_SMP
> + /*
> + * Send an update if we switched from RT or DL as they tend to
> + * boost the CPU and we are likely able to reduce the freq now.
> + */
> + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy));
> +
> + if (unlikely(rq->cfs.decayed)) {
> + rq->cfs.decayed = false;
> + cpufreq_update_util(rq, 0);
> + return;
> + }
> +#else
> + cpufreq_update_util(rq, 0);
> +#endif
> + return;
> + case SCHED_FIFO:
> + case SCHED_RR:
> + if (prev && rt_policy(prev->policy)) {
> +#ifdef CONFIG_UCLAMP_TASK
> + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> +
> + if (curr_uclamp_min == prev_uclamp_min)
> +#endif
> + return;
> + }
> +#ifdef CONFIG_SMP
> + /* Stopper task masquerades as RT */
> + if (unlikely(current->sched_class == &stop_sched_class))
> + return;
> +#endif
> + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE);
> + return;
> + case SCHED_DEADLINE:
> + /*
> + * This is handled at enqueue to avoid breaking DL bandwidth
> + * rules when multiple DL tasks are running on the same CPU.
> + * Deferring till context switch here could mean the bandwidth
> + * calculations would be broken to ensure all the DL tasks meet
> + * their deadlines.
> + */
> + return;
> + default:
> + return;
> + }
> +#endif
> +}
> +
> +/*
> + * Call when currently running task had an attribute change that requires
> + * an immediate cpufreq update.
> + */
> +void update_cpufreq_current(struct rq *rq)
> +{
> + __update_cpufreq_ctx_switch(rq, NULL);
> +}
> +
> static inline void
> prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> {
> @@ -4930,7 +5027,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
> #endif
> }
>
> -static inline void finish_lock_switch(struct rq *rq)
> +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> {
> /*
> * If we are tracking spinlock dependencies then we have to
> @@ -4939,6 +5036,11 @@ static inline void finish_lock_switch(struct rq *rq)
> */
> spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
> __balance_callbacks(rq);
> + /*
> + * Request freq update after __balance_callbacks to take into account
> + * any changes to rq.
> + */
> + __update_cpufreq_ctx_switch(rq, prev);
> raw_spin_rq_unlock_irq(rq);
> }
>
> @@ -5057,7 +5159,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> perf_event_task_sched_in(prev, current);
> finish_task(prev);
> tick_nohz_task_switch();
> - finish_lock_switch(rq);
> + finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> kcov_finish_switch(current);
> /*
> @@ -6920,6 +7022,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> int prio, oldprio, queued, running, queue_flag =
> DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> const struct sched_class *prev_class;
> + bool class_changed;
> struct rq_flags rf;
> struct rq *rq;
>
> @@ -7021,7 +7124,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> if (running)
> set_next_task(rq, p);
>
> - check_class_changed(rq, p, prev_class, oldprio);
> + class_changed = check_class_changed(rq, p, prev_class, oldprio);
> + if (class_changed && running)
> + update_cpufreq_current(rq);
> +
> out_unlock:
> /* Avoid rq from going away on us: */
> preempt_disable();
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..64f614b3db20 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -44,7 +44,7 @@ struct sugov_cpu {
>
> bool iowait_boost_pending;
> unsigned int iowait_boost;
> - u64 last_update;
> + u64 last_iowait_update;
>
> unsigned long util;
> unsigned long bw_min;
> @@ -59,10 +59,31 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
> /************************ Governor internals ***********************/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> + unsigned int flags)
> {
> s64 delta_ns;
>
> + delta_ns = time - sg_policy->last_freq_update_time;
> +
> + /*
> + * We want to update cpufreq at context switch, but on systems with
> + * long TICK values, this can happen after a long time while more tasks
> + * would have been added meanwhile leaving us potentially running at
> + * inadequate frequency for extended period of time.
> + *
> + * This logic should only apply when new fair task was added to the
> + * CPU, we'd want to defer to context switch as much as possible, but
> + * to avoid the potential delays mentioned above, let's check if this
> + * additional tasks warrants sending an update sooner.
> + *
> + * We want to ensure there's at least an update every 1ms.
> + */
> + if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) {
> + if (delta_ns < NSEC_PER_MSEC)
> + return false;
> + }
> +
> /*
> * Since cpufreq_update_util() is called with rq->lock held for
> * the @target_cpu, our per-CPU data is fully serialized.
> @@ -87,13 +108,14 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> return true;
> }
>
> - delta_ns = time - sg_policy->last_freq_update_time;
> + if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> + return true;
>
> return delta_ns >= sg_policy->freq_update_delay_ns;
> }
>
> static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> - unsigned int next_freq)
> + unsigned int next_freq, unsigned int flags)
> {
> if (sg_policy->need_freq_update)
> sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> @@ -101,7 +123,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> return false;
>
> sg_policy->next_freq = next_freq;
> - sg_policy->last_freq_update_time = time;
> +
> + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> + sg_policy->last_freq_update_time = time;
>
> return true;
> }
> @@ -219,7 +243,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
> bool set_iowait_boost)
> {
> - s64 delta_ns = time - sg_cpu->last_update;
> + s64 delta_ns = time - sg_cpu->last_iowait_update;
>
> /* Reset boost only if a tick has elapsed since last request */
> if (delta_ns <= TICK_NSEC)
> @@ -249,30 +273,33 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> unsigned int flags)
> {
> bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
>
> /* Reset boost if the CPU appears to have been idle enough */
> - if (sg_cpu->iowait_boost &&
> + if (sg_cpu->iowait_boost && !forced_update &&
> sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
> - return;
> + goto done;
>
> /* Boost only tasks waking up after IO */
> if (!set_iowait_boost)
> - return;
> + goto done;
>
> /* Ensure boost doubles only one time at each request */
> if (sg_cpu->iowait_boost_pending)
> - return;
> + goto done;
> sg_cpu->iowait_boost_pending = true;
>
> /* Double the boost at each request */
> if (sg_cpu->iowait_boost) {
> sg_cpu->iowait_boost =
> min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE);
> - return;
> + goto done;
> }
>
> /* First wakeup after IO: start with minimum boost */
> sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
> +done:
> + sg_cpu->last_iowait_update = time;
> }
>
> /**
> @@ -294,17 +321,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> * being more conservative on tasks which does sporadic IO operations.
> */
> static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> - unsigned long max_cap)
> + unsigned long max_cap, unsigned int flags)
> {
> + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> + s64 delta_ns = time - sg_cpu->last_iowait_update;
> +
> /* No boost currently required */
> if (!sg_cpu->iowait_boost)
> return 0;
>
> + if (forced_update)
> + goto apply_boost;
> +
> /* Reset boost if the CPU appears to have been idle enough */
> if (sugov_iowait_reset(sg_cpu, time, false))
> return 0;
>
> if (!sg_cpu->iowait_boost_pending) {
> + /*
> + * This logic relied on PELT signal decays happening once every
> + * 1ms. But due to changes to how updates are done now, we can
> + * end up with more request coming up leading to iowait boost
> + * to be prematurely reduced. Make the assumption explicit
> + * until we improve the iowait boost logic to be better in
> + * general as it is due for an overhaul.
> + */
> + if (delta_ns < NSEC_PER_MSEC)
> + goto apply_boost;
> +
> /*
> * No boost pending; reduce the boost value.
> */
> @@ -315,6 +359,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> }
> }
>
> +apply_boost:
> sg_cpu->iowait_boost_pending = false;
>
> /*
> @@ -337,31 +382,18 @@ static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> #endif /* CONFIG_NO_HZ_COMMON */
>
> -/*
> - * Make sugov_should_update_freq() ignore the rate limit when DL
> - * has increased the utilization.
> - */
> -static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
> -{
> - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
> - sg_cpu->sg_policy->limits_changed = true;
> -}
> -
> static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> u64 time, unsigned long max_cap,
> unsigned int flags)
> {
> unsigned long boost;
>
> - sugov_iowait_boost(sg_cpu, time, flags);
> - sg_cpu->last_update = time;
> -
> - ignore_dl_rate_limit(sg_cpu);
> -
> - if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> + if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
> return false;
>
> - boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> + sugov_iowait_boost(sg_cpu, time, flags);
> +
> + boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
> sugov_get_util(sg_cpu, boost);
>
> return true;
> @@ -397,7 +429,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> sg_policy->cached_raw_freq = cached_freq;
> }
>
> - if (!sugov_update_next_freq(sg_policy, time, next_f))
> + if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> return;
>
> /*
> @@ -449,10 +481,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> sg_cpu->util, max_cap);
>
> - sg_cpu->sg_policy->last_freq_update_time = time;
> + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> + sg_cpu->sg_policy->last_freq_update_time = time;
> }
>
> -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> + unsigned int flags)
> {
> struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> struct cpufreq_policy *policy = sg_policy->policy;
> @@ -465,7 +499,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> unsigned long boost;
>
> - boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> + boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
> sugov_get_util(j_sg_cpu, boost);
>
> util = max(j_sg_cpu->util, util);
> @@ -483,22 +517,20 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
> raw_spin_lock(&sg_policy->update_lock);
>
> - sugov_iowait_boost(sg_cpu, time, flags);
> - sg_cpu->last_update = time;
> + if (!sugov_should_update_freq(sg_policy, time, flags))
> + goto unlock;
>
> - ignore_dl_rate_limit(sg_cpu);
> + sugov_iowait_boost(sg_cpu, time, flags);
>
> - if (sugov_should_update_freq(sg_policy, time)) {
> - next_f = sugov_next_freq_shared(sg_cpu, time);
> + next_f = sugov_next_freq_shared(sg_cpu, time, flags);
>
> - if (!sugov_update_next_freq(sg_policy, time, next_f))
> - goto unlock;
> + if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> + goto unlock;
>
> - if (sg_policy->policy->fast_switch_enabled)
> - cpufreq_driver_fast_switch(sg_policy->policy, next_f);
> - else
> - sugov_deferred_update(sg_policy);
> - }
> + if (sg_policy->policy->fast_switch_enabled)
> + cpufreq_driver_fast_switch(sg_policy->policy, next_f);
> + else
> + sugov_deferred_update(sg_policy);
> unlock:
> raw_spin_unlock(&sg_policy->update_lock);
> }
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f59e5c19d944..8a4ccf532a7b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -251,8 +251,12 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> dl_rq->running_bw += dl_bw;
> SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> - /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> + /*
> + * Context switch handles updates, but this is an exception to ensure
> + * multiple DL tasks run at the correct frequencies. We don't need
> + * a cpufreq update on dequeue, context switch will handle that.
> + */
> + cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE);
> }
>
> static inline
> @@ -265,8 +269,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> if (dl_rq->running_bw > old)
> dl_rq->running_bw = 0;
> - /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> }
>
> static inline
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9057584ec06d..8fe7a7124c70 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3987,29 +3987,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> }
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> - struct rq *rq = rq_of(cfs_rq);
> -
> - if (&rq->cfs == cfs_rq) {
> - /*
> - * There are a few boundary cases this might miss but it should
> - * get called often enough that that should (hopefully) not be
> - * a real problem.
> - *
> - * It will not get called when we go idle, because the idle
> - * thread is a different class (!fair), nor will the utilization
> - * number include things like RT tasks.
> - *
> - * As is, the util number is not freq-invariant (we'd have to
> - * implement arch_scale_freq_capacity() for that).
> - *
> - * See cpu_util_cfs().
> - */
> - cpufreq_update_util(rq, flags);
> - }
> -}
> -
> #ifdef CONFIG_SMP
> static inline bool load_avg_is_decayed(struct sched_avg *sa)
> {
> @@ -4687,8 +4664,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, 0);
> -
> trace_pelt_cfs_tp(cfs_rq);
> }
>
> @@ -4717,8 +4692,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, 0);
> -
> trace_pelt_cfs_tp(cfs_rq);
> }
>
> @@ -4765,12 +4738,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> */
> detach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq);
> - } else if (decayed) {
> - cfs_rq_util_change(cfs_rq, 0);
> -
> - if (flags & UPDATE_TG)
> - update_tg_load_avg(cfs_rq);
> + } else if (decayed && (flags & UPDATE_TG)) {
> + update_tg_load_avg(cfs_rq);
> }
> +
> + /*
> + * If this is the root cfs_rq, set the decayed flag to let the world
> + * know a cpufreq update is required.
> + */
> + if (cfs_rq == &rq_of(cfs_rq)->cfs)
> + cfs_rq->decayed |= decayed;
> }
>
> /*
> @@ -5144,7 +5121,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> {
> - cfs_rq_util_change(cfs_rq, 0);
> }
>
> static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -6759,14 +6735,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> - /*
> - * If in_iowait is set, the code below may not trigger any cpufreq
> - * utilization updates, so do it here explicitly with the IOWAIT flag
> - * passed.
> - */
> - if (p->in_iowait)
> - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
> for_each_sched_entity(se) {
> if (se->on_rq)
> break;
> @@ -8353,7 +8321,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> int cse_is_idle, pse_is_idle;
>
> if (unlikely(se == pse))
> - return;
> + goto nopreempt;
>
> /*
> * This is possible from callers such as attach_tasks(), in which we
> @@ -8362,7 +8330,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * next-buddy nomination below.
> */
> if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
> - return;
> + goto nopreempt;
>
> if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
> set_next_buddy(pse);
> @@ -8379,7 +8347,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * below.
> */
> if (test_tsk_need_resched(curr))
> - return;
> + goto nopreempt;
>
> /* Idle tasks are by definition preempted by non-idle tasks. */
> if (unlikely(task_has_idle_policy(curr)) &&
> @@ -8391,7 +8359,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * is driven by the tick):
> */
> if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> - return;
> + goto nopreempt;
>
> find_matching_se(&se, &pse);
> WARN_ON_ONCE(!pse);
> @@ -8406,7 +8374,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (cse_is_idle && !pse_is_idle)
> goto preempt;
> if (cse_is_idle != pse_is_idle)
> - return;
> + goto nopreempt;
>
> cfs_rq = cfs_rq_of(se);
> update_curr(cfs_rq);
> @@ -8417,6 +8385,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (pick_eevdf(cfs_rq) == pse)
> goto preempt;
>
> +nopreempt:
> + if (rq->cfs.h_nr_running > 1) {
> +#ifdef CONFIG_SMP
> + /*
> + * When a task is added, its util_est could be high but the
> + * enqueue might not have caused rq->cfs.decayed to be updated
> + * as it is small after a long sleep. So set it to ensure next
> + * context switch will definitely trigger an update after the
> + * new enqueue.
> + *
> + * TODO: we need to make cpufreq_update_util() return true if
> + * the operation was successful or false if it failed and use
> + * that to reset rq->cfs.decayed.
> + */
> + rq->cfs.decayed = true;
> +#endif
> + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);
> + }
> return;
>
> preempt:
> @@ -9352,10 +9338,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> unsigned long hw_pressure;
> bool decayed;
>
> - /*
> - * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> - * DL and IRQ signals have been updated before updating CFS.
> - */
> curr_class = rq->curr->sched_class;
>
> hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> @@ -12692,6 +12674,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> update_misfit_status(curr, rq);
> check_update_overutilized_status(task_rq(curr));
>
> +#ifdef CONFIG_SMP
> + if (rq->cfs.decayed) {
> + rq->cfs.decayed = false;
> + cpufreq_update_util(rq, 0);
> + }
> +#else
> + cpufreq_update_util(rq, 0);
> +#endif
> +
> task_tick_core(rq, curr);
> }
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 63e49c8ffc4d..92ed373e5b90 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -555,11 +555,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
>
> rt_se = rt_rq->tg->rt_se[cpu];
>
> - if (!rt_se) {
> + if (!rt_se)
> dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> - cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> - }
> else if (on_rt_rq(rt_se))
> dequeue_rt_entity(rt_se, 0);
> }
> @@ -1064,9 +1061,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
> add_nr_running(rq, rt_rq->rt_nr_running);
> rt_rq->rt_queued = 1;
> }
> -
> - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> - cpufreq_update_util(rq, 0);
> }
>
> #if defined CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4c36cc680361..1fc9339dd5c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -639,6 +639,11 @@ struct cfs_rq {
> unsigned long runnable_avg;
> } removed;
>
> + /*
> + * Store whether last update_load_avg() has decayed
> + */
> + bool decayed;
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> u64 last_update_tg_load_avg;
> unsigned long tg_load_avg_contrib;
> @@ -3609,10 +3614,12 @@ extern void set_load_weight(struct task_struct *p, bool update_load);
> extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
> extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags);
>
> -extern void check_class_changed(struct rq *rq, struct task_struct *p,
> +extern bool check_class_changed(struct rq *rq, struct task_struct *p,
> const struct sched_class *prev_class,
> int oldprio);
>
> +extern void update_cpufreq_current(struct rq *rq);
> +
> #ifdef CONFIG_SMP
> extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
> extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index ae1b42775ef9..c24769cf1a4f 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -491,7 +491,7 @@ static bool uclamp_reset(const struct sched_attr *attr,
> return false;
> }
>
> -static void __setscheduler_uclamp(struct task_struct *p,
> +static bool __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> enum uclamp_id clamp_id;
> @@ -517,7 +517,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
> }
>
> if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> - return;
> + return false;
>
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
> attr->sched_util_min != -1) {
> @@ -530,6 +530,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
> uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> attr->sched_util_max, true);
> }
> +
> + return true;
> }
>
> #else /* !CONFIG_UCLAMP_TASK: */
> @@ -539,8 +541,11 @@ static inline int uclamp_validate(struct task_struct *p,
> {
> return -EOPNOTSUPP;
> }
> -static void __setscheduler_uclamp(struct task_struct *p,
> - const struct sched_attr *attr) { }
> +static bool __setscheduler_uclamp(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + return false;
> +}
> #endif
>
> /*
> @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p,
> int retval, oldprio, newprio, queued, running;
> const struct sched_class *prev_class;
> struct balance_callback *head;
> + bool update_cpufreq;
> struct rq_flags rf;
> int reset_on_fork;
> int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p,
> __setscheduler_params(p, attr);
> __setscheduler_prio(p, newprio);
> }
> - __setscheduler_uclamp(p, attr);
> +
> + update_cpufreq = __setscheduler_uclamp(p, attr);
>
> if (queued) {
> /*
> @@ -811,7 +818,18 @@ int __sched_setscheduler(struct task_struct *p,
> if (running)
> set_next_task(rq, p);
>
> - check_class_changed(rq, p, prev_class, oldprio);
> + update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio);
> +
> + /*
> + * Changing class or uclamp value implies requiring to send cpufreq
> + * update.
> + */
> + if (update_cpufreq) {
> + if (running)
> + update_cpufreq_current(rq);
> + else if (queued)
> + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);
> + }

cpufreq_update_util() -> sugov_should_update_freq() -> cpufreq_this_cpu_can_update()
relies on smp_processor_id(), should this move this below the
preempt_disable() to avoid sending an update from an illegal CPU?

>
> /* Avoid rq from going away on us: */
> preempt_disable();