Re: [PATCH v6] sched: Consolidate cpufreq updates
From: Dietmar Eggemann
Date: Thu Jul 04 2024 - 06:13:10 EST
On 28/06/2024 03:52, Qais Yousef wrote:
> On 06/25/24 14:58, Dietmar Eggemann wrote:
>
>>> @@ -4917,6 +4927,84 @@ static inline void __balance_callbacks(struct rq *rq)
>>>
>>> #endif
>>>
>>> +static __always_inline void
>>> +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
>>> +{
>>> +#ifdef CONFIG_CPU_FREQ
>>> + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) {
>>> + /* Sugov just did an update, don't be too aggressive */
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * RT and DL should always send a freq update. But we can do some
>>> + * simple checks to avoid it when we know it's not necessary.
>>> + *
>>> + * iowait_boost will always trigger a freq update too.
>>> + *
>>> + * Fair tasks will only trigger an update if the root cfs_rq has
>>> + * decayed.
>>> + *
>>> + * Everything else should do nothing.
>>> + */
>>> + switch (current->policy) {
>>> + case SCHED_NORMAL:
>>> + case SCHED_BATCH:
>>
>> What about SCHED_IDLE tasks?
>
> I didn't think they matter from cpufreq perspective. These tasks will just run
> at whatever the idle system is happen to be at and have no specific perf
> requirement since they should only run when the system is idle which a recipe
> for starvation anyway?
Not sure we talk about the same thing here? idle_sched_class vs.
SCHED_IDLE policy (FAIR task with a tiny weight of WEIGHT_IDLEPRIO).
>>> + if (unlikely(current->in_iowait)) {
>>> + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE);
>>> + return;
>>> + }
>>> +
>>> +#ifdef CONFIG_SMP
>>> + if (unlikely(rq->cfs.decayed)) {
>>> + rq->cfs.decayed = false;
>>> + cpufreq_update_util(rq, 0);
>>> + return;
>>> + }
>>> +#else
>>> + cpufreq_update_util(rq, 0);
>>> +#endif
>>
>> We can have !CONFIG_SMP and CONFIG_FAIR_GROUP_SCHED systems. Does this
>> mean on those systems we call cpufreq_update_util() for each cfs_rq of
>> the hierarchy where on CONFIG_SMP we only do this for the root cfs_rq?
>
> No. This is called on context switch only and hierarchy doesn't matter here. We
> just do it unconditionally for UP since we only track the decayed at cfs_rq
> level and I didn't think it's worth trying to make it at rq level.
OK, I see. The call in __update_cpufreq_ctx_switch() plus
(task_tick_fair() and check_preempt_wakeup_fair()) are not related to a
cfs_rq, but rather to the rq and/or task directly.
Currently we have the thing in update_load_avg() for !CONFIG_SMP and
there we use cfs_rq_util_change() which only calls cpufreq_update_util()
for root cfs_rq but this clearly has a cfs_rq context.
>> [...]
>>
>>> @@ -4744,8 +4716,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>>> __update_load_avg_se(now, cfs_rq, se);
>>>
>>> - decayed = update_cfs_rq_load_avg(now, cfs_rq);
>>> - decayed |= propagate_entity_load_avg(se);
>>> + cfs_rq->decayed |= update_cfs_rq_load_avg(now, cfs_rq);
>>> + cfs_rq->decayed |= propagate_entity_load_avg(se);
>>>
>>> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>>>
>>> @@ -4766,11 +4738,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> */
>>> detach_entity_load_avg(cfs_rq, se);
>>> update_tg_load_avg(cfs_rq);
>>> - } else if (decayed) {
>>> - cfs_rq_util_change(cfs_rq, 0);
>>> -
>>> - if (flags & UPDATE_TG)
>>> - update_tg_load_avg(cfs_rq);
>>> + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
>>> + update_tg_load_avg(cfs_rq);
>>> }
>>> }
>>
>> You set cfs_rq->decayed for each taskgroup level but you only reset it
>> for the root cfs_rq in __update_cpufreq_ctx_switch() and task_tick_fair()?
>
> Yes. We only care about using it for root level. Tracking the information at
> cfs_rq level is the most natural way to do it as this is what update_load_avg()
> is acting on.
But IMHO this creates an issue with those non-root cfs_rq's within
update_load_avg() itself. They will stay decayed after cfs_rq->decayed
has been set to 1 once and will never be reset to 0. So with UPDATE_TG
update_tg_load_avg() will then always be called on those non-root
cfs_rq's all the time.
[...]