Re: [PATCH] sched/fair: Allow decaying util_est when util_avg > CPU capa
From: Dietmar Eggemann
Date: Mon Mar 31 2025 - 06:46:26 EST
On 27/03/2025 10:35, Pierre Gondois wrote:
>
>
> On 3/26/25 18:25, Vincent Guittot wrote:
>> On Tue, 25 Mar 2025 at 16:06, Pierre Gondois <pierre.gondois@xxxxxxx>
>> wrote:
>>>
>>> commit 10a35e6812aa ("sched/pelt: Skip updating util_est when
>>> utilization is higher than CPU's capacity")
>>> prevents util_est from being updated if util_avg is higher than the
>>> underlying CPU capacity to avoid overestimating the task when the CPU
>>> is capped (due to thermal issue for instance). In this scenario, the
>>> task will miss its deadlines and start overlapping its wake-up events
>>> for instance. The task will appear as always running when the CPU is
>>> just not powerful enough to allow having a good estimation of the
>>> task.
This one will be removed by your patch, right?
>>>
>>> commit b8c96361402a ("sched/fair/util_est: Implement faster ramp-up
>>> EWMA on utilization increases")
>>> sets ewma to util_avg when ewma > util_avg, allowing ewma to quickly
>>> grow instead of slowly converge to the new util_avg value when a task
>>> profile changes from small to big.
>>>
>>> However, the 2 conditions:
>>> - Check util_avg against max CPU capacity
I assume this is the condition you remove and
>>> - Check whether util_est > util_avg
this is:
4918 /*
4919 * Reset EWMA on utilization increases, the moving average is used
4920 * to smooth utilization decreases.
4921 */
4922 if (ewma <= dequeued) {
4923 ewma = dequeued;
4924 goto done;
4925 }
which is before the condition you remove?
So maybe explain those conditions and their order more carefully? So
it's easier to grasp.
>>> are placed in an order such as it is possible to set util_est to a
>>> value higher than the CPU capacity if util_est > util_avg, but
>>> util_est is prevented to decay as long as:
>>> CPU capacity < util_avg < util_est.
Maybe mentioning 'util_avg eq. dequeued' and 'util_est eq. ewma' would
help here for easier understanding.
>>> Just remove the check as either:
>>> 1.
>>> There is idle time on the CPU. In that case the util_avg value of the
>>> task is actually correct. It is possible that the task missed a
>>> deadline and appears bigger, but this is also the case when the
>>> util_avg of the task is lower than the maximum CPU capacity.
>>> 2.
>>> There is no idle time. In that case, the util_avg value might aswell
>>> be an under estimation of the size of the task.
>>> It is possible that undesired frequency spikes will appear when the
>>> task is later enqueued with an inflated util_est value, but the
>>> frequency spike might aswell be deserved. The absence of idle time
>>> prevents from drawing any conclusion.
>>>
>>> Signed-off-by: Pierre Gondois <pierre.gondois@xxxxxxx>
>>
>> This change looks reasonable to me. Did you face problems related to
>> this in a particular use case ?
>
> I think it was more related to the fact util_est is not decayed when:
> (runnable - util_avg) > margin
>
> This patch slightly helps to decay, but not that much.
Some of the 'stress-ng --class scheduler' seem to be be sensitive in
this regard. Haven't looked deeper into this.
[...]