Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

From: Dietmar Eggemann
Date: Thu Sep 05 2024 - 09:03:00 EST


On 29/08/2024 17:42, Hongyan Xia wrote:
> On 22/08/2024 15:58, Vincent Guittot wrote:
>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>> <vincent.guittot@xxxxxxxxxx> wrote:
>>>
>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@xxxxxxx> wrote:
>>>>
>>>> Vincent,
>>>>
>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@xxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@xxxxxxx>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> Sorry for bombarding this thread in the last couple of days.
>>>>>>>>> I'm seeing
>>>>>>>>> several issues in the latest tip/sched/core after these patches
>>>>>>>>> landed.
>>>>>>>>>
>>>>>>>>> What I'm now seeing seems to be an unbalanced call of util_est.
>>>>>>>>> First, I applied
>>>>>>>>
>>>>>>>> I also see a remaining util_est for idle rq because of an unbalance
>>>>>>>> call of util_est_enqueue|dequeue
>>>>>>>>
>>>>>>>
>>>>>>> I can confirm issues with the utilization values and frequencies
>>>>>>> being driven
>>>>>>> seemingly incorrectly, in particular for little cores.
>>>>>>>
>>>>>>> What I'm seeing with the stock series is high utilization values
>>>>>>> for some tasks
>>>>>>> and little cores having their frequencies maxed out for extended
>>>>>>> periods of
>>>>>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the
>>>>>>> cores are mostly
>>>>>>> idle. But whenever certain tasks get scheduled there, they have a
>>>>>>> very high util
>>>>>>> level and so the frequency is kept at max.
>>>>>>>
>>>>>>> As a consequence this drives up power usage.
>>>>>>>
>>>>>>> I gave Hongyan's draft fix a try and observed a much more
>>>>>>> reasonable behavior for
>>>>>>> the util numbers and frequencies being used for the little cores.
>>>>>>> With his fix,
>>>>>>> I can also see lower energy use for my specific benchmark.
>>>>>>
>>>>>> The main problem is that the util_est of a delayed dequeued task
>>>>>> remains on the rq and keeps the rq utilization high and as a result
>>>>>> the frequency higher than needed.
>>>>>>
>>>>>> The below seems to works for me and keep sync the enqueue/dequeue of
>>>>>> uti_test with the enqueue/dequeue of the task as if de dequeue was
>>>>>> not
>>>>>> delayed
>>>>>>
>>>>>> Another interest is that we will not try to migrate a delayed dequeue
>>>>>> sleeping task that doesn't actually impact the current load of the
>>>>>> cpu
>>>>>> and as a result will not help in the load balance. I haven't yet
>>>>>> fully
>>>>>> checked what would happen with hotplug
>>>>>
>>>>> Thanks. Those are good points. Let me go and try your patch.
>>>>
>>>> I gave your fix a try, but it seems to make things worse. It is
>>>> comparable
>>>> to the behavior we had before Peter added the uclamp imbalance fix,
>>>> so I
>>>> believe there is something incorrect there.
>>>
>>> we need to filter case where task are enqueued/dequeued several
>>> consecutive times. That's what I'm look now
>>
>> I just realize before that It's not only util_est but the h_nr_running
>> that keeps delayed tasks as well so all stats of the rq are biased:
>> h_nr_running, util_est, runnable avg and load_avg.
>
> After staring at the code even more, I think the situation is worse.
>
> First thing is that uclamp might also want to be part of these stats
> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> delayed dequeue which needs to be specially handled in the same way. The
> current way of handling uclamp in core.c misses the frequency update,
> like I commented before.
>
> Second, there is also an out-of-sync issue in update_load_avg(). We only
> update the task-level se in delayed dequeue and requeue, but we return
> early and the upper levels are completely skipped, as if the delayed
> task is still on rq. This de-sync is wrong.

I had a look at the util_est issue.

This keeps rq->cfs.avg.util_avg sane for me with
SCHED_FEAT(DELAY_DEQUEUE, true):

-->8--