Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

From: Imran Khan
Date: Tue Dec 19 2023 - 01:42:35 EST


Hello Vincent,


On 15/12/2023 8:59 pm, Imran Khan wrote:
> Hello Vincent,
> Thanks a lot for having a look and getting back.
>
> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>>>
>>> It has been found that sometimes a task_group has some residual
>>> load_avg even though the load average at each of its owned queues,
>>> i.e. task_group.cfs_rq[cpu].avg.load_avg and
>>> task_group.cfs_rq[cpu].tg_load_avg_contrib, has been 0 for a long time.
>>> Under this scenario, if another task starts running in this task_group,
>>> it does not get its proper share of CPU time, since the pre-existing
>>> load average of the task group inversely impacts the new task's CPU
>>> share on each CPU.
>>>
>>> This change looks for the condition where a task_group has no running
>>> tasks and, in such cases, sets the task_group's load average to 0, so
>>> that tasks that run under this task_group in the future get CPU time
>>> in accordance with the current load.
>>>
>>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
>>> ---
>>>
>>
>> [...]
>>
>>>
>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>>> host side.
>>
>> Could it be the root cause of your problem?
>>
>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
>> and then unplugged has not been correctly removed from tg->load_avg. If
>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0, then
>> tg->load_avg should be 0 too.
>>
> Agree, and this was my understanding as well. The issue only happens
> with a large number of CPUs. For example, if I go from 4 to 8 CPUs and
> back to 4, the issue does not happen, and even if it does, the residual
> load avg is very small.
>
>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
>> removed from tg->load_avg when you unplug the CPUs? I can easily
>> imagine that the rate limit can skip some updates of tg->load_avg
>> while offlining the CPU.
>>
>
> I will try to trace it, but just so you know, this issue is happening on
> other kernel versions (which don't have the rate limit feature) as well.
> I started with v4.14.x, and have reproduced it on v5.4.x and v5.15.x too.
>
I collected some debug traces to better understand the missing load_avg.
From the traces it looks like, during scale down,
task_group.cfs_rq[cpu].avg.load_avg is not getting updated properly for
the CPU(s) being hotplugged out.

For example, in the following snippet (I have kept only the relevant
portion of the trace in the mail), we can see that in the last invocation
of update_tg_load_avg for task_group.cfs_rq[11], both the load_avg and
the contribution of this cfs_rq were 1024. The delta was therefore zero,
so this contribution is never deducted from tg->load_avg. In this case
the scale down was from 16 to 8 CPUs, so CPU 11 was offlined.
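
For reference, the check in update_tg_load_avg() works roughly like this
(a simplified sketch of kernel/sched/fair.c; names and details vary a bit
across the kernel versions mentioned above, and the newer rate-limit
check is omitted):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

        /* The root task_group does not use tg->load_avg. */
        if (cfs_rq->tg == &root_task_group)
                return;

        /* Propagate only if the change is big enough (~1/64 of contrib). */
        if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
        }
}

So in the trace below, the last invocation for cfs_rq[11] sees load_avg
and contrib both at 1024, computes delta = 0, and since no further
invocation happens for this cfs_rq after the CPU goes offline, nothing
ever pulls those 1024 back out of tg->load_avg.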


cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 0 delta = 0 ###
systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 0 delta = 1024 ###
systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1024 delta = 10 ###

.........................
.........................
cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 3085 delta = 0 ###
.........................
sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 4041 delta = 1024 ###
.........................
systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
tg->load_avg = 3010 delta = 0 ###
..........................
systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]

..........................
..........................
<idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###
<idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###
<idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###


If I understand correctly, when CPU 11 is offlined, the task(s) on its
cfs_rq will be migrated and its cfs_rq.avg.load_avg will be updated
accordingly. This drop in cfs_rq.avg.load_avg would then be detected by
update_tg_load_avg, and the contribution of this cfs_rq would get
deducted from tg->load_avg. It looks like, during hotplug, the load of
one or more tasks being migrated is not getting accounted for in the
source cfs_rq, and this ends up as residual load_avg in the task_group
(if these tasks are members of a task_group).
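
To frame the debugging, my understanding is that the subtraction for a
migrated task is deferred, along these lines (a simplified sketch based
on kernel/sched/fair.c; v4.14 uses atomics instead of the removed
struct, and the real code takes removed.lock, so details differ across
the versions I tested):

/* On migration, the source cfs_rq only records what must go away. */
static void remove_entity_load_avg(struct sched_entity *se)
{
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        cfs_rq->removed.nr++;
        cfs_rq->removed.load_avg += se->avg.load_avg;
}

/* The deferred amount is folded in later, on the source CPU. */
static int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
        int decayed = 0;

        if (cfs_rq->removed.nr) {
                long r = cfs_rq->removed.load_avg;

                cfs_rq->removed.nr = 0;
                cfs_rq->removed.load_avg = 0;
                sub_positive(&cfs_rq->avg.load_avg, r);
                decayed = 1;
        }
        return decayed;
}

Only after this fold-in can update_tg_load_avg() observe the drop and
deduct the contribution. If the fold-in never happens on the outgoing
CPU, or happens after the last update_tg_load_avg() call for that
cfs_rq, the contribution stays in tg->load_avg, which would match what
the trace above shows.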

Moreover, this looks racy and dependent on the number of CPUs or some
delay. For example, on scale down from 124 to 4 CPUs I always hit the
issue, but on scale down from 16 to 4 CPUs I hit it 8-9 times out of 10.
Also, in the cases where the residual load_avg in the task group is small
(say < 10), I can see that both of my test cgroups get similar CPU time,
which further confirms that the unaccounted load_avg ending up in a
task_group is what eventually leads to uneven CPU allotment between task
groups.
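
For completeness, my understanding of why the residue translates into
uneven CPU time: the weight of a group's per-CPU entity is computed
roughly as follows (simplified from calc_group_shares() in
kernel/sched/fair.c), so stale load left in tg->load_avg inflates the
denominator and shrinks the shares of the group's still-active cfs_rqs:

static long calc_group_shares(struct cfs_rq *cfs_rq)
{
        long tg_shares, load, tg_weight, shares;
        struct task_group *tg = cfs_rq->tg;

        tg_shares = READ_ONCE(tg->shares);
        load = max(scale_load_down(cfs_rq->load.weight),
                   cfs_rq->avg.load_avg);

        /* Swap this cfs_rq's stale contribution for its current load. */
        tg_weight = atomic_long_read(&tg->load_avg);
        tg_weight -= cfs_rq->tg_load_avg_contrib;
        tg_weight += load;

        shares = tg_shares * load;
        if (tg_weight)
                shares /= tg_weight;    /* residue inflates tg_weight */

        return clamp_t(long, shares, MIN_SHARES, tg_shares);
}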


I am debugging it further, but in the meantime, if you have any
suggestions or need traces from some specific portion of the sched code,
please let me know.

Thanks,
Imran

> Thanks,
> Imran
>