Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

From: Vincent Guittot
Date: Tue Dec 19 2023 - 13:49:58 EST


Hi Imran,

On Tue, 19 Dec 2023 at 07:42, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>
> Hello Vincent,
>
>
> On 15/12/2023 8:59 pm, Imran Khan wrote:
> > Hello Vincent,
> > Thanks a lot for having a look and getting back.
> >
> > On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> >> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
> >>>
> >>> It has been found that sometimes a task_group has some residual
> >>> load_avg even though the load average at each of its owned queues
> >>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> >>> tg_load_avg_contrib have become 0 for a long time.
> >>> Under this scenario if another task starts running in this task_group,
> >>> it does not get proper time share on CPU since pre-existing
> >>> load average of task group inversely impacts the new task's CPU share
> >>> on each CPU.
> >>>
> >>> This change looks for the condition when a task_group has no running
> >>> tasks and sets the task_group's load average to 0 in such cases, so
> >>> that tasks that run in future under this task_group get the CPU time
> >>> in accordance with the current load.
> >>>
> >>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
> >>> ---
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> >>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> >>> host side.
> >>
> >> Could it be the root cause of your problem?
> >>
> >> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> >> then unplugged, have not been correctly removed from tg->load_avg. If
> >> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> >> tg->load_avg should be 0 too.
> >>
> > Agree, and this was my understanding as well. The issue only happens
> > with a large number of CPUs. For example, if I go from 4 to 8 and back
> > to 4, the issue does not happen, and even when it does, the residual
> > load avg is very small.
> >
> >> Could you track whether the cfs_rq->tg_load_avg_contrib is correctly
> >> removed from tg->load_avg when you unplug the CPUs? I can easily
> >> imagine that the rate limit can skip some update of tg->load_avg
> >> while offlining the CPU.
> >>
> >
> > I will try to trace it, but just so you know, this issue is happening on
> > other kernel versions (which don't have the rate limit feature) as well. I
> > started with v4.14.x but have also reproduced it on v5.4.x and v5.15.x.
> >
> I collected some debug traces to better understand the missing load avg
> context. From the traces it looks like, during scale down,
> task_group.cfs_rq[cpu].avg.load_avg is not getting updated
> properly for the CPU(s) being hotplugged out.

Your traces are interesting and I think that
task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
call update_tg_load_avg() to reflect that in tg->load_avg.
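
To spell out what I mean (a simplified sketch of the propagation step,
ignoring the rate limit and the update threshold, so not the exact mainline
code): update_tg_load_avg() only folds the difference between the cfs_rq's
current load_avg and its last recorded contribution into tg->load_avg, so if
it is never called again for the cfs_rq of an offlined CPU, the last
tg_load_avg_contrib stays accounted in tg->load_avg:

	/* simplified sketch, not the exact kernel code */
	static inline void update_tg_load_avg_sketch(struct cfs_rq *cfs_rq)
	{
		long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

		if (delta) {
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		}
	}

This matches your trace below, where cfs_rq[11] is left with
tg_load_avg_contrib = 1024 and nothing ever subtracts it once CPU 11 goes
away.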

Could you try the patch below? It forces the scheduler to clear the
contribution of all the cfs_rqs of a CPU that becomes offline.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 466e01b2131f..e5da5eaab6ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+
+	/* rq has been offline and doesn't contribute anymore to the share */
+	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+		return;
+
 	/*
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
@@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	}
 }
 
+static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
+{
+	long delta;
+	u64 now;
+
+	/*
+	 * No need to update load_avg for root_task_group as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	delta = 0 - cfs_rq->tg_load_avg_contrib;
+	atomic_long_add(delta, &cfs_rq->tg->load_avg);
+	cfs_rq->tg_load_avg_contrib = 0;
+	cfs_rq->last_update_tg_load_avg = now;
+}
+
+/* cpu offline callback */
+static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
+{
+	struct task_group *tg;
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * The rq clock has already been updated in the
+	 * set_rq_offline(), so we should skip updating
+	 * the rq clock again in unthrottle_cfs_rq().
+	 */
+	rq_clock_start_loop_update(rq);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+		clear_tg_load_avg(cfs_rq);
+	}
+	rcu_read_unlock();
+
+	rq_clock_stop_loop_update(rq);
+}
+
 /*
  * Called within set_task_rq() right before setting a task's CPU. The
  * caller only guarantees p->pi_lock is held; no other assumptions,
@@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)
 
 	/* Ensure any throttled groups are reachable by pick_next_task */
 	unthrottle_offline_cfs_rqs(rq);
+
+	/* Ensure that we remove rq contribution to group share */
+	clear_tg_offline_cfs_rqs(rq);
 }
 
 #endif /* CONFIG_SMP */
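
The idea is in two parts: the cpu_active() check prevents the cfs_rq of a CPU
that is going offline from adding a new contribution once it has been cleared,
and clear_tg_offline_cfs_rqs() walks the task groups from rq_offline_fair() to
remove whatever contribution is still recorded for that CPU, similar to what
unthrottle_offline_cfs_rqs() does just before it for throttled groups. With
this applied, could you check with your trace whether tg->load_avg goes back
to 0 for your test groups after the scale down?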


>
> For example, if we look at the following snippet (I have kept
> only the relevant portion of the trace in the mail), we can see that,
> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11],
> both the load avg and the contribution of this cfs_rq were 1024.
> So the delta was zero and this contribution eventually remains undeducted.
> In this case the scale down was done from 16 to 8 CPUs, so CPU 11 has been
> offlined.
>
>
> cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
> cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 0 delta = 0 ###
> systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 0 delta = 1024 ###
> systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
> cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1024 delta = 10 ###
>
> .........................
> .........................
> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 3085 delta = 0 ###
> .........................
> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 4041 delta = 1024 ###
> .........................
> systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> tg->load_avg = 3010 delta = 0 ###
> ..........................
> systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
>
> ..........................
> ..........................
> <idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
> <idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
> <idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
>
>
> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
> drop in cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
> the contribution of this cfs_rq will get deducted from tg->load_avg.
> It looks like, during hotplug, the load of one or more tasks being migrated is
> not getting accounted in the source cfs_rq, and this is ending up as residual
> load_avg at the task_group (if these tasks are members of a task_group).
>
> Moreover, this looks racy and dependent on the number of CPUs or on some delay.
> For example, for a scale down from 124 to 4 CPUs I always hit this issue, but
> for a scale down from 16 to 4 CPUs I hit it 8-9 times out of 10.
> Also, in the cases where the residual load_avg at the task group is low
> (like < 10), I can see that both of my test cgroups get similar CPU times,
> which further shows that the unaccounted load avg ending up in a task_group
> is eventually leading to uneven CPU allotment between task groups.
>
>
> I am debugging it further, but in the meantime, if you have some suggestions
> or need traces from some specific portion of the sched code, please let me know.
>
> Thanks,
> Imran
>
> > Thanks,
> > Imran
> >