Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

From: Aaron Lu
Date: Wed Jul 19 2023 - 01:18:57 EST


On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
> >
> > When a workload involves many wake time task migrations, tg->load_avg
> > can be heavily contended among CPUs because every migration involves
> > removing the task's load from its src cfs_rq and attaching that load to
> > its new cfs_rq. Both the remove and the attach require an update to
> > tg->load_avg as well as propagating the change up the hierarchy.
> >
> > E.g. when running postgres_sysbench on a 2 sockets/112 cores/224 CPUs
> > Intel Sapphire Rapids machine, during a 5s window the wakeup number is
> > 14 million and the migration number is 11 million. Since the workload
> > can trigger many wakeups and migrations, the accesses (both read and
> > write) to tg->load_avg can be unbounded. For the above workload, the
> > profile shows update_cfs_group() costs ~13% and update_load_avg() costs
> > ~10%. With netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21
> > million and the migration number is 14 million; update_cfs_group()
> > costs ~25% and update_load_avg() costs ~16%.
> >
> > This patch is an attempt to reduce the cost of accessing tg->load_avg.
>
> Here you mention tg->load_avg, which is updated with update_tg_load_avg(),
>
> but your patch below modifies the local updates of the cfs_rq's util_avg,
> runnable_avg and load_avg, which need to be kept up to date.

Yes, since it delays propagating the removed load, the upper cfs_rqs'
*_avg could be updated later than with the current code.

> It would be better to delay or rate limit the calls to
> update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
> access tg->load_avg and are the culprit, instead of modifying other
> places.

Thanks for the suggestion and I think it makes perfect sense.

I tried the change below, which rate limits updates to tg->load_avg in
update_tg_load_avg() to at most once per ms:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ u64 now = cfs_rq_clock_pelt(cfs_rq);

/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
if (cfs_rq->tg == &root_task_group)
return;

- if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+ if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+ abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+ cfs_rq->last_update_tg_load_avg = now;
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..b5201d44d820 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {

#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
+ u64 last_update_tg_load_avg;
long propagate;
long prop_runnable_sum;

From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
this alone already delivers good results. For postgres_sysbench, the cost
of the two functions dropped to 1%-2% each; for netperf/UDP_RR, they
dropped to ~2% and ~4% respectively. Together with some more rate
limiting on the read side, I think the end result will be better. Does
the above look reasonable to you?

Alternatively, I can remove some callsites of update_tg_load_avg() like
you suggested below and only call update_tg_load_avg() when the cfs_rq
has decayed (really just decayed, not when it has detected pending
removed load or load propagated from its children). This would give a
similar result to the above (roughly once per ms).
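
As a rough sketch of what I mean (untested; the &pelt_decayed
out-parameter is hypothetical and assumes update_cfs_rq_load_avg() could
report whether the cfs_rq's own PELT sums really decayed, separately
from the removed-load handling), update_load_avg() would then do
something like:

	decayed  = update_cfs_rq_load_avg(now, cfs_rq, &pelt_decayed);
	decayed |= propagate_entity_load_avg(se);
	...
	} else if (decayed) {
		cfs_rq_util_change(cfs_rq, 0);

		/*
		 * Removed load or child propagation alone no longer
		 * writes tg->load_avg; only a real decay does.
		 */
		if (pelt_decayed && (flags & UPDATE_TG))
			update_tg_load_avg(cfs_rq);
	}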

>
> Have you tried to remove update_cfs_group() from enqueue/dequeue and
> only let the tick update the shares periodically?

patch4 kind of did that :-)
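
Just to illustrate the idea (not necessarily exactly what patch4 does),
that would roughly mean dropping the calls from
enqueue_entity()/dequeue_entity() while keeping the one in
entity_tick(), so the shares still get refreshed every tick:

@@ enqueue_entity():
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
-	update_cfs_group(se);

@@ dequeue_entity():
 	return_cfs_rq_runtime(cfs_rq);
-	update_cfs_group(se);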

>
> Have you tried to make update_tg_load_avg() return early? The change
> of the cfs_rq's load_avg is written to tg->load_avg only when the change
> is bigger than 1.5%. Maybe increase it to 6%?

Yes, increasing the delta threshold is also a good way to limit updates
to tg->load_avg. Will try that too.
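
For reference, that would just be bumping the divisor in the existing
check in update_tg_load_avg(), e.g. 1/16 (~6%) instead of 1/64 (~1.5%):

	/* 1/64 is ~1.5%; 1/16 raises the threshold to ~6% */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 16) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}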

>
> Or, like for update_cfs_group(), remove it from attach/detach entity and
> let the periodic update propagate the change.
>
> But please don't touch local update of *_avg

OK I see.

Thanks again for the comments, they are very helpful.