Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF

From: Peter Zijlstra
Date: Fri Jan 10 2025 - 06:58:19 EST


On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> Thanks for all your hard work on this.
>
> On 2025.01.09 03:00 Peter Zijlstra wrote:
>
> ...
>
> > This made me have a very hard look at reweight_entity(), and
> > specifically the ->on_rq case, which is more prominent with
> > DELAY_DEQUEUE.
> >
> > And indeed, it is all sorts of broken. While the computation of the new
> > lag is correct, the computation for the new vruntime, using the new lag
> > is broken for it does not consider the logic set out in place_entity().
> >
> > With the below patch, I now see things like:
> >
> > migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> > { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
> } ->
> > { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203
> }
> > migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> > { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
> 6316614641111 } ->
> > { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650
> }
> >
> > Which isn't perfect yet, but much closer.
>
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
> It still compares to the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
> verses 6 seconds without the patch.

Progress!

> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:

Right, so by happy accident I also left mine running overnight and I
think I caught one of those weird spikes. It took a bit of staring to
figure out what went wrong this time, but what I now think is the thing
that sets off the chain of fail is a combination of DELAY_DEQUEUE and
the cgroup reweight stuff.

Specifically, when a cgroup's CPU queue becomes empty,
calc_group_shares() will drop its weight to the floor.

Now, normally, when a queue goes empty, it gets dequeued from its
parent and its weight is immaterial. However, with DELAY_DEQUEUE the
thing will stick around for a while -- at a very low weight.

What makes this esp. troublesome is that even for an active cgroup (like
the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus
like). Worse, the avg_vruntime thing uses load_scale_down() because u64
just isn't all that big :/

(if only all 64bit machines could do 128bit divisions as cheaply as x86_64)

This means that on a 16 CPU machine, the weight of a 'normal' all busy
queue will be 64, and the weight of an empty queue will be 2, which means
the effect of the ginormous lag on the avg_vruntime is fairly
significant, pushing it wildly off balance and affecting placement of
new tasks.

Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
competition as the entity was.

As such, we should not adjust the weight of an empty queue.

I've started a new run, and some 15 minutes of runtime show nothing
interesting, I'll have to let it run for a while again.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..93644b3983d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
long shares;

- if (!gcfs_rq)
+ /*
+ * When a group becomes empty, preserve its weight. This matters for
+ * DELAY_DEQUEUE.
+ */
+ if (!gcfs_rq || !gcfs_rq->load.weight)
return;

if (throttled_hierarchy(gcfs_rq))