Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

From: Xuewen Yan
Date: Wed Apr 17 2024 - 22:57:45 EST


Hi Yu

On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > >
> > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > thus:
> > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > delta * 2677 = 7733399554989275921
> > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > negative value into a positive value, thus eligible check fails.
> > > > >
> > > > > Indeed.
> > > >
> > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > >
> > >
> > > This was the dump of all the entities on the tree, from left to right,
> >
> > Oh, my bad, I thought it was the pick path.
> >
> > > and also from top down in middle order traverse, when this issue happens:
> > >
> > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > >
> > >
> > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > The tree looks like:
> > >
> > > se (0xec1234e0)
> > > |
> > > |
> > > ----> se (0xec4fcf20)
> > >
> > >
> > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > vruntime are both 763383370431, which is aligned with
> > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > updated correctly, because it is monotonic increasing.
> >
> > Right.
> >
> > > My guess is that, for some reason, one newly forked se in a newly
> > > created task group, in the rb-tree has not been picked for a long
> > > time(maybe not eligible). Its vruntime stopped at the negative
> > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> >
> > I'll have to do the math again, but that's something in the order of not
> > picking a task in about a day, that would be 'bad' :-)
> >
> > Is there any sane way to reproduce this, and how often does it happen?
>
> After adding some ftrace in place_entity() and pick_eevdf(), with the
> help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> is very small seems to be related to task group's reweight_entity():
>
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> se->vruntime = avruntime - vlag;
>
> The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> if the new weight is very small, which is very likely, then the vlag
> could be very large, results in a very small vruntime.
>
>
> The followings are the details why I think above could bring problems:
>
> Here is the debug log printed by place_entity():
>
>
> [ 397.597268]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-1111846207333767
> cfs_rq.min_vruntime:810640668779
> avg_vruntime():686982466017
> curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> ..
>
> [ 397.877251]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-759390883821798
> cfs_rq.min_vruntime:810640668779
> avg_vruntime(): 689577229374
> curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
>
>
> The leftmost se is a task group, its vruntime reduces from 332464705486 to
> -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> on the tree entities.
>
> Back to reweight_entity():
> vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> se->vruntime = avruntime - vlag = -59751717438670;
>
> the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> consider that the avg_vruntime() vary.
>
> Then later this leftmost se has changed its load back and forth, and when the load is 2,
> the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> eligible check:
>
> [ 398.011991]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-11875977385353427
> cfs_rq.min_vruntime:810640668779
> cfs_rq.avg_load:96985
> leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
>
> vruntime_eligible()
> {
>
> key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> key * avg_load overflow s64...
> }
>
> As a result the leftmost one can not be picked, and NULL is returned.
>
> One workaround patch I'm thinking of, if this analysis is in the
> right direction, maybe I can have a test later:
>
> thanks,
> Chenyu
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e0968fb9ba8..7ab26cdc3487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> * = V - vl'
> */
> if (avruntime != se->vruntime) {
> + s64 limit;
> +
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> + /* TBD: using old weight or new weight? */
> + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> + vlag = clamp(lag, -limit, limit);
> se->vruntime = avruntime - vlag;
> }
>

According to previous discussion:
https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@xxxxxxxxxxxxxx/

Could this patch avoid this problem?

BR

> --
> 2.25.1
>
>
>