Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

From: Tim Chen
Date: Thu Apr 18 2024 - 19:45:45 EST


On Thu, 2024-04-18 at 21:03 +0800, Chen Yu wrote:
> Hi Xuewen,
>
> On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> > Hi Yu
> >
> > On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > >
> > > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > > >
> > > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > > thus:
> > > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > > > delta * 2677 = 7733399554989275921
> > > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > > >
> > > > > > > Indeed.
> > > > > >
> > > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > > >
> > > > >
> > > > > This was the dump of all the entities on the tree, from left to right,
> > > >
> > > > Oh, my bad, I thought it was the pick path.
> > > >
> > > > > and also from top down in middle order traverse, when this issue happens:
> > > > >
> > > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > > >
> > > > >
> > > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > >
> > > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > >
> > > > > The tree looks like:
> > > > >
> > > > > se (0xec1234e0)
> > > > > |
> > > > > |
> > > > > ----> se (0xec4fcf20)
> > > > >
> > > > >
> > > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > > vruntime are both 763383370431, which is aligned with
> > > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > > updated correctly, because it is monotonic increasing.
> > > >
> > > > Right.
> > > >
> > > > > My guess is that, for some reason, one newly forked se in a newly
> > > > > created task group, in the rb-tree has not been picked for a long
> > > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > > >
> > > > I'll have to do the math again, but that's something in the order of not
> > > > picking a task in about a day, that would be 'bad' :-)
> > > >
> > > > Is there any sane way to reproduce this, and how often does it happen?
> > >
> > > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > > is very small seems to be related to task group's reweight_entity():
> > >
> > > vlag = (s64)(avruntime - se->vruntime);
> > > vlag = div_s64(vlag * old_weight, weight);
> > > se->vruntime = avruntime - vlag;
> > >
> > > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > > if the new weight is very small, which is very likely, then the vlag
> > > could be very large, results in a very small vruntime.
> > >
> > >
> > > The followings are the details why I think above could bring problems:
> > >
> > > Here is the debug log printed by place_entity():
> > >
> > >
> > > [ 397.597268]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-1111846207333767
> > > cfs_rq.min_vruntime:810640668779
> > > avg_vruntime():686982466017
> > > curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> > > leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > > ..
> > >
> > > [ 397.877251]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-759390883821798
> > > cfs_rq.min_vruntime:810640668779
> > > avg_vruntime(): 689577229374
> > > curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> > > leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> > >
> > >
> > > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > > on the tree entities.
> > >
> > > Back to reweight_entity():
> > > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > > se->vruntime = avruntime - vlag = -59751717438670;
> > >
> > > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > > consider that the avg_vruntime() vary.
> > >
> > > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > > eligible check:
> > >
> > > [ 398.011991]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-11875977385353427
> > > cfs_rq.min_vruntime:810640668779
> > > cfs_rq.avg_load:96985
> > > leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> > >
> > > vruntime_eligible()
> > > {
> > >
> > > key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> > > key * avg_load overflow s64...
> > > }
> > >
> > > As a result the leftmost one can not be picked, and NULL is returned.
> > >
> > > One workaround patch I'm thinking of, if this analysis is in the
> > > right direction, maybe I can have a test later:
> > >
> > > thanks,
> > > Chenyu
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6e0968fb9ba8..7ab26cdc3487 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > > * = V - vl'
> > > */
> > > if (avruntime != se->vruntime) {
> > > + s64 limit;
> > > +
> > > vlag = (s64)(avruntime - se->vruntime);
> > > vlag = div_s64(vlag * old_weight, weight);
> > > + /* TBD: using old weight or new weight? */
> > > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > > + vlag = clamp(lag, -limit, limit);
> > > se->vruntime = avruntime - vlag;
> > > }
> > >
> >
> > According to previous discussion:
> > https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@xxxxxxxxxxxxxx/
> >
> > Could this patch avoid this problem?
> >
>
> 100 cycles has passed with the following change, with the fork() cased removed.
> There was no NULL pointer issue observed so far. Meanwhile, I've launched a test
> with Tianchen's patch set on top of this one to see if everything goes well.
>
> Can you take a look if the following change make sense, and if yes, could you
> send v2 and CC the reporters, so they could also have a try if this helps.
>
> Reported-by: Sergei Trofimovich <slyich@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@xxxxxxx/
> Reported-by: Igor Raits <igor@xxxxxxxxxxxx>
> Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@xxxxxxxxxxxxxx/
> Reported-by: Breno Leitao <leitao@xxxxxxxxxx>
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@xxxxxxxxx/
> Reported-by: Yujie Liu <yujie.liu@xxxxxxxxx>
>
>
> And in v2, it would be helpful to include the reason why we did this change, how
> about something like this:
>
> "
> kernel encounters the following error when running workload:
>
> BUG: kernel NULL pointer dereference, address: 0000002c
> EIP: set_next_entity (fair.c:?)
>
> which was caused by NULL pointer returned by pick_eevdf().
>
> Further investigation has shown that, the entity_eligible() has an
> false-negative issue when the entity's vruntime is far behind the
> cfs_rq.min_vruntime that, the (vruntime - cfs_rq->min_vruntime) * load
> caused a s64 overflow, thus every entity on the rb-tree is not
> eligible, which results in a NULL candidate.
>
> The reason why entity's vruntime is far behind the cfs_rq.min_vruntime
> is because during a on_rq task group's update_cfs_group()->reweight_eevdf(),
> there is no limit on the new entity's vlag. If the new weight is much
> smaller than the old one,

Probably worth mentioning in commit log that the vlag could be many times larger than the scheduling
entity's time slice leading to vruntime overflow on going to a really small weight. We're limiting
the lag update on reweight_eevdf() to a max of twice the sched entity's time slice to prevent lag overflow, just
as we are doing on update_sched_entity().

This will help to clarify the reason for the change when looking at this patch for the first time.

Tim

>
> vlag = div_s64(vlag * old_weight, weight)
>
> generates a huge vlag, and results in very small(negative) vruntime.
>
> Thus limit the range of vlag accordingly.
>
> "
>
>