Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found
From: Chen Yu
Date: Thu Apr 18 2024 - 09:05:24 EST
Hi Xuewen,
On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> >
> > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > >
> > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > thus:
> > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > > delta * 2677 = 7733399554989275921
> > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > >
> > > > > > Indeed.
> > > > >
> > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > >
> > > >
> > > > This was the dump of all the entities on the tree, from left to right,
> > >
> > > Oh, my bad, I thought it was the pick path.
> > >
> > > > and also from top down in middle order traverse, when this issue happens:
> > > >
> > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > >
> > > >
> > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > The tree looks like:
> > > >
> > > > se (0xec1234e0)
> > > > |
> > > > |
> > > > ----> se (0xec4fcf20)
> > > >
> > > >
> > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > vruntime are both 763383370431, which is aligned with
> > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > updated correctly, because it is monotonic increasing.
> > >
> > > Right.
> > >
> > > > My guess is that, for some reason, one newly forked se in a newly
> > > > created task group, in the rb-tree has not been picked for a long
> > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > >
> > > I'll have to do the math again, but that's something in the order of not
> > > picking a task in about a day, that would be 'bad' :-)
> > >
> > > Is there any sane way to reproduce this, and how often does it happen?
> >
> > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > is very small seems to be related to task group's reweight_entity():
> >
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > se->vruntime = avruntime - vlag;
> >
> > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > if the new weight is very small, which is very likely, then the vlag
> > could be very large, results in a very small vruntime.
> >
> >
> > The followings are the details why I think above could bring problems:
> >
> > Here is the debug log printed by place_entity():
> >
> >
> > [ 397.597268]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-1111846207333767
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime():686982466017
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > ..
> >
> > [ 397.877251]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-759390883821798
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime(): 689577229374
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> >
> >
> > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > on the tree entities.
> >
> > Back to reweight_entity():
> > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > se->vruntime = avruntime - vlag = -59751717438670;
> >
> > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > consider that the avg_vruntime() vary.
> >
> > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > eligible check:
> >
> > [ 398.011991]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-11875977385353427
> > cfs_rq.min_vruntime:810640668779
> > cfs_rq.avg_load:96985
> > leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> >
> > vruntime_eligible()
> > {
> >
> > key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> > key * avg_load overflow s64...
> > }
> >
> > As a result the leftmost one can not be picked, and NULL is returned.
> >
> > One workaround patch I'm thinking of, if this analysis is in the
> > right direction, maybe I can have a test later:
> >
> > thanks,
> > Chenyu
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e0968fb9ba8..7ab26cdc3487 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > * = V - vl'
> > */
> > if (avruntime != se->vruntime) {
> > + s64 limit;
> > +
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > + /* TBD: using old weight or new weight? */
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + vlag = clamp(lag, -limit, limit);
> > se->vruntime = avruntime - vlag;
> > }
> >
>
> According to previous discussion:
> https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@xxxxxxxxxxxxxx/
>
> Could this patch avoid this problem?
>
100 cycles has passed with the following change, with the fork() cased removed.
There was no NULL pointer issue observed so far. Meanwhile, I've launched a test
with Tianchen's patch set on top of this one to see if everything goes well.
Can you take a look if the following change make sense, and if yes, could you
send v2 and CC the reporters, so they could also have a try if this helps.
Reported-by: Sergei Trofimovich <slyich@xxxxxxxxx>
Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@xxxxxxx/
Reported-by: Igor Raits <igor@xxxxxxxxxxxx>
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@xxxxxxxxxxxxxx/
Reported-by: Breno Leitao <leitao@xxxxxxxxxx>
Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@xxxxxxxxx/
Reported-by: Yujie Liu <yujie.liu@xxxxxxxxx>
And in v2, it would be helpful to include the reason why we did this change, how
about something like this:
"
kernel encounters the following error when running workload:
BUG: kernel NULL pointer dereference, address: 0000002c
EIP: set_next_entity (fair.c:?)
which was caused by NULL pointer returned by pick_eevdf().
Further investigation has shown that, the entity_eligible() has an
false-negative issue when the entity's vruntime is far behind the
cfs_rq.min_vruntime that, the (vruntime - cfs_rq->min_vruntime) * load
caused a s64 overflow, thus every entity on the rb-tree is not
eligible, which results in a NULL candidate.
The reason why entity's vruntime is far behind the cfs_rq.min_vruntime
is because during a on_rq task group's update_cfs_group()->reweight_eevdf(),
there is no limit on the new entity's vlag. If the new weight is much
smaller than the old one,
vlag = div_s64(vlag * old_weight, weight)
generates a huge vlag, and results in very small(negative) vruntime.
Thus limit the range of vlag accordingly.
"
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31bca05c3612..9f203012e8f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
*
* XXX could add max_slice to the augmented data to track this.
*/
+
+static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
+{
+ s64 limit;
+
+ limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+ return clamp(lag, -limit, limit);
+}
+
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- s64 lag, limit;
+ s64 lag;
SCHED_WARN_ON(!se->on_rq);
lag = avg_vruntime(cfs_rq) - se->vruntime;
- limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
- se->vlag = clamp(lag, -limit, limit);
+ se->vlag = limit_entity_lag(se, lag);
}
/*
@@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
if (avruntime != se->vruntime) {
vlag = (s64)(avruntime - se->vruntime);
vlag = div_s64(vlag * old_weight, weight);
+ vlag = limit_entity_lag(se, vlag);
se->vruntime = avruntime - vlag;
}
@@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
update_load_set(&se->load, weight);
+ if (!se->on_rq)
+ se->vlag = limit_entity_lag(se, se->vlag);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
--
2.25.1
thanks,
Chenyu