Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: Chen Yu
Date: Thu Jun 06 2024 - 22:38:53 EST


On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>
>
> > On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> >
> >
> > Sorry for the late reply, and thanks for helping to clarify this. Yes, this
> > is what my previous concern was:
> > 1. It does not consider the cgroup, and does not check preemption at the
> > same level, which is covered by find_matching_se().
> > 2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant,
> > because pick_eevdf() will check the eligibility of current later anyway.
> > But as pointed out by Chunxin, his concern is the double traversal of the
> > rb-tree, so I wonder if we could leverage cfs_rq->next to store the next
> > candidate, so that it can be picked directly in the 2nd pick as a fast path?
> > Something like below untested:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..f716646d595e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> >  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> >  {
> >  	struct task_struct *curr = rq->curr;
> > -	struct sched_entity *se = &curr->se, *pse = &p->se;
> > +	struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> >  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> >  	int cse_is_idle, pse_is_idle;
> >
> > @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >  	/*
> >  	 * XXX pick_eevdf(cfs_rq) != se ?
> >  	 */
> > -	if (pick_eevdf(cfs_rq) == pse)
> > +	next = pick_eevdf(cfs_rq);
> > +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> > +		set_next_buddy(next);
> > +
> > +	if (next == pse)
> >  		goto preempt;
> >
> >  	return;
> >
> >
> > thanks,
> > Chenyu
>
> Hi Chen
>
> First of all, thank you for your patient response. Regarding the issue of avoiding
> traversing the RB-tree twice, I initially had two methods in mind:
> 1. Cache the optimal result so that it can be used directly during the second
> pick_eevdf operation. This idea is similar to the one you proposed this time.
> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair',
> because I believe that 'checking whether preemption is necessary' and 'finding the optimal
> process to schedule' are two different things.

I agree, and it seems that in the current EEVDF implementation the former relies on the latter.
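
To make that dependency concrete, the tail of check_preempt_wakeup_fair() in current
mainline looks roughly like this (paraphrased, not a verbatim copy): the preemption
question is answered by running the picker, so the check already pays for a full
pick_eevdf() rb-tree walk.

	/*
	 * After find_matching_se() has walked se and pse up to the same
	 * cgroup level, "should we preempt?" is answered by asking the
	 * picker: preempt only if it would choose the waking entity.
	 */
	if (pick_eevdf(cfs_rq) == pse)
		goto preempt;

	return;

preempt:
	resched_curr(rq);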

> 'check_preempt_wakeup_fair' is not just there to check whether the newly awakened
> process should preempt the current process; it can also serve as an opportunity to
> check whether any other process should preempt the current one, thereby improving the
> real-time responsiveness of the scheduler. Although pick_eevdf now also evaluates the
> eligibility of 'curr', if the result it returns is not the awakened process, the
> current process will still not be preempted.

I think Mike has proposed a patch to deal with the scenario you mentioned above:
https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@xxxxxx/

And I suppose you are referring to increasing the chance to preempt current, rather
than reducing the number of pick_eevdf() invocations in check_preempt_wakeup_fair().
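
If I read that approach correctly (this is a sketch reconstructed from the subject line
and the entity_eligible() check discussed above, so treat it as an assumption rather
than the exact patch), increasing that chance amounts to rescheduling current as soon
as it becomes ineligible, without waiting for the picker to displace it:

	struct sched_entity *se = &curr->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	/*
	 * Sketch only: if current has run past the point where it is
	 * still eligible, request a resched directly instead of relying
	 * on pick_eevdf(cfs_rq) == pse to trigger the preemption.
	 */
	if (!entity_eligible(cfs_rq, se))
		resched_curr(rq);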

> Therefore, I posted the v2 PATCH.
> The implementation of the v2 PATCH might express this point more clearly.
> https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@xxxxxxxxx/T/
>

Let me take a look at it and do some tests.

> I previously implemented and tested both of these methods, and the test results showed
> that method 2 brought somewhat more noticeable benefits. Therefore, I submitted method 2.
> Now that I think about it, perhaps both methods could be viable at the same time. :)
>

Actually I found that, even without any changes, if we enable the sched feature
NEXT_BUDDY, the wakeup latency and request latency are both reduced. The following is
the schbench result on a 240-CPU system:

NO_NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
       50.0th: 6 (429125 samples)
       90.0th: 14 (682355 samples)
      * 99.0th: 29 (126695 samples)
       99.9th: 529 (14603 samples)
       min=1, max=4741
Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
       50.0th: 14992 (550939 samples)
       90.0th: 15376 (668687 samples)
      * 99.0th: 15600 (128111 samples)
       99.9th: 15888 (11238 samples)
       min=3528, max=31677
RPS percentiles (requests) runtime 100 (s) (101 total samples)
       20.0th: 16864 (31 samples)
      * 50.0th: 16928 (26 samples)
       90.0th: 17248 (36 samples)
       min=16615, max=20041
average rps: 17025.23

NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
       50.0th: 5 (376845 samples)
       90.0th: 12 (632075 samples)
      * 99.0th: 24 (114398 samples)
       99.9th: 105 (13737 samples)
       min=1, max=7428
Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
       50.0th: 14480 (524763 samples)
       90.0th: 15216 (647982 samples)
      * 99.0th: 15472 (130730 samples)
       99.9th: 15728 (13980 samples)
       min=3542, max=34805
RPS percentiles (requests) runtime 100 (s) (101 total samples)
       20.0th: 16544 (62 samples)
      * 50.0th: 16544 (0 samples)
       90.0th: 16608 (37 samples)
       min=16470, max=16648
average rps: 16572.68

So I think NEXT_BUDDY has more or less reduced the rb-tree scans.
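
That matches how the buddy is consumed on the next pick: when cfs_rq->next is set and
still eligible, pick_next_entity() can return it directly and skip the rb-tree walk,
falling back to pick_eevdf() otherwise. Roughly (paraphrased from current mainline):

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * If a wakeup installed a next buddy and it is still eligible,
	 * return it without walking the rb-tree. This trades a latency
	 * bias toward the buddy for fewer picker invocations.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}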

thanks,
Chenyu