Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: Chen Yu
Date: Wed Jun 05 2024 - 13:20:07 EST


Hi Prateek, Chunxin,

On 2024-05-28 at 10:32:23 +0530, K Prateek Nayak wrote:
> Hello Chunxin,
>
> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
> >
> >> On May 24, 2024, at 23:30, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> >>
> >> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
> >>> I found that some tasks have been running for a long enough time and
> >>> have become ineligible, but they are still not releasing the CPU. This
> >>> will increase the scheduling delay of other processes. Therefore, I
> >>> tried checking the current process in wakeup_preempt and entity_tick,
> >>> and if it is ineligible, reschedule that cfs_rq.
> >>>
> >>> The modification can reduce the scheduling delay by about 30% when
> >>> RUN_TO_PARITY is enabled.
> >>> So far, it has been running well in my test environment, and I have
> >>> pasted some test results below.
> >>>
> >>
> >> Interesting. Besides hackbench, I assume that you have workloads in a
> >> real production environment that are sensitive to wakeup latency?
> >
> > Hi Chen
> >
> > Yes, my workloads are quite sensitive to wakeup latency.
> >>
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 03be0d1330a6..a0005d240db5 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> >>>  	    hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
> >>>  		return;
> >>>  #endif
> >>> +
> >>> +	if (!entity_eligible(cfs_rq, curr))
> >>> +		resched_curr(rq_of(cfs_rq));
> >>>  }
> >>>
> >>
> >> entity_tick() -> update_curr() -> update_deadline():
> >> 	se->vruntime >= se->deadline ? resched_curr()
> >> Only when current has expired its slice will it be scheduled out.
> >>
> >> So here you want to schedule current out if its lag becomes 0.
> >>
> >> In the latest sched/eevdf branch, it is controlled by two sched features:
> >> RESPECT_SLICE: Inhibit preemption until the current task has exhausted its slice.
> >> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
> >> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
> >>
> >> Maybe something like this can achieve your goal:
> >> 	if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
> >> 		resched_curr(rq_of(cfs_rq));
> >>
> >>>
> >>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> >>>  		return;
> >>> 
> >>> +	if (!entity_eligible(cfs_rq, se))
> >>> +		goto preempt;
> >>> +
> >>
> >> Not sure if this is applicable: later in this function, pick_eevdf() checks
> >> whether the current is eligible via !entity_eligible(cfs_rq, curr), and if
> >> not, curr will be evicted. Also, this change does not consider the cgroup hierarchy.
>
> The above line will be referred to as [1] below.
>
> >>
> >> Besides, the check of current's eligibility can get a false-negative result
> >> if the enqueued entity has a positive lag. Prateek proposed to
> >> remove the check of current's eligibility in pick_eevdf():
> >> https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@xxxxxxx/
> >
> > Thank you for letting me know about Peter's latest updates and thoughts.
> > Actually, the original intention of my modification was to minimize the
> > traversal of the rb-tree as much as possible. For example, in the following
> > scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> > 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
> > the resched, the scheduler calls 'pick_eevdf' again, traversing the
> > rb-tree once more. This ultimately results in the rb-tree being traversed
> > twice. If it were possible to determine that 'curr' is ineligible within
> > 'wakeup_preempt' and directly trigger a resched, it would reduce the
> > rb-tree traversals by one.
> >
> >
> > wakeup_preempt -> pick_eevdf -> resched_curr
> >                   |-> 'traverse the rb-tree'
> >                                   schedule -> pick_eevdf
> >                                               |-> 'traverse the rb-tree'
>
> I see what you mean but a couple of things:
>
> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
> below for ease of interpretation)
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 03be0d1330a6..a0005d240db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> >  		return;
> > 
> > +	if (!entity_eligible(cfs_rq, se))
> > +		goto preempt;
> > +
>
> This check uses the root cfs_rq since "task_cfs_rq()" returns the
> "rq->cfs" of the runqueue the task is on. In the presence of cgroups or
> CONFIG_SCHED_AUTOGROUP, there is a good chance the task is queued
> on a higher-level cfs_rq, and this entity_eligible() calculation might
> not be valid, since the vruntime calculation for the "se" is relative to
> the "cfs_rq" where it is queued. Please correct me if I'm wrong, but
> I believe that is what Chenyu was referring to in [1].
>

Sorry for the late reply, and thanks for helping clarify this. Yes, this is
what my previous concern was:
1. It does not consider cgroups, and it does not check preemption at the same
level, which is covered by find_matching_se().
2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant, because
pick_eevdf() will check the eligibility of current later anyway. But,
as pointed out by Chunxin, his concern is the double traversal of the rb-tree,
so I just wonder if we could leverage cfs_rq->next to store the next
candidate, so it can be picked directly in the 2nd pick as a fast path?
Something like below, untested:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..f716646d595e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_entity *se = &curr->se, *pse = &p->se;
+	struct sched_entity *se = &curr->se, *pse = &p->se, *next;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int cse_is_idle, pse_is_idle;
 
@@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	next = pick_eevdf(cfs_rq);
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
+		set_next_buddy(next);
+
+	if (next == pse)
 		goto preempt;
 
 	return;


thanks,
Chenyu

> >  	find_matching_se(&se, &pse);
> >  	WARN_ON_ONCE(!pse);
> >
> > --
>
> In addition to that, there is an update_curr() call below for the first
> cfs_rq where both the entities' hierarchies are queued, which is found by
> find_matching_se(). I believe that is required too, to update the
> vruntime and deadline of the entity where preemption can happen.
>
> If you want to circumvent a second call to pick_eevdf(), could you
> perhaps do:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..653b1bee1e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	update_curr(cfs_rq);
> 
>  	/*
> -	 * XXX pick_eevdf(cfs_rq) != se ?
> +	 * If the hierarchy of current task is ineligible at the common
> +	 * point on the newly woken entity, there is a good chance of
> +	 * wakeup preemption by the newly woken entity. Mark for resched
> +	 * and allow pick_eevdf() in schedule() to judge which task to
> +	 * run next.
>  	 */
> -	if (pick_eevdf(cfs_rq) == pse)
> +	if (!entity_eligible(cfs_rq, se))
>  		goto preempt;
> 
>  	return;
>
> --
>
> There are other implications here, which are specifically highlighted by
> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
> entity is not the entity with the earliest eligible virtual deadline,
> the current task is still preempted if any other entity has the
> earliest eligible virtual deadline.
>
> Mike's box gave switching to the above two thumbs up; I have to check
> what my box says :)
>
> Following are DeathStarBench results with your original patch compared
> to v6.9-rc5 based tip:sched/core:
>
> ==================================================================
> Test           : DeathStarBench
> Why?           : Some tasks here do not like aggressive preemption
> Units          : Normalized throughput
> Interpretation : Higher is better
> Statistic      : Mean
> ==================================================================
> Pinning   scaling   tip     eager_preempt (pct imp)
> 1CCD      1         1.00    0.99 (%diff: -1.13%)
> 2CCD      2         1.00    0.97 (%diff: -3.21%)
> 4CCD      3         1.00    0.97 (%diff: -3.41%)
> 8CCD      6         1.00    0.97 (%diff: -3.20%)
> --
>
> I'll give the variants mentioned in the thread a try too, to see if
> some of my assumptions around heavy preemption hold good. I was also
> able to dig up an old patch by Balakumaran Kannan which skipped
> pick_eevdf() altogether if "pse" is ineligible. That also seems like
> a good optimization based on the current check in
> check_preempt_wakeup_fair(), but it perhaps doesn't help the case of
> wakeup-latency sensitivity you are optimizing for; it only reduces the
> rb-tree traversal if there is no chance of pick_eevdf() returning "pse":
> https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@xxxxxxxxx/
>
> --
> Thanks and Regards,
> Prateek
>
> >
> >
> > Of course, this would break the semantics of RESPECT_SLICE as well as
> > RUN_TO_PARITY. So, this might be considered a performance enhancement
> > for scenarios with NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
> >
> > thanks
> > Chunxin
> >
> >
> >> If I understand your requirement correctly, you want to reduce the wakeup
> >> latency. There are some codes under developed by Peter, which could
> >> customized task's wakeup latency via setting its slice:
> >> https://lore.kernel.org/lkml/20240405110010.934104715@xxxxxxxxxxxxx/
> >>
> >> thanks,
> >> Chenyu
>