Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: Chunxin Zang
Date: Tue Jun 11 2024 - 09:11:19 EST




> On Jun 7, 2024, at 10:38, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>>
>>
>>> On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>>>
>>>
>>> Sorry for the late reply and thanks for helping clarify this. Yes, this is
>>> what my previous concern was:
>>> 1. It does not consider the cgroup and does not check preemption at the same
>>> level, which is covered by find_matching_se().
>>> 2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant because
>>> pick_eevdf() will check the eligibility of current later anyway. But
>>> as pointed out by Chunxin, his concern is the double traversal of the rb-tree,
>>> so I just wonder if we could leverage cfs_rq->next to store the next
>>> candidate, so it can be picked directly in the 2nd pick as a fast path?
>>> Something like below untested:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8a5b1ae0aa55..f716646d595e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
>>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>> {
>>> struct task_struct *curr = rq->curr;
>>> - struct sched_entity *se = &curr->se, *pse = &p->se;
>>> + struct sched_entity *se = &curr->se, *pse = &p->se, *next;
>>> struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>>> int cse_is_idle, pse_is_idle;
>>>
>>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> /*
>>> * XXX pick_eevdf(cfs_rq) != se ?
>>> */
>>> - if (pick_eevdf(cfs_rq) == pse)
>>> + next = pick_eevdf(cfs_rq);
>>> + if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
>>> + set_next_buddy(next);
>>> +
>>> + if (next == pse)
>>> goto preempt;
>>>
>>> return;
>>>
>>>
>>> thanks,
>>> Chenyu
>>
>> Hi Chen
>>
>> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
>> the RB-tree twice, I initially had two methods in mind.
>> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
>> This idea is similar to the one you proposed this time.
>> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair',
>> because I believe that 'checking whether preemption is necessary' and 'finding the optimal
>> process to schedule' are two different things.
>
> I agree, and it seems that in the current EEVDF implementation the former relies on the latter.
>
>> 'check_preempt_wakeup_fair' is not just there to
>> check if the newly awakened process should preempt the current process; it can also serve
>> as an opportunity to check whether any other process should preempt the current one,
>> thereby improving the real-time behaviour of the scheduler. Although pick_eevdf now also
>> evaluates the eligibility of 'curr', if the entity it returns is not the awakened process,
>> the current process will still not be preempted.
>
> I think Mike has proposed a patch to deal with the scenario you mentioned above:
> https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@xxxxxx/
>
> And I suppose you are referring to increasing the chance of preempting current rather than
> reducing the number of pick_eevdf() invocations in check_preempt_wakeup_fair().

Hi Chen

Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption,
and thus lower scheduling latency, while also genuinely reducing the number of calls to pick_eevdf.
It's a win-win. :)
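
To make the intent concrete, the shape of the idea is roughly the following (just a simplified
sketch, not the actual v2 diff): answer 'does current need to be preempted' from current itself
whenever possible, and only fall back to pick_eevdf() when that is inconclusive:

	update_curr(cfs_rq);

	/*
	 * If current is no longer eligible (and, with RUN_TO_PARITY, its
	 * slice protection has lapsed, i.e. se->vlag != se->deadline), we
	 * already know a reschedule is due; there is no need to walk the
	 * rb-tree just to find out who the next candidate is.
	 */
	if ((sched_feat(RUN_TO_PARITY) && se->vlag != se->deadline &&
	     !entity_eligible(cfs_rq, se)) ||
	    (!sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se)))
		goto preempt;

The instrumentation below does not take this early-out; it keeps calling pick_eevdf() and only
counts (as patch_preempt_count) how often the condition above fires.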

I ran a test: I applied my modifications on top of Mike's patch and added some statistics
counters, following your earlier method, in order to assess the potential benefit of my changes.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c5453866899f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8283,6 +8286,10 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
+ bool patch_preempt = false;
+ bool pick_preempt = false;
+
+ schedstat_inc(rq->check_preempt_count);

if (unlikely(se == pse))
return;
@@ -8343,15 +8350,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);

+ if ((sched_feat(RUN_TO_PARITY) && se->vlag != se->deadline && !entity_eligible(cfs_rq, se))
+ || (!sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se))) {
+ schedstat_inc(rq->patch_preempt_count);
+ patch_preempt = true;
+ }
+
/*
* XXX pick_eevdf(cfs_rq) != se ?
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq) != se) {
+ schedstat_inc(rq->pick_preempt_count);
+ pick_preempt = true;
goto preempt;
+ }

return;

preempt:
+ if (patch_preempt && !pick_preempt)
+ schedstat_inc(rq->patch_preempt_only_count);
+ if (!patch_preempt && pick_preempt)
+ schedstat_inc(rq->pick_preempt_only_count);
+
+ schedstat_inc(rq->need_preempt_count);
+
resched_curr(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..002c6b0f966a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1141,6 +1141,12 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+ unsigned int check_preempt_count;
+ unsigned int need_preempt_count;
+ unsigned int patch_preempt_count;
+ unsigned int patch_preempt_only_count;
+ unsigned int pick_preempt_count;
+ unsigned int pick_preempt_only_count;
#endif

#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..fe5487572409 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,21 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu *** %u %u * %u %u * %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->check_preempt_count,
+ rq->need_preempt_count,
+ rq->patch_preempt_count,
+ rq->patch_preempt_only_count,
+ rq->pick_preempt_count,
+ rq->pick_preempt_only_count);
+

seq_printf(seq, "\n");

The test results are as follows:

RUN_TO_PARITY:
                                  EEVDF      PATCH
.stat.check_preempt_count       5053054    5029546
.stat.need_preempt_count        0570520    1282780
.stat.patch_preempt_count       -------    0038602
.stat.patch_preempt_only_count  -------    0000000
.stat.pick_preempt_count        -------    1282780
.stat.pick_preempt_only_count   -------    1244178

NO_RUN_TO_PARITY:
                                  EEVDF      PATCH
.stat.check_preempt_count       5018589    5005812
.stat.need_preempt_count        3380513    2994773
.stat.patch_preempt_count       -------    0907927
.stat.patch_preempt_only_count  -------    0000000
.stat.pick_preempt_count        -------    2994773
.stat.pick_preempt_only_count   -------    2086846

Looking at the results, adding an ineligibility check for se within check_preempt_wakeup_fair
can avoid about 3% of the pick_eevdf calls under the RUN_TO_PARITY feature, and about 30% of
them under NO_RUN_TO_PARITY. It also turns out that patch_preempt_only_count is 0, which indicates
that the ineligibility check for se never triggers a preemption that pick_eevdf would not also
have triggered.
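
(To spell out the arithmetic: the 3% is patch_preempt_count / pick_preempt_count, 38602 / 1282780,
and the 30% is 907927 / 2994773, i.e. the fraction of preemption-causing pick_eevdf calls that the
ineligibility check alone would already have caught.)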

It's worth mentioning that under the RUN_TO_PARITY feature, the number of preemptions triggered
by 'pick_eevdf(cfs_rq) != se' is about 2.25 times that of the original 'pick_eevdf(cfs_rq) == pse'
version, which could lead to a series of other performance issues. Logically speaking, though,
this behaviour is reasonable. :(
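
(The factor comes from need_preempt_count: 1282780 / 570520 is about 2.25.)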


>
>> Therefore, I posted the v2 PATCH.
>> The implementation of v2 PATCH might express this point more clearly.
>> https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@xxxxxxxxx/T/
>>
>
> Let me take a look at it and do some tests.

Thank you for doing this :)

>
>> I previously implemented and tested both of these methods, and the test results showed that
>> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
>> think about it, perhaps method 1 could also be viable at the same time. :)
>>
>
> Actually I found that, even without any changes, enabling the sched feature NEXT_BUDDY reduces
> both the wakeup latency and the request latency. The following is the schbench result on a
> 240-CPU system:
>
> NO_NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
>        50.0th: 6 (429125 samples)
>        90.0th: 14 (682355 samples)
>       * 99.0th: 29 (126695 samples)
>        99.9th: 529 (14603 samples)
>        min=1, max=4741
> Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
>        50.0th: 14992 (550939 samples)
>        90.0th: 15376 (668687 samples)
>       * 99.0th: 15600 (128111 samples)
>        99.9th: 15888 (11238 samples)
>        min=3528, max=31677
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
>        20.0th: 16864 (31 samples)
>       * 50.0th: 16928 (26 samples)
>        90.0th: 17248 (36 samples)
>        min=16615, max=20041
> average rps: 17025.23
>
> NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
>        50.0th: 5 (376845 samples)
>        90.0th: 12 (632075 samples)
>       * 99.0th: 24 (114398 samples)
>        99.9th: 105 (13737 samples)
>        min=1, max=7428
> Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
>        50.0th: 14480 (524763 samples)
>        90.0th: 15216 (647982 samples)
>       * 99.0th: 15472 (130730 samples)
>        99.9th: 15728 (13980 samples)
>        min=3542, max=34805
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
>        20.0th: 16544 (62 samples)
>       * 50.0th: 16544 (0 samples)
>        90.0th: 16608 (37 samples)
>        min=16470, max=16648
> average rps: 16572.68
>
> So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
>
> thanks,
> Chenyu

I'm not completely sure my understanding is correct, but NEXT_BUDDY can only cache the process
that has just been woken up, which is not necessarily the entity pick_eevdf would return. Furthermore,
even if it did cache the result returned by pick_eevdf, by the time the next scheduling decision is made,
other processes may have been enqueued or dequeued, so the cached entity might no longer be what
pick_eevdf would pick at that moment. Hence it is a 'best effort' approach, and its impact on
scheduling latency may vary depending on the use case.
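
For reference, my understanding comes from pick_next_entity(), which in my tree looks roughly
like the following (paraphrased, so details may differ between kernel versions):

	static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
	{
		/*
		 * cfs_rq->next is the buddy cached by set_next_buddy() at
		 * wakeup time; use it only while it is still eligible.
		 */
		if (sched_feat(NEXT_BUDDY) &&
		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
			return cfs_rq->next;

		/* Otherwise do the full EEVDF pick from the rb-tree. */
		return pick_eevdf(cfs_rq);
	}

So cfs_rq->next is only a hint that short-cuts the rb-tree walk when the cached wakee happens
to still be eligible, and it is ignored otherwise.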

thanks
Chunxin