Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: Chunxin Zang
Date: Tue May 28 2024 - 03:19:29 EST


Hi Prateek

> On May 28, 2024, at 13:02, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>
> Hello Chunxin,
>
> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
>>
>>> On May 24, 2024, at 23:30, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>>>
>>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>>> I found that some tasks have been running long enough to become
>>>> ineligible, yet they still do not release the CPU. This increases the
>>>> scheduling delay of other processes. Therefore, I tried checking the
>>>> current task in wakeup_preempt and entity_tick, and rescheduling that
>>>> cfs_rq if the task is ineligible.
>>>>
>>>> The modification can reduce the scheduling delay by about 30% when
>>>> RUN_TO_PARITY is enabled.
>>>> So far, it has been running well in my test environment, and I have
>>>> pasted some test results below.
>>>>
>>>
>>> Interesting, besides hackbench, I assume that you have workloads in a
>>> real production environment that are sensitive to wakeup latency?
>>
>> Hi Chen
>>
>> Yes, my workloads are quite sensitive to wakeup latency.
>>>
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 03be0d1330a6..a0005d240db5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>>  			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>>  		return;
>>>>  #endif
>>>> +
>>>> +	if (!entity_eligible(cfs_rq, curr))
>>>> +		resched_curr(rq_of(cfs_rq));
>>>>  }
>>>>
>>>
>>> entity_tick() -> update_curr() -> update_deadline():
>>> se->vruntime >= se->deadline ? resched_curr()
>>> Only when current has expired its slice will it be scheduled out.
>>>
>>> So here you want to schedule current out if its lag becomes 0.
>>>
>>> In lastest sched/eevdf branch, it is controlled by two sched features:
>>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted its slice.
>>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>>
>>> Maybe something like this can achieve your goal
>>> if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
>>> 	resched_curr(rq_of(cfs_rq));
>>>
>>>>
>>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>>  		return;
>>>> 
>>>> +	if (!entity_eligible(cfs_rq, se))
>>>> +		goto preempt;
>>>> +
>>>
>>> Not sure if this is applicable: later in this function, pick_eevdf() checks
>>> whether the current task is eligible via !entity_eligible(cfs_rq, curr); if not,
>>> curr will be evicted. And this change does not consider the cgroup hierarchy.
>
> The above line will be referred to as [1] below.
>
>>>
>>> Besides, the check of current's eligibility can give a false-negative
>>> result if the enqueued entity has a positive lag. Prateek proposed to
>>> remove the check of current's eligibility in pick_eevdf():
>>> https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@xxxxxxx/
>>
>> Thank you for letting me know about Peter's latest updates and thoughts.
>> Actually, the original intention of my modification was to minimize
>> traversals of the rb-tree as much as possible. For example, in the following
>> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
>> 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
>> the resched, the scheduler calls 'pick_eevdf' again, traversing the rb-tree
>> once more. This ultimately results in the rb-tree being traversed twice. If we
>> can determine that 'curr' is ineligible directly within 'wakeup_preempt' and
>> trigger a 'resched' there, we save one rb-tree traversal.
>>
>>
>> wakeup_preempt -> pick_eevdf -> resched_curr
>>                       |-> 'traverse the rb-tree'
>>                                       |
>>                                  schedule -> pick_eevdf
>>                                                  |-> 'traverse the rb-tree'
>
> I see what you mean but a couple of things:
>
> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
> below for ease of interpretation)
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 03be0d1330a6..a0005d240db5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>  		return;
>> 
>> +	if (!entity_eligible(cfs_rq, se))
>> +		goto preempt;
>> +
>
> This check uses the root cfs_rq since "task_cfs_rq()" returns the
> "rq->cfs" of the runqueue the task is on. In the presence of cgroups or
> CONFIG_SCHED_AUTOGROUP, there is a good chance the task is queued
> on a higher-order cfs_rq, and this entity_eligible() calculation might
> not be valid, since the vruntime calculation for the "se" is relative to
> the "cfs_rq" it is queued on. Please correct me if I'm wrong, but
> I believe that is what Chenyu was referring to in [1].


Thank you for explaining so much to me; I am trying to understand all of this. :)

>
>> find_matching_se(&se, &pse);
>> WARN_ON_ONCE(!pse);
>>
>> --
>
> In addition to that, there is an update_curr() call below for the first
> cfs_rq on which both entities' hierarchies are queued, which is found by
> find_matching_se(). I believe that is required too, to update the
> vruntime and deadline of the entity where preemption can happen.
>
> If you want to circumvent a second call to pick_eevdf(), could you
> perhaps do:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..653b1bee1e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	update_curr(cfs_rq);
>  
>  	/*
> -	 * XXX pick_eevdf(cfs_rq) != se ?
> +	 * If the hierarchy of the current task is ineligible at the common
> +	 * point with the newly woken entity, there is a good chance of
> +	 * wakeup preemption by the newly woken entity. Mark for resched
> +	 * and allow pick_eevdf() in schedule() to judge which task to
> +	 * run next.
>  	 */
> -	if (pick_eevdf(cfs_rq) == pse)
> +	if (!entity_eligible(cfs_rq, se))
>  		goto preempt;
>  
>  	return;
>
> --
>
> There are other implications here, which is specifically highlighted by
> the "XXX pick_eevdf(cfs_rq) != se ?" comment. Even if the newly woken
> entity is not the one with the earliest eligible virtual deadline, the
> current task is still preempted if any other entity has it.
>
> Mike's box gave switching to the above two thumbs up; I have to check
> what my box says :)
>
> Following are DeathStarBench results with your original patch compared
> to v6.9-rc5 based tip:sched/core:
>
> ==================================================================
> Test           : DeathStarBench
> Why?           : Some tasks here do not like aggressive preemption
> Units          : Normalized throughput
> Interpretation : Higher is better
> Statistic      : Mean
> ==================================================================
> Pinning  scaling  tip   eager_preempt (pct imp)
> 1CCD     1        1.00  0.99 (%diff: -1.13%)
> 2CCD     2        1.00  0.97 (%diff: -3.21%)
> 4CCD     3        1.00  0.97 (%diff: -3.41%)
> 8CCD     6        1.00  0.97 (%diff: -3.20%)
> --

Please forgive me, as I have not used the DeathStarBench suite before. Does
this test result indicate that my modifications caused the tasks that do not
like aggressive preemption to be preempted even more often?

thanks
Chunxin

> I'll give the variants mentioned in the thread a try too, to see if
> some of my assumptions around heavy preemption hold good. I was also
> able to dig up an old patch by Balakumaran Kannan which skipped
> pick_eevdf() altogether if "pse" is ineligible. That also seems like
> a good optimization given the current check in
> check_preempt_wakeup_fair(), but it perhaps doesn't help the
> wakeup-latency sensitivity you are optimizing for; it only avoids the
> rb-tree traversal when there is no chance of pick_eevdf() returning "pse":
> https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@xxxxxxxxx/
>
> --
> Thanks and Regards,
> Prateek
>
>>
>>
>> Of course, this would break the semantics of RESPECT_SLICE as well as
>> RUN_TO_PARITY. So, this might be considered a performance enhancement
>> for scenarios running with NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
>>
>> thanks
>> Chunxin
>>
>>
>>> If I understand your requirement correctly, you want to reduce the wakeup
>>> latency. There is some code under development by Peter which could
>>> customize a task's wakeup latency via setting its slice:
>>> https://lore.kernel.org/lkml/20240405110010.934104715@xxxxxxxxxxxxx/
>>>
>>> thanks,
>>> Chenyu