Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: K Prateek Nayak
Date: Tue May 28 2024 - 03:47:35 EST


Hello Chunxin,

On 5/28/2024 12:48 PM, Chunxin Zang wrote:
> [..snip..]
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>  		return;
>>>
>>> +	if (!entity_eligible(cfs_rq, se))
>>> +		goto preempt;
>>> +
>>
>> This check uses the root cfs_rq since "task_cfs_rq()" returns the
>> "rq->cfs" of the runqueue the task is on. In the presence of cgroups or
>> CONFIG_SCHED_AUTOGROUP, there is a good chance that the task is queued
>> on a higher order cfs_rq, and this entity_eligible() calculation might
>> not be valid since the vruntime calculation for the "se" is relative to
>> the "cfs_rq" on which it is queued. Please correct me if I'm wrong, but
>> I believe that is what Chenyu was referring to in [1].
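
A minimal userspace sketch (not kernel code) of why the queue matters for
the eligibility check. It assumes the usual EEVDF rule that an entity is
eligible when its vruntime is at or behind the weighted average vruntime of
the queue it is compared against. The toy_* names and the numbers are
hypothetical and only serve the illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel's sched_entity / cfs_rq. */
struct toy_entity {
	int64_t vruntime;	/* virtual runtime, meaningful only within one queue */
	int64_t weight;		/* load weight */
};

struct toy_cfs_rq {
	const struct toy_entity *ents;	/* entities queued here */
	size_t nr;
	int64_t min_vruntime;		/* reference point of this queue */
};

/*
 * Eligible when vruntime <= weighted average vruntime of the queue.
 * Written without a division:
 *   sum(w_i * (v_i - v0)) >= (v - v0) * sum(w_i)
 */
static bool toy_vruntime_eligible(const struct toy_cfs_rq *rq, int64_t vruntime)
{
	int64_t avg = 0, load = 0;

	for (size_t i = 0; i < rq->nr; i++) {
		avg += (rq->ents[i].vruntime - rq->min_vruntime) * rq->ents[i].weight;
		load += rq->ents[i].weight;
	}

	return avg >= (vruntime - rq->min_vruntime) * load;
}
```

With a group queue holding vruntimes {100, 200} and a root queue holding
{1000, 1200} (equal weights), a vruntime of 180 is ineligible on the group
queue where it actually belongs, yet looks eligible when tested against the
root queue. The answer is only meaningful against the queue the entity is
queued on.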
>
>
> Thank you for explaining so much to me; I am trying to understand all of this. :)
>
>>
>>>  	find_matching_se(&se, &pse);
>>>  	WARN_ON_ONCE(!pse);
>>>
>>> --
>>
>> In addition to that, there is an update_curr() call below for the first
>> cfs_rq where both entities' hierarchies are queued, which is found by
>> find_matching_se(). I believe that is required too, to update the
>> vruntime and deadline of the entity where preemption can happen.
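
A rough userspace sketch of why the update matters, assuming vruntime
advances by the executed time scaled by NICE_0 weight over the entity's
weight. All toy_* names are hypothetical, and the queue is reduced to
current plus one other entity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024	/* assumed weight of a nice-0 task */

/* Toy model of the running entity; NOT the kernel's sched_entity. */
struct toy_curr {
	int64_t vruntime;
	int64_t weight;
	int64_t exec_start;	/* time of the last accounting update */
};

/* Charge the time run since exec_start to vruntime, scaled by weight. */
static void toy_update_curr(struct toy_curr *c, int64_t now)
{
	int64_t delta_exec = now - c->exec_start;

	if (delta_exec <= 0)
		return;

	c->vruntime += delta_exec * TOY_NICE_0_LOAD / c->weight;
	c->exec_start = now;
}

/*
 * Eligibility of current against the weighted average of {curr, other},
 * compared without dividing: sum(w_i * v_i) >= v_curr * sum(w_i).
 */
static bool toy_curr_eligible(const struct toy_curr *c,
			      int64_t other_vruntime, int64_t other_weight)
{
	int64_t load = c->weight + other_weight;
	int64_t sum = c->vruntime * c->weight + other_vruntime * other_weight;

	return sum >= c->vruntime * load;
}
```

Starting from a vruntime of 900 against another entity at 1000, current
looks eligible until the 400 units it has already run are accounted for.
After toy_update_curr() its vruntime is 1300 and the same check reports it
as ineligible, so checking before the update would use stale data.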
>>
>> If you want to circumvent a second call to pick_eevdf(), could you
>> perhaps do:
>>
>> (Only build tested)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9eb63573110c..653b1bee1e62 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>  	update_curr(cfs_rq);
>>
>>  	/*
>> -	 * XXX pick_eevdf(cfs_rq) != se ?
>> +	 * If the hierarchy of current task is ineligible at the common
>> +	 * point on the newly woken entity, there is a good chance of
>> +	 * wakeup preemption by the newly woken entity. Mark for resched
>> +	 * and allow pick_eevdf() in schedule() to judge which task to
>> +	 * run next.
>>  	 */
>> -	if (pick_eevdf(cfs_rq) == pse)
>> +	if (!entity_eligible(cfs_rq, se))
>>  		goto preempt;
>>
>>  	return;
>>
>> --
>>
>> There are other implications here, which are specifically highlighted by
>> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the current waking
>> entity is not the entity with the earliest eligible virtual deadline,
>> the current task is still preempted if any other entity has the EEVD.
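
To make the difference between the two conditions concrete, here is a toy
userspace model (not the kernel implementation; the real pick_eevdf() has
additional logic that is ignored here). The toy_* names and the numbers are
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy entity; NOT the kernel's sched_entity. */
struct toy_se {
	int64_t vruntime;
	int64_t deadline;	/* virtual deadline */
	int64_t weight;
};

/* Eligible when vruntime <= weighted average vruntime of the queue. */
static bool toy_eligible(const struct toy_se *q, size_t nr,
			 const struct toy_se *se)
{
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < nr; i++) {
		sum += q[i].vruntime * q[i].weight;
		load += q[i].weight;
	}

	return sum >= se->vruntime * load;
}

/* Simplified pick: earliest virtual deadline among eligible entities. */
static const struct toy_se *toy_pick(const struct toy_se *q, size_t nr)
{
	const struct toy_se *best = NULL;

	for (size_t i = 0; i < nr; i++) {
		if (!toy_eligible(q, nr, &q[i]))
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}

	return best;
}

/* Old condition: preempt only if the woken entity would be picked. */
static bool toy_old_check(const struct toy_se *q, size_t nr,
			  const struct toy_se *pse)
{
	return toy_pick(q, nr) == pse;
}

/* Proposed condition: preempt whenever current is ineligible. */
static bool toy_new_check(const struct toy_se *q, size_t nr,
			  const struct toy_se *se)
{
	return !toy_eligible(q, nr, se);
}
```

Take three equal-weight entities: current (vruntime 300, deadline 400), the
woken entity (vruntime 100, deadline 350) and a third entity (vruntime 100,
deadline 250). The old check does not preempt because the pick is the third
entity, not the woken one. The new check does preempt because current is
ineligible, and the pick in schedule() then selects the third entity.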
>>
>> Mike's box gave switching to above two thumbs up; I have to check what
>> my box says :)
>>
>> Following are DeathStarBench results with your original patch compared
>> to v6.9-rc5 based tip:sched/core:
>>
>> ==================================================================
>> Test : DeathStarBench
>> Why? : Some tasks here do not like aggressive preemption
>> Units : Normalized throughput
>> Interpretation: Higher is better
>> Statistic : Mean
>> ==================================================================
>> Pinning   scaling   tip    eager_preempt (pct imp)
>> 1CCD      1         1.00   0.99 (%diff: -1.13%)
>> 2CCD      2         1.00   0.97 (%diff: -3.21%)
>> 4CCD      3         1.00   0.97 (%diff: -3.41%)
>> 8CCD      6         1.00   0.97 (%diff: -3.20%)
>> --
>
> Please forgive me, as I have not used the DeathStarBench suite before. Does
> this test result indicate that my modifications have resulted in tasks that do not
> like aggressive preemption being even less likely to be preempted?

It is actually the opposite. In the case of DeathStarBench, the nginx server
tasks that serve as the entry point into the microservice chain do not like
to be preempted. A regression generally indicates that these tasks have very
likely been preempted, which causes the throughput to drop. More information
on DeathStarBench and the problem is available at
https://lore.kernel.org/lkml/20240325060226.1540-1-kprateek.nayak@xxxxxxx/

I'll test with more workloads later today and update the thread. Please
forgive any delay; I'm slowly crawling through a backlog of
testing.

--
Thanks and Regards,
Prateek

>
> thanks
> Chunxin
>
>> I'll give the variants mentioned in the thread a try too to see if
>> some of my assumptions around heavy preemption hold good. I was also
>> able to dig up an old patch by Balakumaran Kannan which skipped
>> pick_eevdf() altogether if "pse" is ineligible which also seems like
>> a good optimization based on the current check in
>> check_preempt_wakeup_fair() but it perhaps doesn't help the case of
>> wakeup-latency sensitivity you are optimizing for; only reduces
>> rb-tree traversal if there is no chance of pick_eevdf() returning "pse"
>> https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@xxxxxxxxx/
>>
>> [..snip..]
>>