Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is ineligible

From: Chunxin Zang
Date: Wed Jun 05 2024 - 21:47:26 EST

> On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> Hi Prateek, Chunxin,
>
> On 2024-05-28 at 10:32:23 +0530, K Prateek Nayak wrote:
>> Hello Chunxin,
>>
>> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
>>>
>>>> On May 24, 2024, at 23:30, Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>>>>
>>>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>>>> I found that some tasks have been running long enough to become
>>>>> ineligible, yet they still do not release the CPU. This increases
>>>>> the scheduling delay of other processes. Therefore, I tried checking
>>>>> the current task in wakeup_preempt and entity_tick, and rescheduling
>>>>> the cfs_rq if it is ineligible.
>>>>>
>>>>> The modification can reduce the scheduling delay by about 30% when
>>>>> RUN_TO_PARITY is enabled.
>>>>> So far, it has been running well in my test environment, and I have
>>>>> pasted some test results below.
>>>>>
>>>>
>>>> Interesting. Besides hackbench, I assume you have workloads in a real
>>>> production environment that are sensitive to wakeup latency?
>>>
>>> Hi Chen
>>>
>>> Yes, my workloads are quite sensitive to wakeup latency.
>>>>
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 03be0d1330a6..a0005d240db5 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>>>  			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>>>  		return;
>>>>>  #endif
>>>>> +
>>>>> +	if (!entity_eligible(cfs_rq, curr))
>>>>> +		resched_curr(rq_of(cfs_rq));
>>>>>  }
>>>>>
>>>>
>>>> entity_tick() -> update_curr() -> update_deadline():
>>>>     se->vruntime >= se->deadline ? resched_curr()
>>>> Only when current has expired its slice will it be scheduled out.
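>>>>
>>>> For reference, update_deadline() in that path looks roughly like this
>>>> (paraphrased from a recent tree, so details may vary):
>>>>
>>>> 	static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>> 	{
>>>> 		if ((s64)(se->vruntime - se->deadline) < 0)
>>>> 			return;
>>>>
>>>> 		/* EEVDF: vd_i = ve_i + r_i / w_i */
>>>> 		se->slice = sysctl_sched_base_slice;
>>>> 		se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>>>>
>>>> 		/* The task has consumed its request, reschedule. */
>>>> 		if (cfs_rq->nr_running > 1) {
>>>> 			resched_curr(rq_of(cfs_rq));
>>>> 			clear_buddies(cfs_rq, se);
>>>> 		}
>>>> 	}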
>>>>
>>>> So here you want to schedule current out if its lag becomes 0.
>>>>
>>>> In the latest sched/eevdf branch, it is controlled by two sched features:
>>>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted its slice.
>>>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>>>
>>>> Maybe something like this can achieve your goal:
>>>>
>>>> 	if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
>>>> 		resched_curr(rq_of(cfs_rq));
>>>>
>>>>>
>>>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>>>  		return;
>>>>>
>>>>> +	if (!entity_eligible(cfs_rq, se))
>>>>> +		goto preempt;
>>>>> +
>>>>
>>>> Not sure if this is applicable; later in this function, pick_eevdf() checks
>>>> whether current is eligible via !entity_eligible(cfs_rq, curr), and if not,
>>>> curr will be evicted. And this change does not consider the cgroup hierarchy.
>>
>> The above line will be referred to as [1] below.
>>
>>>>
>>>> Besides, the check of current's eligibility can get a false negative result
>>>> if the enqueued entity has a positive lag. Prateek proposed to remove the
>>>> check of current's eligibility in pick_eevdf():
>>>> https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@xxxxxxx/
>>>
>>> Thank you for letting me know about Peter's latest updates and thoughts.
>>> Actually, the original intention of my modification was to minimize
>>> traversal of the rb-tree as much as possible. For example, in the following
>>> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree
>>> in 'pick_eevdf' to return an optimal 'se' and then trigger 'resched_curr'.
>>> After the resched, the scheduler calls 'pick_eevdf' again, traversing the
>>> rb-tree once more, so the rb-tree ends up being traversed twice. If we can
>>> determine that 'curr' is ineligible within 'wakeup_preempt' and directly
>>> trigger a resched, we save one rb-tree traversal.
>>>
>>>
>>> wakeup_preempt -> pick_eevdf -> resched_curr
>>>                       |-> 'traverse the rb-tree'   |
>>>                                                    schedule -> pick_eevdf
>>>                                                                    |-> 'traverse the rb-tree'
>>
>> I see what you mean but a couple of things:
>>
>> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
>> below for ease of interpretation)
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>  		return;
>>>
>>> +	if (!entity_eligible(cfs_rq, se))
>>> +		goto preempt;
>>> +
>>
>> This check uses the root cfs_rq, since "task_cfs_rq()" returns the
>> "rq->cfs" of the runqueue the task is on. In the presence of cgroups or
>> CONFIG_SCHED_AUTOGROUP, there is a good chance the task is queued
>> on a higher-order cfs_rq, in which case this entity_eligible() calculation
>> might not be valid, since the vruntime of the "se" is relative to the
>> "cfs_rq" it is queued on. Please correct me if I'm wrong, but I believe
>> that is what Chenyu was referring to in [1].
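>>
>> A minimal sketch of what a hierarchy-aware check would need to do first
>> (untested, purely illustrative; helpers as in current fair.c):
>>
>> 	find_matching_se(&se, &pse);
>> 	if (se && !entity_eligible(cfs_rq_of(se), se))
>> 		goto preempt;
>>
>> i.e., walk both hierarchies up to the common cfs_rq before judging
>> eligibility on that cfs_rq.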
>>
>
> Sorry for the late reply, and thanks for helping clarify this. Yes, this is
> what my previous concern was:
> 1. It does not consider the cgroup hierarchy, and does not check preemption
> at the same level, which is covered by find_matching_se().
> 2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant,
> because pick_eevdf() will check the eligibility of current later anyway. But
> as pointed out by Chunxin, his concern is the double traversal of the rb-tree,
> so I just wonder if we could leverage cfs_rq->next to store the next
> candidate, so it can be picked directly in the 2nd pick as a fast path?
> Something like below, untested:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..f716646d595e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
>  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
>  	struct task_struct *curr = rq->curr;
> -	struct sched_entity *se = &curr->se, *pse = &p->se;
> +	struct sched_entity *se = &curr->se, *pse = &p->se, *next;
>  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>  	int cse_is_idle, pse_is_idle;
>
> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	/*
>  	 * XXX pick_eevdf(cfs_rq) != se ?
>  	 */
> -	if (pick_eevdf(cfs_rq) == pse)
> +	next = pick_eevdf(cfs_rq);
> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> +		set_next_buddy(next);
> +
> +	if (next == pse)
>  		goto preempt;
>
>  	return;
>
>
> thanks,
> Chenyu

Hi Chen

First of all, thank you for your patient response. Regarding the issue of avoiding
traversing the rb-tree twice, I initially had two methods in mind:
1. Cache the optimal result so that it can be used directly during the second
pick_eevdf operation. This idea is similar to the one you proposed this time.
2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair',
because I believe that 'checking whether preemption is necessary' and 'finding the
optimal process to schedule' are two different things. 'check_preempt_wakeup_fair'
is not just a check of whether the newly woken process should preempt the current
process; it can also serve as an opportunity to check whether any other process
should preempt the current one, thereby improving the responsiveness of the
scheduler. Although pick_eevdf now also evaluates the eligibility of 'curr', the
current process will still not be preempted if the result returned is not the woken
process. That is why I posted the v2 PATCH; its implementation may express this
point more clearly.
https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@xxxxxxxxx/T/

I previously implemented and tested both of these methods, and the results showed
that method 2 brought the more obvious benefit, so I submitted method 2. Now that
I think about it, perhaps the two methods could be viable at the same time. :)

thanks
Chunxin
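
The draft of method 1 below caches the result of the rb-tree heap search in a
new cfs_rq->pick_cached field and invalidates it whenever the tree or the
zero-lag point changes, so a second pick_eevdf() shortly after a wakeup can
reuse the previous result instead of traversing the tree again: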

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..f67894d8fbc8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -563,6 +563,8 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return (s64)(se->vruntime - cfs_rq->min_vruntime);
 }

+static void unset_pick_cached(struct cfs_rq *cfs_rq);
+
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)

@@ -632,6 +634,8 @@ avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)

 	cfs_rq->avg_vruntime += key * weight;
 	cfs_rq->avg_load += weight;
+
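+	/* The weighted tree just changed; any cached pick may be stale */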
+	unset_pick_cached(cfs_rq);
 }
 
 static void
@@ -642,6 +646,8 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	cfs_rq->avg_vruntime -= key * weight;
 	cfs_rq->avg_load -= weight;
+
+	unset_pick_cached(cfs_rq);
 }

 static inline
@@ -651,6 +657,8 @@ void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
 	 * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
 	 */
 	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+
+	unset_pick_cached(cfs_rq);
 }
 
 /*
@@ -745,6 +753,36 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }

+static struct sched_entity *try_to_get_pick_cached(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = cfs_rq->pick_cached;
+
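+	/* Trust the cached entity only while it is still enqueued */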
+	return (se && se->on_rq) ? se : NULL;
+}
+
+static void unset_pick_cached(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->pick_cached = NULL;
+}
+
+static void set_pick_cached(struct sched_entity *se)
+{
+	if (!se || !se->on_rq)
+		return;
+
+	cfs_rq_of(se)->pick_cached = se;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -856,6 +894,51 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 	return __node_2_se(left);
 }

+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
+	struct sched_entity *best = NULL;
+
+	/* Pick the leftmost entity if it's eligible */
+	if (se && entity_eligible(cfs_rq, se))
+		return se;
+
+	/* Heap search for the EEVD entity */
+	while (node) {
+		struct rb_node *left = node->rb_left;
+
+		/*
+		 * Eligible entities in left subtree are always better
+		 * choices, since they have earlier deadlines.
+		 */
+		if (left && vruntime_eligible(cfs_rq,
+					__node_2_se(left)->min_vruntime)) {
+			node = left;
+			continue;
+		}
+
+		se = __node_2_se(node);
+
+		/*
+		 * The left subtree either is empty or has no eligible
+		 * entity, so check the current node since it is the one
+		 * with earliest deadline that might be eligible.
+		 */
+		if (entity_eligible(cfs_rq, se)) {
+			best = se;
+			break;
+		}
+
+		node = node->rb_right;
+	}
+
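+	/* Cache the heap-search result so a back-to-back pick can reuse it */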
+	if (best)
+		set_pick_cached(best);
+
+	return best;
+}
+
 /*
  * Earliest Eligible Virtual Deadline First
  *
@@ -877,7 +960,6 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  */
 static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 {
-	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
@@ -899,41 +981,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
 		return curr;
 
-	/* Pick the leftmost entity if it's eligible */
-	if (se && entity_eligible(cfs_rq, se)) {
-		best = se;
-		goto found;
-	}
+	best = try_to_get_pick_cached(cfs_rq);
+	if (best && !entity_eligible(cfs_rq, best))
+		best = NULL;
 
-	/* Heap search for the EEVD entity */
-	while (node) {
-		struct rb_node *left = node->rb_left;
-
-		/*
-		 * Eligible entities in left subtree are always better
-		 * choices, since they have earlier deadlines.
-		 */
-		if (left && vruntime_eligible(cfs_rq,
-					__node_2_se(left)->min_vruntime)) {
-			node = left;
-			continue;
-		}
-
-		se = __node_2_se(node);
+	if (!best)
+		best = __pick_eevdf(cfs_rq);
 
-		/*
-		 * The left subtree either is empty or has no eligible
-		 * entity, so check the current node since it is the one
-		 * with earliest deadline that might be eligible.
-		 */
-		if (entity_eligible(cfs_rq, se)) {
-			best = se;
-			break;
-		}
-
-		node = node->rb_right;
-	}
-found:
 	if (!best || (curr && entity_before(curr, best)))
 		best = curr;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..373241075449 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,6 +597,7 @@ struct cfs_rq {
 	 */
 	struct sched_entity	*curr;
 	struct sched_entity	*next;
+	struct sched_entity	*pick_cached;
 
 #ifdef CONFIG_SCHED_DEBUG
 	unsigned int		nr_spread_over;
--
2.34.1


>
>>>  	find_matching_se(&se, &pse);
>>>  	WARN_ON_ONCE(!pse);
>>>
>>> --
>>
>> In addition to that, there is an update_curr() call below for the first
>> cfs_rq where both entities' hierarchies are queued, which is found by
>> find_matching_se(). I believe that is also required to update the
>> vruntime and deadline of the entity where preemption can happen.
>>
>> If you want to circumvent a second call to pick_eevdf(), could you
>> perhaps do:
>>
>> (Only build tested)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9eb63573110c..653b1bee1e62 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>  	update_curr(cfs_rq);
>>
>>  	/*
>> -	 * XXX pick_eevdf(cfs_rq) != se ?
>> +	 * If the hierarchy of the current task is ineligible at the common
>> +	 * point with the newly woken entity, there is a good chance of
>> +	 * wakeup preemption by the newly woken entity. Mark for resched
>> +	 * and allow pick_eevdf() in schedule() to judge which task to
>> +	 * run next.
>>  	 */
>> -	if (pick_eevdf(cfs_rq) == pse)
>> +	if (!entity_eligible(cfs_rq, se))
>>  		goto preempt;
>>
>>  	return;
>>
>> --
>>
>> There are other implications here, specifically highlighted by the
>> "XXX pick_eevdf(cfs_rq) != se ?" comment. If the waking entity is not
>> the entity with the earliest eligible virtual deadline, the current
>> task is still preempted if any other entity has the EEVD.
>>
>> Mike's box gave switching to above two thumbs up; I have to check what
>> my box says :)
>>
>> Following are DeathStarBench results with your original patch compared
>> to v6.9-rc5 based tip:sched/core:
>>
>> ==================================================================
>> Test : DeathStarBench
>> Why? : Some tasks here do not like aggressive preemption
>> Units : Normalized throughput
>> Interpretation: Higher is better
>> Statistic : Mean
>> ==================================================================
>> Pinning        scaling     tip     eager_preempt (pct imp)
>>  1CCD             1        1.00    0.99 (%diff: -1.13%)
>>  2CCD             2        1.00    0.97 (%diff: -3.21%)
>>  4CCD             3        1.00    0.97 (%diff: -3.41%)
>>  8CCD             6        1.00    0.97 (%diff: -3.20%)
>> --
>>
>> I'll give the variants mentioned in the thread a try too, to see if
>> some of my assumptions around heavy preemption hold good. I was also
>> able to dig up an old patch by Balakumaran Kannan which skipped
>> pick_eevdf() altogether if "pse" is ineligible. That also seems like
>> a good optimization given the current check in
>> check_preempt_wakeup_fair(), but it perhaps doesn't help the
>> wakeup-latency sensitivity you are optimizing for; it only reduces
>> rb-tree traversals when there is no chance of pick_eevdf() returning "pse":
>> https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@xxxxxxxxx/
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
>>>
>>>
>>> Of course, this would break the semantics of RESPECT_SLICE as well as
>>> RUN_TO_PARITY. So, this might be considered a performance enhancement
>>> for scenarios where RESPECT_SLICE/RUN_TO_PARITY are enabled (i.e.,
>>> without NO_RESPECT_SLICE/NO_RUN_TO_PARITY).
>>>
>>> thanks
>>> Chunxin
>>>
>>>
>>>> If I understand your requirement correctly, you want to reduce wakeup
>>>> latency. There is some code under development by Peter which allows
>>>> customizing a task's wakeup latency by setting its slice:
>>>> https://lore.kernel.org/lkml/20240405110010.934104715@xxxxxxxxxxxxx/
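>>>>
>>>> A hypothetical user-space sketch of how that interface might be used,
>>>> assuming the series lands with sched_attr::sched_runtime carrying the
>>>> requested slice in nanoseconds:
>>>>
>>>> 	#include <stdint.h>
>>>> 	#include <stdio.h>
>>>> 	#include <string.h>
>>>> 	#include <unistd.h>
>>>> 	#include <sys/syscall.h>
>>>>
>>>> 	/* VER0 subset of struct sched_attr, see sched_setattr(2) */
>>>> 	struct sched_attr {
>>>> 		uint32_t size;
>>>> 		uint32_t sched_policy;
>>>> 		uint64_t sched_flags;
>>>> 		int32_t  sched_nice;
>>>> 		uint32_t sched_priority;
>>>> 		uint64_t sched_runtime;		/* with the series: slice request, ns */
>>>> 		uint64_t sched_deadline;
>>>> 		uint64_t sched_period;
>>>> 	};
>>>>
>>>> 	int main(void)
>>>> 	{
>>>> 		struct sched_attr attr;
>>>>
>>>> 		memset(&attr, 0, sizeof(attr));
>>>> 		attr.size = sizeof(attr);
>>>> 		attr.sched_policy = 0;		/* SCHED_OTHER */
>>>> 		attr.sched_runtime = 100000;	/* request a ~0.1ms slice */
>>>>
>>>> 		if (syscall(SYS_sched_setattr, 0, &attr, 0))
>>>> 			perror("sched_setattr");
>>>>
>>>> 		return 0;
>>>> 	}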
>>>>
>>>> thanks,
>>>> Chenyu