Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

From: benbjiang(蒋彪)
Date: Thu Aug 06 2020 - 12:32:09 EST


Hi,

> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>
>>
>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>
>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>> From: Jiang Biao <benbjiang@xxxxxxxxxxx>
>>>>
>>>> No need to preempt when there are only one runable CFS task with
>>>> other IDLE tasks on runqueue. The only one CFS task would always
>>>> be picked in that case.
>>>>
>>>> Signed-off-by: Jiang Biao <benbjiang@xxxxxxxxxxx>
>>>> ---
>>>> kernel/sched/fair.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 04fa8dbcfa4d..8fb80636b010 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4527,7 +4527,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>> return;
>>>> #endif
>>>>
>>>> - if (cfs_rq->nr_running > 1)
>>>> + if (cfs_rq->nr_running > cfs_rq.idle_h_nr_running + 1)
>>>
>>> cfs_rq is a pointer.
>> It is. Sorry about that. :)
>>
>>>
>>>> check_preempt_tick(cfs_rq, curr);
>>>> }
>>>
>>> You can't compare cfs_rq->nr_running with cfs_rq->idle_h_nr_running!
>>>
>>> There is a difference between cfs_rq->h_nr_running and
>>> cfs_rq->nr_running. The '_h_' stands for hierarchical.
>>>
>>> The former gives you hierarchical task accounting whereas the latter is
>>> the number of sched entities (representing tasks or taskgroups) enqueued
>>> in cfs_rq.
>>>
>>> In entity_tick(), cfs_rq->nr_running has to be used for the condition to
>>> call check_preempt_tick(). We want to check if curr has to be preempted
>>> by __pick_first_entity(cfs_rq) on this cfs_rq.
>>>
>>> entity_tick() is called for each sched entity (and so for each
>>> cfs_rq_of(se)) of the task group hierarchy (e.g. task p running in
>>> taskgroup /A/B : se(p) -> se(A/B) -> se(A)).
>> That’s true. I was thinking adding a new cfs_rq->idle_nr_running member to
>> track the per cfs_rq's IDLE task number, and reducing preemption here based
>> on that.
>
> How would you deal with se's representing taskgroups which contain
> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that?
I’m not sure I get the point. :) How about the following patch,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fa8dbcfa4d..8715f03ed6d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
list_add(&se->group_node, &rq->cfs_tasks);
}
#endif
+ if (task_has_idle_policy(task_of(se)))
+ cfs_rq->idle_nr_running++;
+
cfs_rq->nr_running++;
}

@@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
list_del_init(&se->group_node);
}
#endif
+ if (task_has_idle_policy(task_of(se)))
+ cfs_rq->idle_nr_running--;
+
cfs_rq->nr_running--;
}

@@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
return;
#endif

- if (cfs_rq->nr_running > 1)
+ if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
+ cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
check_preempt_tick(cfs_rq, curr);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 877fb08eb1b0..401090393e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -500,6 +500,7 @@ struct cfs_bandwidth { };
struct cfs_rq {
struct load_weight load;
unsigned int nr_running;
+ unsigned int idle_nr_running;
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
>
>> I’m not sure if it’s ok to do that, because the IDLE class seems not to be so
>> pure that could tolerate starving.
>
> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE
> (policy)?
The case is that we need tasks(low priority, called offline tasks) to utilize the
spare cpu left by CFS SCHED_NORMAL tasks(called online tasks) without
interfering the online tasks.
Offline tasks only run when there’s no runnable online tasks, and offline tasks
never preempt online tasks.
The SCHED_IDLE policy seems not to be abled to be qualified for that requirement,
because it has a weight(3), even though it’s small, but it can still preempt online
tasks considering the fairness. In that way, offline tasks of SCHED_IDLE policy
could interfere the online tasks.
On the other hand, idle_sched_class seems not to be qualified either. It’s too
simple and only used for per-cpu idle task currently.

Thx.
Regards,
Jiang

>
>> We need an absolutely low priority class that could tolerate starving, which
>> could be used to co-locate offline tasks. But IDLE class seems to be not
>> *low* enough, if considering the fairness of CFS, and IDLE class still has a
>> weight.
>
> [...]
>