Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

From: benbjiang(蒋彪)
Date: Wed Aug 19 2020 - 20:13:52 EST




> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <benbjiang@xxxxxxxxxxx> wrote:
>>
>>
>>
>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>
>>> On 19/08/2020 13:05, Vincent Guittot wrote:
>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>
>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>>>>>
>>>>>>
>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>
>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>>>>>> From: Jiang Biao <benbjiang@xxxxxxxxxxx>
>>>>>
>>>>> [...]
>>>>>
>>>>>>> Are you sure about this?
>>>>>> Yes. :)
>>>>>>>
>>>>>>> The math is telling me for the:
>>>>>>>
>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>>>>>
>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>>>>>
>>>>>>> (4ms - 250 Hz)
>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>>>>>
>>>>> OK, I see.
>>>>>
>>>>> But here the different sched slices (check_preempt_tick()->
>>>>> sched_slice()) between normal tasks and the idle task play a role too.
>>>>>
>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>>>>
>>>> In fact that depends on the number of CPUs on the system:
>>>> sysctl_sched_latency = 6ms * (1 + ilog2(ncpus)). On an 8-core system,
>>>> a normal task will run around 12ms in one shot and the idle task still
>>>> one tick period
>>>
>>> True. This is on a single CPU.
>> Agree. :)
>>
>>>
>>>> Also, you can increase even more the period between 2 runs of idle
>>>> task by using cgroups and min shares value : 2
>>>
>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
>>> not have other requirements preventing this.
>> That could work for increasing the period between 2 runs. But could not
>> reduce the single runtime of idle task I guess, which means normal task
>> could have 1-tick schedule latency because of idle task.
>
> Yes. An idle task will preempt an always running task during 1 tick
> every 680ms. But also you should keep in mind that a waking normal
> task will preempt the idle task immediately which means that it will
> not add scheduling latency to a normal task but "steal" 0.14% of
> normal task throughput (1/680) at most
That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough,
or running CPU-intensive work), the 1-tick scheduling latency can be
detected by cyclictest running inside the VM.
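
For reference, the figures quoted above can be reproduced with a small
user-space sketch. This is not kernel code: the function names are made up
for illustration, and it assumes two nice-0 tasks (weight 1024 each) plus
one SCHED_IDLE task (weight 3) on a single CPU, with the idle task running
one tick at a time.

```c
#include <assert.h>

/* CFS load weights: nice-0 SCHED_NORMAL task and SCHED_IDLE task. */
#define NICE_0_WEIGHT 1024.0
#define IDLE_WEIGHT   3.0

/*
 * Hypothetical helper: number of ticks between two runs of a task that
 * runs for one tick at a time. This is the inverse of the task's share
 * of the total runqueue weight.
 */
static double ticks_between_runs(double weight, double total_weight)
{
	assert(weight > 0.0 && total_weight >= weight);
	return total_weight / weight;
}

/* Hypothetical helper: the task's share of CPU time, in percent. */
static double cpu_share_percent(double weight, double total_weight)
{
	assert(weight > 0.0 && total_weight >= weight);
	return 100.0 * weight / total_weight;
}
```

With a total weight of 2 * 1024 + 3 = 2051, ticks_between_runs(IDLE_WEIGHT,
2051) is about 683.7 ticks, i.e. roughly 684ms at HZ=1000 and roughly
2735ms at HZ=250, and cpu_share_percent(IDLE_WEIGHT, 2051) is about 0.146%.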

OTOH, we compensate vruntime in place_entity() to boost waking tasks
without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
make that distinction? Something like:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		vruntime += sched_vslice(cfs_rq, se);

 	/* sleeps up to a single latency don't count. */
-	if (!initial) {
+	if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
 		unsigned long thresh = sysctl_sched_latency;
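
To illustrate the intended effect of the change above, here is a simplified
user-space model of the wakeup branch of place_entity(). It is only a
sketch: place_waking() and its parameters are hypothetical names, the 6ms
latency is an assumed single-CPU default, and it assumes
GENTLE_FAIR_SLEEPERS is enabled so the sleeper credit is half the latency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 6ms scheduling latency, in nanoseconds. */
#define SCHED_LATENCY_NS 6000000ULL

/*
 * Model of the !initial path: return the vruntime at which a waking
 * entity would be placed. skip_idle_bonus selects the proposed
 * behaviour, where SCHED_IDLE tasks get no sleeper credit.
 */
static uint64_t place_waking(uint64_t min_vruntime, uint64_t se_vruntime,
			     bool idle_policy, bool skip_idle_bonus)
{
	uint64_t vruntime = min_vruntime;

	if (!(skip_idle_bonus && idle_policy)) {
		/* Sleeper credit, halved as with GENTLE_FAIR_SLEEPERS. */
		uint64_t thresh = SCHED_LATENCY_NS / 2;

		vruntime -= thresh;
	}

	/* Never gain time by being placed backwards (signed compare). */
	if ((int64_t)(se_vruntime - vruntime) > 0)
		vruntime = se_vruntime;
	return vruntime;
}
```

In this model, a long-sleeping SCHED_IDLE task waking at min_vruntime =
100ms is placed at 97ms without the change and at 100ms with it, so it no
longer lands ahead of the running normal tasks.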

>
>> OTOH, cgroups(shares) could introduce extra complexity. :)
>>
>> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
>> lower than SCHED_NORMAL(OTHER), which means no weights/shares
>> for them, and they run only when no other task’s runnable.
>> I guess there may be priority inversion issue if we do that. But maybe we
>
> Exactly, that's why we must ensure a minimum running time for sched_idle task

Still, in the VM case, different VMs are largely isolated from each other,
so priority inversion should be very rare. We’re trying to make offline tasks
absolutely harmless to online tasks. :)

Thanks a lot for your time.
Regards,
Jiang

>
>> could avoid it by load-balance more aggressively, or it(priority inversion)
>> could be ignored in some special case.
>>
>>>