Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

From: Zhang Qiao
Date: Wed Mar 15 2023 - 05:16:46 EST




在 2023/3/14 23:32, Vincent Guittot 写道:
> Le mardi 14 mars 2023 à 14:39:49 (+0100), Vincent Guittot a écrit :
>> On Tue, 14 Mar 2023 at 14:38, Zhang Qiao <zhangqiao22@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> 在 2023/3/14 21:26, Vincent Guittot 写道:
>>>> On Tue, 14 Mar 2023 at 12:03, Zhang Qiao <zhangqiao22@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> 在 2023/3/13 22:23, Vincent Guittot 写道:
>>>>>> On Sat, 11 Mar 2023 at 10:57, Zhang Qiao <zhangqiao22@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 在 2023/3/10 22:29, Vincent Guittot 写道:
>>>>>>>> Le jeudi 09 mars 2023 à 16:14:38 (+0100), Vincent Guittot a écrit :
>>>>>>>>> On Thu, 9 Mar 2023 at 15:37, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 09, 2023 at 03:28:25PM +0100, Peter Zijlstra wrote:
>>>>>>>>>>> On Thu, Mar 09, 2023 at 02:34:05PM +0100, Vincent Guittot wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Then, even if we don't clear exec_start before migrating and keep
>>>>>>>>>>>> current value to be used in place_entity on the new cpu, we can't
>>>>>>>>>>>> compare the rq_clock_task(rq_of(cfs_rq)) of 2 different rqs AFAICT
>>>>>>>>>>>
>>>>>>>>>>> Blergh -- indeed, irq and steal time can skew them between CPUs :/
>>>>>>>>>>> I suppose we can fudge that... wait_start (which is basically what we're
>>>>>>>>>>> making it do) also does that IIRC.
>>>>>>>>>>>
>>>>>>>>>>> I really dislike having this placement muck spreadout like proposed.
>>>>>>>>>>
>>>>>>>>>> Also, I think we might be over-engineering this, we don't care about
>>>>>>>>>> accuracy at all, all we really care about is 'long-time'.
>>>>>>>>>
>>>>>>>>> you mean taking the patch 1/2 that you mentioned here to add a
>>>>>>>>> migrated field:
>>>>>>>>> https://lore.kernel.org/all/68832dfbb60fda030540b5f4e39c5801942689b1.1648228023.git.tim.c.chen@xxxxxxxxxxxxxxx/T/#ma5637eb8010f3f4a4abff778af8db705429d003b
>>>>>>>>>
>>>>>>>>> And assume that the divergence between the rq_clock_task() can be ignored ?
>>>>>>>>>
>>>>>>>>> That could probably work but we need to replace the (60LL *
>>>>>>>>> NSEC_PER_SEC) by ((1ULL << 63) / NICE_0_LOAD) because 60sec divergence
>>>>>>>>> would not be unrealistic.
>>>>>>>>> and a comment to explain why it's acceptable
>>>>>>>>
>>>>>>>> Zhang,
>>>>>>>>
>>>>>>>> Could you try the patch below ?
>>>>>>>> This is a rebase/merge/update of:
>>>>>>>> -patch 1/2 above and
>>>>>>>> -https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@xxxxxxxxx/
>>>>>>>
>>>>>>>
>>>>>>> I applyed and tested this patch, and it make hackbench slower.
>>>>>>> According to my previous test results. The good result is 82.1(s).
>>>>>>> But the result of this patch is 108.725(s).
>>>>>>
>>>>>> By "the result of this patch is 108.725(s)", you mean the result of
>>>>>> https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@xxxxxxxxx/
>>>>>> alone, don't you ?
>>>>>
>>>>> No, with your patch, the test results is 108.725(s),
>>>>
>>>> Ok
>>>>
>>>>>
>>>>> git diff:
>>>>>
>>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>>> index 63d242164b1a..93a3909ae4c4 100644
>>>>> --- a/include/linux/sched.h
>>>>> +++ b/include/linux/sched.h
>>>>> @@ -550,6 +550,7 @@ struct sched_entity {
>>>>> struct rb_node run_node;
>>>>> struct list_head group_node;
>>>>> unsigned int on_rq;
>>>>> + unsigned int migrated;
>>>>>
>>>>> u64 exec_start;
>>>>> u64 sum_exec_runtime;
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index ff4dbbae3b10..e60defc39f6e 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>> /*
>>>>> * We are starting a new run period:
>>>>> */
>>>>> + se->migrated = 0;
>>>>> se->exec_start = rq_clock_task(rq_of(cfs_rq));
>>>>> }
>>>>>
>>>>> @@ -4690,9 +4691,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>>> * inversed due to s64 overflow.
>>>>> */
>>>>> sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
>>>>> - if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>>>> + if ((s64)sleep_time > (1ULL << 63) / scale_load_down(NICE_0_LOAD) / 2) {
>>>>> se->vruntime = vruntime;
>>>>> - else
>>>>> + } else
>>>>> se->vruntime = max_vruntime(se->vruntime, vruntime);
>>>>> }
>>>>>
>>>>> @@ -7658,8 +7659,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>>>> se->avg.last_update_time = 0;
>>>>>
>>>>> /* We have migrated, no longer consider this task hot */
>>>>> - se->exec_start = 0;
>>>>> -
>>>>> + se->migrated = 1;
>>>>> update_scan_period(p, new_cpu);
>>>>> }
>>>>>
>>>>> @@ -8343,6 +8343,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>>>>
>>>>> if (sysctl_sched_migration_cost == 0)
>>>>> return 0;
>>>>> + if (p->se.migrated)
>>>>> + return 0;
>>>>>
>>>>> delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> version1: v6.2
>>>>>>>> version2: v6.2 + commit 829c1651e9c4
>>>>>>>> version3: v6.2 + commit 829c1651e9c4 + this patch
>>>>>>>>
>>>>>>>> -------------------------------------------------
>>>>>>>> version1 version2 version3
>>>>>>>> test1 81.0 118.1 82.1
>>>>>>>> test2 82.1 116.9 80.3
>>>>>>>> test3 83.2 103.9 83.3
>>>>>>>> avg(s) 82.1 113.0 81.9
>>>>>>
>>>>>> Ok, it looks like we are back to normal figures
>>>>
>>>> What do those results refer to then ?
>>>
>>> Quote from this email (https://lore.kernel.org/lkml/1cd19d3f-18c4-92f9-257a-378cc18cfbc7@xxxxxxxxxx/).
>>
>> ok.
>>
>> Then, there is something wrong in my patch. Let me look at it more deeply
>
> Coudl you try the patc below. It fixes the problem on my system
>

Yes, it fixes the problem.

------

Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):

79.53 +- 1.21 seconds time elapsed ( +- 1.52% )



> ---
> kernel/sched/fair.c | 84 +++++++++++++++++++++++++++++++--------------
> 1 file changed, 59 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f499e9a74b5..f8722e47bb0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4648,23 +4648,36 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
> #endif
> }
>
> +static inline bool entity_is_long_sleeper(struct sched_entity *se)
> +{
> + struct cfs_rq *cfs_rq;
> + u64 sleep_time;
> +
> + if (se->exec_start == 0)
> + return false;
> +
> + cfs_rq = cfs_rq_of(se);
> +
> + sleep_time = rq_clock_task(rq_of(cfs_rq));
> +
> + /* Happen while migrating because of clock task divergence */
> + if (sleep_time <= se->exec_start)
> + return false;
> +
> + sleep_time -= se->exec_start;
> + if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
> + return true;
> +
> + return false;
> +}
> +
> static void
> -place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> +place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> u64 vruntime = cfs_rq->min_vruntime;
> - u64 sleep_time;
> -
> - /*
> - * The 'current' period is already promised to the current tasks,
> - * however the extra weight of the new task will slow them down a
> - * little, place the new task so that it fits in the slot that
> - * stays open at the end.
> - */
> - if (initial && sched_feat(START_DEBIT))
> - vruntime += sched_vslice(cfs_rq, se);
>
> /* sleeps up to a single latency don't count. */
> - if (!initial) {
> + if (flags & ENQUEUE_WAKEUP) {
> unsigned long thresh;
>
> if (se_is_idle(se))
> @@ -4680,20 +4693,43 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> thresh >>= 1;
>
> vruntime -= thresh;
> + } else if sched_feat(START_DEBIT) {

There's a parenthesis missing here.


> + /*
> + * The 'current' period is already promised to the current tasks,
> + * however the extra weight of the new task will slow them down a
> + * little, place the new task so that it fits in the slot that
> + * stays open at the end.
> + */
> + vruntime += sched_vslice(cfs_rq, se);
> }
>
> /*
> * Pull vruntime of the entity being placed to the base level of
> - * cfs_rq, to prevent boosting it if placed backwards. If the entity
> - * slept for a long time, don't even try to compare its vruntime with
> - * the base as it may be too far off and the comparison may get
> - * inversed due to s64 overflow.
> - */
> - sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> - if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> + * cfs_rq, to prevent boosting it if placed backwards.
> + * However, min_vruntime can advance much faster than real time, with
> + * the exterme being when an entity with the minimal weight always runs
> + * on the cfs_rq. If the new entity slept for long, its vruntime
> + * difference from min_vruntime may overflow s64 and their comparison
> + * may get inversed, so ignore the entity's original vruntime in that
> + * case.
> + * The maximal vruntime speedup is given by the ratio of normal to
> + * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
> + * When placing a migrated waking entity, its exec_start has been set
> + * from a different rq. In order to take into account a possible
> + * divergence between new and prev rq's clocks task because of irq and
> + * stolen time, we take an additional margin.
> + * So, cutting off on the sleep time of
> + * 2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
> + * should be safe.
> +
> + */
> + if (entity_is_long_sleeper(se))
> se->vruntime = vruntime;
> else
> se->vruntime = max_vruntime(se->vruntime, vruntime);
> +
> + if (flags & ENQUEUE_MIGRATED)
> + se->exec_start = 0;
> }
>
> static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -4769,7 +4805,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> account_entity_enqueue(cfs_rq, se);
>
> if (flags & ENQUEUE_WAKEUP)
> - place_entity(cfs_rq, se, 0);
> + place_entity(cfs_rq, se, flags);
>
> check_schedstat_required();
> update_stats_enqueue_fair(cfs_rq, se, flags);
> @@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> /* Tell new CPU we are migrated */
> se->avg.last_update_time = 0;
>
> - /* We have migrated, no longer consider this task hot */
> - se->exec_start = 0;
> -
> update_scan_period(p, new_cpu);
> }
>
> @@ -11993,7 +12026,7 @@ static void task_fork_fair(struct task_struct *p)
> update_curr(cfs_rq);
> se->vruntime = curr->vruntime;
> }
> - place_entity(cfs_rq, se, 1);
> + place_entity(cfs_rq, se, 0);
>
> if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
> /*
> @@ -12137,8 +12170,9 @@ static void detach_task_cfs_rq(struct task_struct *p)
> /*
> * Fix up our vruntime so that the current sleep doesn't
> * cause 'unlimited' sleep bonus.
> + * This is the same as placing a waking task.
> */
> - place_entity(cfs_rq, se, 0);
> + place_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> se->vruntime -= cfs_rq->min_vruntime;
> }
>
> --
> 2.34.1
>
>
>>
>>>
>>>>
> .
>