Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

From: Patrick Bellasi
Date: Fri Jun 05 2020 - 03:55:52 EST



Hi Qais,

On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef <qais.yousef@xxxxxxx> wrote...

> On 06/03/20 16:59, Vincent Guittot wrote:
>> When I want to stress the fast path i usually use "perf bench sched pipe -T "
>> The tip/sched/core on my arm octo core gives the following results for
>> 20 iterations of perf bench sched pipe -T -l 50000
>>
>> all uclamp config disabled 50035.4(+/- 0.334%)
>> all uclamp config enabled 48749.8(+/- 0.339%) -2.64%

I use to run the same test but I don't remember such big numbers :/

>> It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepath I was suspecting seems to affect it.
>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> * e.g. due to future modification, warn and fixup the expected value.
> */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
> if (bucket->value >= rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }
> +#endif

Yep, that's likely where we have most of the overhead at dequeue time,
sine _sometimes_ we need to update the cpu's clamp value.

However, while running perf sched pipe, I expect:
- all tasks to have the same clamp value
- all CPUs to have _always_ at least one RUNNABLE task

Given these two conditions above, if the CPU is never "CFS idle" (i.e.
without RUNNABLE CFS tasks), the code above should never be triggered.
More on that later...

> }
>
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.

It loops over UCLAMP_CNT values which are defined to fit into a single
$L. That was the optimal space/time complexity compromise we found to
get the MAX of a set of values.

> Commenting this whole path out strangely doesn't just 'fix' it,
> but produces better results to no-uclamp kernel :-/
>
> # ./perf bench -r 20 sched pipe -T -l 50000
> Without uclamp: 5039
> With uclamp: 4832
> With uclamp+patch: 5729

I explain it below: with that code removed you never decrease the CPU's
uclamp value. Thus, the first time you schedule an RT task you go to MAX
OPP and stay there forever.

> It might be because schedutil gets biased differently by uclamp..? If I move to
> performance governor these numbers almost double.
>
> I don't know. But this promoted me to look closer and

Just to resume, when a task is dequeued we can have only these cases:

- uclamp(task) < uclamp(cpu):
this happens when the task was co-scheduled with other tasks with
higher clamp values which are still RUNNABLE.
In this case there are no uclamp(cpu) updates.

- uclamp(task) == uclamp(cpu):
this happens when the task was one of the tasks defining the current
uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
tasks clamp values.

In this last case we _not_ always need to do a uclamp(cpu) update.
Indeed the update is required _only_ when that task was _the last_ task
defining the current uclamp(cpu) value.

In this case we use uclamp_rq_max_value() to do a linear scan of
UCLAMP_CNT values which fits into a single cache line.

> I think I spotted a bug where in the if condition we check for '>='
> instead of '>', causing us to take the supposedly impossible fail safe
> path.

The fail safe path is when the '>' condition matches, which is what the
SCHED_WARN_ON tell us. Indeed, we never expect uclamp(cpu) to be bigger
than one of its RUNNABLE tasks. If that should happen we WARN and fix
the cpu clamp value for the best.

The normal path is instead '=' and, according to by previous resume,
it's expected to be executed _only_ when we dequeue the last task of the
clamp group defining the current uclamp(cpu) value.

> Mind trying with the below patch please?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> * e.g. due to future modification, warn and fixup the expected value.
> */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> - if (bucket->value >= rq_clamp) {
> + if (bucket->value > rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }

This patch is thus bogus, since we never expect to have uclamp(cpu)
bigger than uclamp(task) and thus we will never reset the clamp value of
a cpu.