Re: [PATCH 2/2] sched/fair: Increase wakeup_gran if current task has not executed the minimum granularity

From: Vincent Guittot
Date: Fri Oct 29 2021 - 12:07:45 EST


On Thu, 28 Oct 2021 at 11:49, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
> moved the kernel.sched_wakeup_granularity_ns sysctl under debugfs.
> One of the reasons why this sysctl may be used may be for "optimising
> for throughput", particularly when overloaded. The tool TuneD sometimes
> alters this for two profiles e.g. "mssql" and "throughput-performance". At
> least version 2.9 does but it changed in master where it also will poke
> at debugfs instead. This patch aims to reduce the motivation to tweak
> sysctl_sched_wakeup_granularity by increasing sched_wakeup_granularity
> if the running task runtime has not exceeded sysctl_sched_min_granularity.
>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduled,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is
> problematic as time is wasted rescheduling tasks that could instead be
> used by userspace tasks.
>
> However, care must be taken. Even if a system is overloaded, there may
> be high priority threads that must still be able to run. Mike Galbraith
> explained the constraints as follows;
>
> CFS came about because the O1 scheduler was unfair to the
> point it had starvation problems. People pretty much across the
> board agreed that a fair scheduler was a much way better way
> to go, and CFS was born. It didn't originally have the sleep
> credit business, but had to grow it to become _short term_ fair.
> Ingo cut the sleep credit in half because of overscheduling, and
> that has worked out pretty well all told.. but now you're pushing
> it more in the unfair direction, all the way to extremely unfair
> for anything and everything very light.
>
> Fairness isn't the holy grail mind you, and at some point, giving
> up on short term fairness certainly isn't crazy, as proven by your
> hackbench numbers and other numbers we've seen over the years,
> but taking bites out of the 'CF' in the CFS that was born to be a
> corner-case killer is.. worrisome. The other shoe will drop.. it
> always does :)
>
> This patch increases the wakeup granularity if the current task has not
> reached its minimum preemption granularity. The current task may still
> be preempted but the difference in runtime must be higher.
>
> hackbench-process-pipes
> 5.15.0-rc3 5.15.0-rc3
> sched-wakeeflips-v1r1sched-scalewakegran-v3r2
> Amean 1 0.3890 ( 0.00%) 0.3823 ( 1.71%)
> Amean 4 0.5217 ( 0.00%) 0.4867 ( 6.71%)
> Amean 7 0.5387 ( 0.00%) 0.5053 ( 6.19%)
> Amean 12 0.5443 ( 0.00%) 0.5450 ( -0.12%)
> Amean 21 0.6487 ( 0.00%) 0.6807 ( -4.93%)
> Amean 30 0.8033 ( 0.00%) 0.7107 * 11.54%*
> Amean 48 1.2400 ( 0.00%) 1.0447 * 15.75%*
> Amean 79 1.8200 ( 0.00%) 1.6033 * 11.90%*
> Amean 110 2.5820 ( 0.00%) 2.0763 * 19.58%*
> Amean 141 3.2203 ( 0.00%) 2.5313 * 21.40%*
> Amean 172 3.8200 ( 0.00%) 3.1163 * 18.42%*
> Amean 203 4.3357 ( 0.00%) 3.5560 * 17.98%*
> Amean 234 4.8047 ( 0.00%) 3.8913 * 19.01%*
> Amean 265 5.1243 ( 0.00%) 4.2293 * 17.47%*
> Amean 296 5.5940 ( 0.00%) 4.5357 * 18.92%*
>
> 5.15.0-rc3 5.15.0-rc3
> sched-wakeeflips-v1r1 sched-scalewakegran-v3r2
> Duration User 2567.27 2034.17
> Duration System 21098.79 17137.08
> Duration Elapsed 136.49 120.2
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 17 +++++++++++++++--
> kernel/sched/features.h | 2 ++
> 2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d00af3b97d8f..dee108470297 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7052,10 +7052,23 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> }
> #endif /* CONFIG_SMP */
>
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
> {
> unsigned long gran = sysctl_sched_wakeup_granularity;
>
> + if (sched_feat(SCALE_WAKEUP_GRAN)) {
> + unsigned long delta_exec;
> +
> + /*
> + * Increase the wakeup granularity if curr's runtime
> + * is less than the minimum preemption granularity.
> + */
> + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> + if (delta_exec < sysctl_sched_min_granularity)
> + gran += sysctl_sched_min_granularity;

I need to think a bit more about corner cases but this change looks
much better than the previous one.

> + }
> +
> /*
> * Since its curr running now, convert the gran from real-time
> * to virtual-time in his units.
> @@ -7094,7 +7107,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> if (vdiff <= 0)
> return -1;
>
> - gran = wakeup_gran(se);
> + gran = wakeup_gran(curr, se);
> if (vdiff > gran)
> return 1;
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7f8dace0964c..611591355ffd 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -95,3 +95,5 @@ SCHED_FEAT(LATENCY_WARN, false)
>
> SCHED_FEAT(ALT_PERIOD, true)
> SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
> --
> 2.31.1
>