Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Vincent Guittot
Date: Tue Sep 21 2021 - 04:07:24 EST


On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the
> reasons this sysctl may be used is for "optimising for throughput",
> particularly when overloaded. The TuneD tool sometimes alters it for two
> profiles, e.g. "mssql" and "throughput-performance". At least version 2.9
> does; this changed in master, which now pokes at debugfs instead.
>
> During task migration or wakeup, a decision is made on whether
> to preempt the current task or not. To limit over-scheduling,
> sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> of runtime before preempting. However, when a domain is heavily overloaded
> (e.g. hackbench), the degree of over-scheduling is still severe. This is

sysctl_sched_wakeup_granularity = 1 msec * (1 + ilog(ncpus))
AFAIK, a 2-socket CascadeLake has 56 cpus, which means that
sysctl_sched_wakeup_granularity is 6ms for your platform.
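
E.g., assuming the default SCHED_TUNABLESCALING_LOG scaling and all 56
cpus online when the tunables are scaled at boot:

	1 msec * (1 + ilog2(56)) = 1 msec * (1 + 5) = 6 msec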

> problematic as a lot of time can be wasted rescheduling tasks that could
> instead be used by userspace tasks.
>
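(For reference, the check being described is wakeup_preempt_entity(),
which with the gran conversion inlined is roughly:

	s64 vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;	/* curr still has the lower vruntime, don't preempt */

	gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
	if (vdiff > gran)
		return 1;	/* woken task leads by more than a gran, preempt */

so the granularity only delays preemption by woken tasks whose vruntime
is already lower than the current task's.)
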
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU up to a max of 8ms by default. The intent is to

This becomes 8*6=48ms on your platform, which is far more than the 15ms
below. Also, 48ms is quite a long time to wait for a newly woken task,
especially when this task is a bottleneck.
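
Assuming the default sched_nr_latency of 8 and the 6ms base above, the
proposed scaling works out to:

	gran = 6 msec * min(nr_running >> 1, 8)

so any cfs_rq with 16 or more running tasks hits the 6 * 8 = 48 msec
cap before the vruntime comparison is made.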

> allow tasks to run for longer while overloaded so that some tasks may
> complete faster and reduce the degree to which a domain is overloaded.
> Note that the TuneD throughput-performance profile allows up to 15ms,
> but there is no explanation of why such a long period was necessary,
> so this patch is conservative and uses the value that
> check_preempt_wakeup() also takes into account. An internet search for
> instances where this parameter is tuned to high values offers either
> no explanation or a broken one.
>
> This improved hackbench on a range of machines when communicating via
> pipes (sockets showed little to no difference). For a 2-socket CascadeLake
> machine, the results were
>
> hackbench-process-pipes
> 5.15.0-rc1 5.15.0-rc1
> vanilla sched-scalewakegran-v1r4
> Amean 1 0.3253 ( 0.00%) 0.3337 ( -2.56%)
> Amean 4 0.8300 ( 0.00%) 0.7983 ( 3.82%)
> Amean 7 1.1003 ( 0.00%) 1.1600 * -5.42%*
> Amean 12 1.7263 ( 0.00%) 1.6457 * 4.67%*
> Amean 21 3.0063 ( 0.00%) 2.7933 * 7.09%*
> Amean 30 4.2323 ( 0.00%) 3.8010 * 10.19%*
> Amean 48 6.5657 ( 0.00%) 5.6453 * 14.02%*
> Amean 79 10.4867 ( 0.00%) 8.5960 * 18.03%*
> Amean 110 14.8880 ( 0.00%) 11.4173 * 23.31%*
> Amean 141 19.2083 ( 0.00%) 14.3850 * 25.11%*
> Amean 172 23.4847 ( 0.00%) 17.1980 * 26.77%*
> Amean 203 27.3763 ( 0.00%) 20.1677 * 26.33%*
> Amean 234 31.3707 ( 0.00%) 23.4053 * 25.39%*
> Amean 265 35.4663 ( 0.00%) 26.3513 * 25.70%*
> Amean 296 39.2380 ( 0.00%) 29.3670 * 25.16%*
>
> For Zen 3:
>
> hackbench-process-pipes
> 5.15.0-rc1 5.15.0-rc1
> vanilla sched-scalewakegran-v1r4
> Amean 1 0.3780 ( 0.00%) 0.4080 ( -7.94%)
> Amean 4 0.5393 ( 0.00%) 0.5217 ( 3.28%)
> Amean 7 0.5480 ( 0.00%) 0.5577 ( -1.76%)
> Amean 12 0.5803 ( 0.00%) 0.5667 ( 2.35%)
> Amean 21 0.7073 ( 0.00%) 0.6543 * 7.49%*
> Amean 30 0.8663 ( 0.00%) 0.8290 ( 4.31%)
> Amean 48 1.2720 ( 0.00%) 1.1337 * 10.88%*
> Amean 79 1.9403 ( 0.00%) 1.7247 * 11.11%*
> Amean 110 2.6827 ( 0.00%) 2.3450 * 12.59%*
> Amean 141 3.6863 ( 0.00%) 3.0253 * 17.93%*
> Amean 172 4.5853 ( 0.00%) 3.4987 * 23.70%*
> Amean 203 5.4893 ( 0.00%) 3.9630 * 27.81%*
> Amean 234 6.6017 ( 0.00%) 4.4230 * 33.00%*
> Amean 265 7.3850 ( 0.00%) 4.8317 * 34.57%*
> Amean 296 8.5823 ( 0.00%) 5.3327 * 37.86%*
>
> For other workloads, the benefits were marginal as the extremely
> overloaded case is not hit to the same extent.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
> 1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 038edfaaae9e..8e12aeebf4ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4511,7 +4511,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> }
>
> static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> + struct sched_entity *se);
>
> /*
> * Pick the next process, keeping these things in mind, in this order:
> @@ -4550,16 +4551,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> second = curr;
> }
>
> - if (second && wakeup_preempt_entity(second, left) < 1)
> + if (second && wakeup_preempt_entity(NULL, second, left) < 1)
> se = second;
> }
>
> - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> /*
> * Someone really wants this to run. If it's not unfair, run it.
> */
> se = cfs_rq->next;
> - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> /*
> * Prefer last buddy, try to return the CPU to a preempted task.
> */
> @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> }
> #endif /* CONFIG_SMP */
>
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> unsigned long gran = sysctl_sched_wakeup_granularity;
>
> + /*
> + * If rq is specified, scale the granularity relative to the number
> + * of running tasks but no more than 8ms with default
> + * sysctl_sched_wakeup_granularity settings. The wakeup gran
> + * reduces over-scheduling but if tasks are stacked then the
> + * domain is likely overloaded and over-scheduling may
> + * prolong the overloaded state.
> + */
> + if (cfs_rq && cfs_rq->nr_running > 1)
> + gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> +
> /*
> * Since its curr running now, convert the gran from real-time
> * to virtual-time in his units.
> @@ -7079,14 +7092,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> *
> */
> static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> + struct sched_entity *se)
> {
> s64 gran, vdiff = curr->vruntime - se->vruntime;
>
> if (vdiff <= 0)
> return -1;
>
> - gran = wakeup_gran(se);
> + gran = wakeup_gran(cfs_rq, se);
> if (vdiff > gran)
> return 1;
>
> @@ -7191,7 +7205,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> return;
>
> update_curr(cfs_rq);
> - if (wakeup_preempt_entity(se, pse) == 1) {
> + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
> /*
> * Bias pick_next to pick the sched entity that is
> * triggering this preemption.
> --
> 2.31.1
>