Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Mike Galbraith
Date: Mon Sep 20 2021 - 23:53:10 EST


On Mon, 2021-09-20 at 15:26 +0100, Mel Gorman wrote:
>
> This patch scales the wakeup granularity based on the number of running
> tasks on the CPU up to a max of 8ms by default.  The intent is to
> allow tasks to run for longer while overloaded so that some tasks may
> complete faster and reduce the degree a domain is overloaded. Note that
> the TuneD throughput-performance profile allows up to 15ms but there
> is no explanation why such a long period was necessary so this patch is
> conservative and uses the value that check_preempt_wakeup() also takes
> into account.  An internet search for instances where this parameter are
> tuned to high values offer either no explanation or a broken one.
>
> This improved hackbench on a range of machines when communicating via
> pipes (sockets show little to no difference). For a 2-socket CascadeLake
> machine, the results were

Twiddling wakeup preemption based upon the performance of a fugly fork
bomb seems like a pretty bad idea to me.

Preemption does rapidly run into diminishing return as load climbs for
a lot of loads, but as you know, it's a rather sticky wicket because
even when over-committed, preventing light control threads from slicing
through (what can be a load's own work crew of) hogs can seriously
injure performance.

<snip>

> @@ -7044,10 +7045,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>  
> -static unsigned long wakeup_gran(struct sched_entity *se)
> +static unsigned long
> +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         unsigned long gran = sysctl_sched_wakeup_granularity;
>  
> +       /*
> +        * If rq is specified, scale the granularity relative to the number
> +        * of running tasks but no more than 8ms with default
> +        * sysctl_sched_wakeup_granularity settings. The wakeup gran
> +        * reduces over-scheduling but if tasks are stacked then the
> +        * domain is likely overloaded and over-scheduling may
> +        * prolong the overloaded state.
> +        */
> +       if (cfs_rq && cfs_rq->nr_running > 1)
> +               gran *= min(cfs_rq->nr_running >> 1, sched_nr_latency);
> +

Maybe things have changed while I wasn't watching closely, but...

The scaled up tweakables on my little quad desktop box:
sched_nr_latency = 8
sched_wakeup_granularity = 4ms
sched_latency = 24ms

Due to the FAIR_SLEEPERS feature, a task can only receive a max of
sched_latency/2 sleep credit, ie the delta between waking sleeper and
current is clipped to a max of 12 virtual ms, so the instant our
preempt threshold reaches 12.000ms, by human booboo or now 3 runnable
tasks with this change, wakeup preemption is completely disabled, or?

-Mike