Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Mel Gorman
Date: Tue Sep 21 2021 - 06:45:25 EST


On Tue, Sep 21, 2021 at 10:03:56AM +0200, Vincent Guittot wrote:
> On Mon, 20 Sept 2021 at 16:26, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the
> > reasons this sysctl may be used is "optimising for throughput",
> > particularly when overloaded. The tool TuneD sometimes alters this for two
> > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > does, but this changed in master, where it pokes at debugfs instead.
> >
> > During task migration or wakeup, a decision is made on whether
> > to preempt the current task or not. To limit over-scheduling,
> > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> > of runtime before preempting. However, when a domain is heavily overloaded
> > (e.g. hackbench), the degree of over-scheduling is still severe. This is
>
> sysctl_sched_wakeup_granularity = 1 msec * (1 + ilog(ncpus))
> AFAIK, a 2-socket CascadeLake has 56 cpus which means that
> sysctl_sched_wakeup_granularity is 6ms for your platform
>

On my machine it becomes 7ms, but let's assume there were 56 cpus to avoid
confusion.
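
For reference, the arithmetic quoted above can be sketched as follows. This
is only an illustration of the formula as written in this thread, reading
"ilog" as the integer base-2 logarithm. The helper names are made up for
the example, and the in-kernel calculation may differ in detail.

```python
def ilog2(n: int) -> int:
    """Floor of log2(n) for n >= 1 (assumed meaning of "ilog" above)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return n.bit_length() - 1


def wakeup_gran_ms(ncpus: int, base_ms: int = 1) -> int:
    """Hypothetical helper: 1 msec * (1 + ilog(ncpus)), as quoted."""
    return base_ms * (1 + ilog2(ncpus))


if __name__ == "__main__":
    # 56 cpus: ilog2(56) == 5, so the granularity is 6ms.
    print(wakeup_gran_ms(56))
    # Under this formula, 7ms corresponds to 64..127 cpus.
    print(wakeup_gran_ms(64), wakeup_gran_ms(127))
```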

> > problematic as a lot of time can be wasted rescheduling tasks that could
> > instead be used by userspace tasks.
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU up to a max of 8ms by default. The intent is to
>
> This becomes 8*6=48ms on your platform which is far more than the 15ms
> below. Also 48ms is quite a long time to wait for a newly woken task
> especially when this task is a bottleneck.
>

With the follow-up patch I proposed to Mike, which takes FAIR_SLEEPERS into
account, the cap becomes ((sysctl_sched_latency / gran) >> 1) by default.
That gives 18ms for heavy overloading, or potentially 12ms if there is
enough load to stack 2 tasks. The patch generates a warning as I didn't
even build test it, but hey, it was for illustrative purposes.
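
A rough sketch of where 18ms and 12ms come from, under stated assumptions:
a base granularity of 6ms (the 56-cpu case), sysctl_sched_latency of 36ms
(assuming the 6ms default scaled by the same factor of 6), and a
granularity that is the base multiplied by a scale capped at
((latency / gran) >> 1). The function names and the way the scale is
passed in are hypothetical and are not taken from the patch.

```python
def max_scale(latency_ms: int, gran_ms: int) -> int:
    """Cap on the multiplier: (sysctl_sched_latency / gran) >> 1."""
    return (latency_ms // gran_ms) >> 1


def scaled_gran_ms(scale: int, latency_ms: int = 36, gran_ms: int = 6) -> int:
    """Hypothetical: base granularity times a scale clamped to [1, cap]."""
    cap = max_scale(latency_ms, gran_ms)
    return gran_ms * max(1, min(scale, cap))


if __name__ == "__main__":
    print(max_scale(36, 6))    # cap on the multiplier
    print(scaled_gran_ms(2))   # load that stacks 2 tasks
    print(scaled_gran_ms(16))  # heavy overload, clamped by the cap
    print(8 * 6)               # original patch: 8x the 6ms base
```

With these assumed values the cap is 3, so the worst case drops from the
48ms of the original patch to 18ms.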

Is that any better conceptually, or should we ignore the problem? My aim
here is really to reduce the motivation of others to "tune" debugfs values
or be tempted to revert the move to debugfs.

--
Mel Gorman
SUSE Labs