Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Mike Galbraith
Date: Wed Sep 22 2021 - 21:48:03 EST


On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> >
> > I'm not seeing an alternative suggestion that could be turned into
> > an implementation. The current value for sched_wakeup_granularity
> > was set 12 years ago, when it was exposed for tuning, which is no
> > longer the case. The intent was to allow some dynamic adjustment between
> > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > over-scheduling in the worst case without disabling preemption entirely
> > (which the first version did).

I don't think those knobs were ever _intended_ for general purpose
tuning, but they did get used that way by some folks.
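For anyone skimming the thread, the dynamic adjustment being discussed
is roughly of the shape below. That is only my sketch of the idea, not
Mel's actual patch; the helper name, the clamping bounds and the use of
cfs_rq->nr_running vs. sched_nr_latency are illustrative, the real
details live in the patch itself.

	/*
	 * Illustrative only: grow the effective wakeup granularity as the
	 * runqueue gets crowded, but never past sysctl_sched_latency, so
	 * wakeup preemption gets throttled rather than disabled outright.
	 */
	static unsigned long scaled_wakeup_gran(struct cfs_rq *cfs_rq)
	{
		unsigned long gran = sysctl_sched_wakeup_granularity;

		if (cfs_rq->nr_running > sched_nr_latency)
			gran *= cfs_rq->nr_running / sched_nr_latency;

		return min_t(unsigned long, gran, sysctl_sched_latency);
	}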

> >
> > Should we just ignore this problem and hope it goes away or just let
> > people keep poking silly values into debugfs via tuned?
>
> We should certainly not add a band-aid just because people will
> otherwise keep poking silly values into it. And increasing
> sysctl_sched_wakeup_granularity based on the number of running threads
> is not the right solution.

Watching my desktop box stack up large piles of very short running
threads, I agree, instantaneous load looks like a non-starter.

> According to your description of the
> problem, the current task doesn't get enough time to move forward, so
> sysctl_sched_min_granularity should be part of the solution. Something
> like the below will ensure that current gets a chance to move forward:

Nah, progress is guaranteed; the issue is a zillion very similar short
running threads preempting each other with no win to be had, thus
spending cycles in the scheduler that are utterly wasted. It's a valid
issue, the trouble is teaching the scheduler to recognize that situation
without mucking up other situations where there IS a win for even very
short running threads, say, doing a synchronous handoff; preemption is
cheaper than scheduling off if the waker is going to be awakened again
in very short order.
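
(To make that distinction concrete -- purely illustrative, not a
proposal: any throttle on wakeup preemption would have to leave the
synchronous-handoff case alone, e.g. by looking at the wake flags.
The helper below is hypothetical; only WF_SYNC, sched_nr_latency and
cfs_rq->nr_running are existing kernel symbols.)

	/*
	 * Hypothetical sketch: never throttle sync wakeups, where the waker
	 * expects to sleep immediately and preempting it is cheap; only
	 * throttle when the runqueue is genuinely crowded.
	 */
	static int wakeup_preempt_throttled(struct cfs_rq *cfs_rq, int wake_flags)
	{
		if (wake_flags & WF_SYNC)
			return 0;

		return cfs_rq->nr_running >= sched_nr_latency;
	}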

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9bf540f04c2d..39d4e4827d3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>         int scale = cfs_rq->nr_running >= sched_nr_latency;
>         int next_buddy_marked = 0;
>         int cse_is_idle, pse_is_idle;
> +       unsigned long delta_exec;
>
>         if (unlikely(se == pse))
>                 return;
> @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 return;
>
>         update_curr(cfs_rq_of(se));
> +       delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +       /*
> +        * Ensure that current got a chance to move forward
> +        */
> +       if (delta_exec < sysctl_sched_min_granularity)
> +               return;
> +
>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is

Yikes! If you do that, you may as well go the extra nanometer and rip
wakeup preemption out entirely, same result, impressive diffstat.
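
For scale (assuming the default tunables and the default logarithmic
scaling, factor 1 + ilog2(ncpus), so 4 on an 8-CPU box, grain of salt
for other configs): sysctl_sched_min_granularity works out to about
0.75ms * 4 = 3ms. Requiring current to have run that long before
wakeup_preempt_entity() is even consulted shuts wakeup preemption off
for the first ~3ms of every slice, which covers the bulk of the wakeups
that matter.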

-Mike