Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running

From: Vincent Guittot
Date: Thu Sep 23 2021 - 04:41:04 EST


On Thu, 23 Sept 2021 at 03:47, Mike Galbraith <efault@xxxxxx> wrote:
>
> On Wed, 2021-09-22 at 20:22 +0200, Vincent Guittot wrote:
> > On Wed, 22 Sept 2021 at 19:38, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > >
> > > I'm not seeing an alternative suggestion that could be turned into
> > > an implementation. The current value for sched_wakeup_granularity
> > > was set 12 years ago and was exposed for tuning, which is no longer
> > > the case. The intent was to allow some dynamic adjustment between
> > > sysctl_sched_wakeup_granularity and sysctl_sched_latency to reduce
> > > over-scheduling in the worst case without disabling preemption entirely
> > > (which the first version did).
>
> I don't think those knobs were ever _intended_ for general purpose
> tuning, but they did get used that way by some folks.
>
> > >
> > > Should we just ignore this problem and hope it goes away or just let
> > > people keep poking silly values into debugfs via tuned?
> >
> > We should certainly not add a bandaid because people will continue to
> > poke silly values in the end. And increasing
> > sysctl_sched_wakeup_granularity based on the number of running threads
> > is not the right solution.
>
> Watching my desktop box stack up large piles of very short running
> threads, I agree, instantaneous load looks like a non-starter.
>
> > According to the description of your
> > problem that the current task doesn't get enough time to move forward,
> > sysctl_sched_min_granularity should be part of the solution. Something
> > like below will ensure that current got a chance to move forward
>
> Nah, progress is guaranteed, the issue is a zillion very similar short
> running threads preempting each other with no win to be had, thus
> spending cycles in the scheduler that are utterly wasted. It's a valid
> issue, trouble is teaching the scheduler to recognize that situation
> without mucking up other situations where there IS a win for even very
> short running threads say, doing a synchronous handoff; preemption is
> cheaper than scheduling off if the waker is going to be awakened again in
> very short order.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9bf540f04c2d..39d4e4827d3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7102,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >  	int scale = cfs_rq->nr_running >= sched_nr_latency;
> >  	int next_buddy_marked = 0;
> >  	int cse_is_idle, pse_is_idle;
> > +	unsigned long delta_exec;
> >
> >  	if (unlikely(se == pse))
> >  		return;
> > @@ -7161,6 +7162,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >  		return;
> >
> >  	update_curr(cfs_rq_of(se));
> > +	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > +	/*
> > +	 * Ensure that current got a chance to move forward
> > +	 */
> > +	if (delta_exec < sysctl_sched_min_granularity)
> > +		return;
> > +
> >  	if (wakeup_preempt_entity(se, pse) == 1) {
> >  		/*
> >  		 * Bias pick_next to pick the sched entity that is
>
> Yikes! If you do that, you may as well go the extra nanometer and rip
> wakeup preemption out entirely, same result, impressive diffstat.

This patch is mainly there to show that there are other ways to ensure
progress without using a load heuristic.
sysctl_sched_min_granularity has the problem that it scales with the
number of CPUs, which can generate large values. At the least we should
use normalized_sysctl_sched_min_granularity, or even a smaller value.
But wakeup preemption still happens with this change; it only ensures
that tasks don't waste time preempting each other without any chance
to do actual work.
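
To put rough numbers on the scaling, here is a userspace sketch. It
assumes the default logarithmic tunable scaling (factor = 1 +
ilog2(min(ncpus, 8))) and a 0.75 ms normalized base, which is how I
read fair.c at this point. The helper names are made up for
illustration and are not kernel symbols.

```c
#include <stdint.h>

/* Assumed unscaled base: 0.75 ms. */
#define NORMALIZED_MIN_GRANULARITY_NS 750000ULL

/* Integer log2; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int l = 0;

	while (n >>= 1)
		l++;
	return l;
}

/* Assumed default "log" scaling: 1 + ilog2(min(ncpus, 8)). */
static unsigned int scaling_factor(unsigned int ncpus)
{
	unsigned int cpus = ncpus < 8 ? ncpus : 8;

	return 1 + ilog2_u(cpus);
}

/* Effective min granularity, in ns, for a given number of online CPUs. */
static uint64_t scaled_min_granularity_ns(unsigned int ncpus)
{
	return scaling_factor(ncpus) * NORMALIZED_MIN_GRANULARITY_NS;
}
```

Under those assumptions any machine with 8 or more CPUs ends up with a
3 ms floor, which is far too long to block wakeup preemption.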

A 100us value should even be enough to fix Mel's problem without
impacting common wakeup preemption cases.
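
As a sketch only, the check with a fixed floor could look like the
following outside the kernel. WAKEUP_MIN_RUN_NS and struct se_runtime
are hypothetical; only the two runtime fields and the delta_exec
computation mirror the patch above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed floor from the 100us figure; not an existing sysctl. */
#define WAKEUP_MIN_RUN_NS 100000ULL

/* Userspace stand-in for the two sched_entity fields the patch reads. */
struct se_runtime {
	uint64_t sum_exec_runtime;      /* total ns executed so far */
	uint64_t prev_sum_exec_runtime; /* total at the start of this run */
};

/*
 * Returns true once current has run at least the floor since it was
 * last put on the CPU. Only then would the usual vruntime comparison
 * be consulted; below the floor the wakeup does not preempt.
 */
static bool may_consider_wakeup_preempt(const struct se_runtime *curr)
{
	uint64_t delta_exec = curr->sum_exec_runtime -
			      curr->prev_sum_exec_runtime;

	return delta_exec >= WAKEUP_MIN_RUN_NS;
}
```

A task that has run 50us since being picked is left alone, while one
that has run 250us goes through the normal wakeup preemption test.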


>
> -Mike