Re: [PATCH v2 4/4] sched/fair: Prevent active LB from preempting higher sched classes

From: Vincent Guittot
Date: Thu Aug 29 2019 - 10:19:48 EST

On Wed, 28 Aug 2019 at 11:46, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
> On 27/08/2019 13:28, Vincent Guittot wrote:
> > On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
> > <valentin.schneider@xxxxxxx> wrote:
> >>
> >> The CFS load balancer can cause the cpu_stopper to run a function to
> >> try and steal a remote rq's running task. However, it so happens
> >> that while only CFS tasks will ever be migrated by that function, we
> >> can end up preempting higher sched class tasks, since it is executed
> >> by the cpu_stopper.
> >>
> >> This can potentially occur whenever a rq's running task is > CFS but
> >> the rq has runnable CFS tasks.
> >>
> >> Check the sched class of the remote rq's running task after we've
> >> grabbed its lock. If it's CFS, carry on, otherwise run
> >> detach_one_task() locally since we don't need the cpu_stopper (that
> >> !CFS task is doing the exact same thing).
> >
> > AFAICT, this doesn't prevent from preempting !CFS task but only reduce
> > the window.
> > As soon as you unlock, !CFS task can preempt CFS before you start stop thread
> >
> Right, if we end up kicking the cpu_stopper this can still happen (since
> we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
> since the currently running is obviously not going to be CFS (and it's
> too late anyway, we already preempted whatever was running there). Though
> I should probably change the name of the patch to reflect that it's not a
> 100% cure.
> I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
> to bail out early, but AFAICT that's the best we can do without big changes
> elsewhere.
> If we wanted to prevent those preemptions at all cost, I suppose we'd want

I'm not sure that it's worth the effort and the complexity

> the equivalent of a sched class sitting between CFS and RT: have the
> callback only run when there's no runnable > CFS tasks. But then by the
> time we execute it we may no longer need to balance anything...
> At the very least, what I'm proposing here alleviates *some* of the
> preemption cases without swinging the wrecking ball too hard (and without
> delaying the balancing either).
> > testing busiest->cfs.h_nr_running < 1 and/or
> > busiest->curr->sched_class != &fair_sched_class
> > in need_active_balance() will do almost the same and is much simpler
> > than this patchset IMO.
> >
> I had this initially but convinced myself out of it: since we hold no
> lock in need_active_balance(), the information we get on the current task
> (and, arguably, on the h_nr_running) is too volatile to be of any use.

But since the lock is released anyway, everything will always be too
volatile in this case.

> I do believe those checks have their place in active_load_balance()'s
> critical section, as that's the most accurate we're going to get. On the
> plus side, if we *do* detect the remote rq's current task isn't CFS, we
> can run detach_one_task() locally, which is an improvement IMO.

This add complexity in the code by adding another path to detach attach task(s).
We could simply bail out and wait the next load balance (which is
already the case sometime) or if you really want to detach a task jump
back to more_balance