Re: [PATCH 4/4] sched/fair: Track possibly overloaded domains and abort a scan if necessary

From: Vincent Guittot
Date: Tue Mar 24 2020 - 06:35:25 EST


On Fri, 20 Mar 2020 at 18:43, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 20, 2020 at 05:54:57PM +0100, Vincent Guittot wrote:
> > On Fri, 20 Mar 2020 at 17:44, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 20, 2020 at 04:48:39PM +0100, Vincent Guittot wrote:
> > > > > ---
> > > > > include/linux/sched/topology.h | 1 +
> > > > > kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++---
> > > > > kernel/sched/features.h | 3 ++
> > > > > 3 files changed, 65 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > > > index af9319e4cfb9..76ec7a54f57b 100644
> > > > > --- a/include/linux/sched/topology.h
> > > > > +++ b/include/linux/sched/topology.h
> > > > > @@ -66,6 +66,7 @@ struct sched_domain_shared {
> > > > > atomic_t ref;
> > > > > atomic_t nr_busy_cpus;
> > > > > int has_idle_cores;
> > > > > + int is_overloaded;
> > > >
> > > > Can't nr_busy_cpus compared to sd->span_weight give you similar status ?
> > > >
> > >
> > > It's connected to nohz balancing and I didn't see how I could use that
> > > for detecting overload. Also, I don't think it ever can be larger than
> > > the sd weight and overload is based on the number of running tasks being
> > > greater than the number of available CPUs. Did I miss something obvious?
> >
> > IIUC you try to estimate if there is a chance to find an idle cpu
> > before starting the loop and scanning the domain and abort early if
> > the possibility is low.
> >
> > if nr_busy_cpus equals to sd->span_weight it means that there is no
> > free cpu so there is no need to scan
> >
>
> Ok, I see what you are getting at but I worry there are multiple
> problems there. First, the nr_busy_cpus is decremented only when a CPU
> is entering idle with the tick stopped. If nohz is disabled then this
> breaks, no? Secondly, a CPU can be idle but the tick not stopped if

But this can be changed if that make the statistic useful

> __tick_nohz_idle_stop_tick knows there is an event in the near future
> so using busy_cpus, we potentially miss a sibling that was adequate
> for running a task. Finally, the threshold for cutting off the search
> entirely seems low. The patch marks a domain as overloaded if there are
> twice as many running tasks as runqueues scanned. In that scenario, even
> if tasks are rapidly switching between busy/idle, it's still unlikely
> the task will go idle. When cutting off at just the fully-busy mark, we
> could miss a CPU that is going idle, almost idle or is running SCHED_IDLE
> tasks where are acceptable target candidates for select_idle_sibling. I
> think there are too many cases where nr_busy_cpus are problematic to
> make it a good alternative.

I don't really like this patch because it adds yet another metrics and
yet another feature which is set true by default. Also the current
proposal seems a bit fragile because it uses an arbitrary ratio of 2
on an arbitrary number of CPUs. This threshold probably works in your
case and your system but probably not for others and the threshold
really looks like a heuristic that works for you but without any real
meaning.

Then, the update is done at each and every task wake up and by all
CPUs in the LLC. It means that the same variable is updated
simultaneously by all CPUs: one CPU can set it and the next one might
clear it immediately because they haven't scanned the same CPUs. At
the end, 2 threads waking up simultaneously on different CPUS, might
end up using 2 different policy without any other reason than a random
ordering.

I agree that the concept of detecting that a LLC domain is overloaded
can be useful to decide to skip searching for an idle cpu but this
proposal seems to be not really generic

Vincent

>
> --
> Mel Gorman
> SUSE Labs