Re: [PATCH v2] sched/fair: Age the average idle time

From: Mel Gorman
Date: Thu Jun 17 2021 - 07:06:20 EST


On Thu, Jun 17, 2021 at 12:02:56PM +0200, Vincent Guittot wrote:
> > > >
> > > > Fundamentally though, as the changelog notes "due to the nature of the
> > > > patch, this is a regression magnet". There are going to be examples
> > > > where a deep search is better even if a machine is fully busy or
> > > > overloaded and examples where cutting off the search is better. I think
> > > > it's better to have an idle estimate that gets updated if CPUs are fully
> > > > busy even if it's not a universal win.
> > >
> > > Although I agree that using a stale average idle time value of the
> > > local CPU is not good, I'm not sure this proposal is better. The main
> > > problem is that we use the avg_idle of the local CPU to estimate how
> > > many times we should loop and try to find another idle CPU. But there
> > > is no direct relation between the two.
> >
> > This is true. The idle time of the local CPU is used to estimate the
> > idle time of the domain which is inevitably going to be inaccurate but
>
> I'm more and more convinced that using average idle time (of the
> local cpu or the full domain) is not the right metric. In
> select_idle_cpu(), we look for an idle CPU but we don't care about
> how long it will be idle.

Can we predict that accurately? cpufreq for intel_pstate used to try
something like that but it was a bit fuzzy and I don't know if the
scheduler could do much better. There is some idle prediction stuff but
it's related to nohz which does not really help us if a machine is nearly
fully busy or overloaded.

I guess for tracking idle that revisiting
https://lore.kernel.org/lkml/1615872606-56087-1-git-send-email-aubrey.li@xxxxxxxxx/
is an option now that the scan is somewhat unified. A two-pass scan
could be used to check potentially idle CPUs first and if there is
sufficient search depth left, scan other CPUs. There were some questions
on how accurate the idle mask was and how expensive it was to maintain.
Unfortunately, it would not help with scan depth calculations; it might
just reduce useless scanning.
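
To make the two-pass idea concrete, here is a rough userspace sketch. It is
not kernel code and not what the linked series implements: the function
name two_pass_scan(), the 32-bit masks and NR_CPUS are all hypothetical,
and really_idle merely stands in for an authoritative per-CPU idle check.
The point is only that hinted CPUs consume the depth budget first and the
remaining CPUs are visited with whatever budget is left.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8	/* hypothetical small machine */

/*
 * Two-pass search for an idle CPU, starting at 'target' and wrapping.
 *   idle_hint:   CPUs believed to be idle (cheap to read, may be stale)
 *   really_idle: stand-in for an authoritative idle check
 *   depth:       maximum number of CPUs to examine across both passes
 * Returns the CPU found or -1 if the budget runs out or nothing is idle.
 */
static int two_pass_scan(uint32_t idle_hint, uint32_t really_idle,
			 int target, int depth)
{
	/* Pass 0 visits only hinted CPUs, pass 1 visits the rest. */
	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < NR_CPUS; i++) {
			int cpu = (target + i) % NR_CPUS;
			bool hinted = idle_hint & (1u << cpu);

			if (hinted != (pass == 0))
				continue;
			if (depth-- <= 0)
				return -1;
			if (really_idle & (1u << cpu))
				return cpu;
		}
	}
	return -1;
}
```

Note that a stale hint still costs budget: a CPU that is hinted idle but is
actually busy is examined and counted before the scan moves on.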

Selecting based on avg idle time could be interesting but hazardous. If,
for example, we prioritised selecting a CPU that is mostly idle, it would
also pick CPUs that are potentially in a deep idle state, incurring a
larger wakeup cost. Right now we are not much better because we just
select an idle CPU and hope for the best, but always targeting the most
idle CPU could have problems. There would also be the cost of tracking
idle CPUs in priority order. It would eliminate the scan depth cost
calculations but the overall cost would be much worse.

Hence, I still think we can improve the scan depth costs in the short
term until a replacement is identified that works reasonably well.
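
For readers following along, the scan depth calculation under discussion
can be sketched roughly as below. This is a userspace approximation
modelled on the SIS_PROP logic in select_idle_cpu() as it stood around
this time, not a copy of it and not the patch in this thread: the helper
name scan_depth() is hypothetical, and the divide-by-512 fuzz factor and
the minimum depth of 4 should be treated as assumptions rather than
authoritative values.

```c
#include <stdint.h>

/*
 * Estimate how many CPUs to scan for an idle one.
 *   avg_idle_ns:      average idle time of the local CPU
 *   avg_scan_cost_ns: average cost of a scan in this domain
 *   span_weight:      number of CPUs in the domain
 * The depth scales with local idle time relative to scan cost and never
 * drops below a small floor.
 */
static unsigned int scan_depth(uint64_t avg_idle_ns,
			       uint64_t avg_scan_cost_ns,
			       unsigned int span_weight)
{
	uint64_t avg_idle = avg_idle_ns / 512;		/* assumed fuzz factor */
	uint64_t avg_cost = avg_scan_cost_ns + 1;	/* avoid divide by zero */
	uint64_t span_avg = span_weight * avg_idle;

	if (span_avg > 4 * avg_cost)
		return (unsigned int)(span_avg / avg_cost);

	return 4;					/* assumed minimum depth */
}
```

If avg_idle_ns is only refreshed when the CPU goes idle, a fully busy CPU
keeps feeding the same old value into this calculation, so the depth stays
fixed regardless of how the load has changed.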

> Even more, we can scan all CPUs whatever the
> avg idle time if there is a chance that there is an idle core.
>

That is an important, but separate topic. It's known that the idle core
detection can yield false positives. Putting core scanning under SIS_PROP
had mixed results when we last tried but things change. Again, it doesn't
help with scan depth calculations.

> > tracking idle time for the domain will be cache write intensive and
> > potentially very expensive. I think this was discussed before but maybe
> > it is my imagination.
> >
> > > Typically, a short average idle time on
> > > the local CPU doesn't mean that there are fewer idle CPUs and that's
> > > why we have a mix of gain and loss
> > >
> >
> > Can you evaluate if scanning proportional to cores helps if applied on
> > top? The patch below is a bit of pick&mix and has only seen a basic build
>
> I will queue it for some test later today
>

Thanks. The proposed patch has since passed a build and boot test.
Performance evaluation is under way but as it's x86 and SMT2, I'm mostly
just checking that it's neutral.

--
Mel Gorman
SUSE Labs