Re: [PATCH v2] sched/fair: Age the average idle time

From: Vincent Guittot
Date: Thu Jun 17 2021 - 12:06:00 EST


On Thu, 17 Jun 2021 at 17:48, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 17, 2021 at 05:03:54PM +0200, Vincent Guittot wrote:
> > On Thu, 17 Jun 2021 at 13:05, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 17, 2021 at 12:02:56PM +0200, Vincent Guittot wrote:
> > > > > > >
> > > > > > > Fundamentally though, as the changelog notes "due to the nature of the
> > > > > > > patch, this is a regression magnet". There are going to be examples
> > > > > > > where a deep search is better even if a machine is fully busy or
> > > > > > > overloaded and examples where cutting off the search is better. I think
> > > > > > > it's better to have an idle estimate that gets updated if CPUs are fully
> > > > > > > busy even if it's not a universal win.
> > > > > >
> > > > > > Although I agree that using a stale average idle time value from the
> > > > > > local CPU is not good, I'm not sure this proposal is better. The main
> > > > > > problem is that we use the avg_idle of the local CPU to estimate how
> > > > > > many times we should loop and try to find another idle CPU, but there
> > > > > > is no direct relation between the two.
> > > > >
> > > > > This is true. The idle time of the local CPU is used to estimate the
> > > > > idle time of the domain which is inevitably going to be inaccurate but
> > > >
> > > > I'm more and more convinced that using average idle time (of the
> > > > local cpu or the full domain) is not the right metric. In
> > > > select_idle_cpu(), we look for an idle CPU but we don't care about
> > > > how long it will be idle.
> > >
> > > Can we predict that accurately? cpufreq for intel_pstate used to try
> > > something like that but it was a bit fuzzy and I don't know if the
> > > scheduler could do much better. There is some idle prediction stuff but
> > > it's related to nohz which does not really help us if a machine is nearly
> > > fully busy or overloaded.
> > >
> > > I guess that, for tracking idle CPUs, revisiting
> > > https://lore.kernel.org/lkml/1615872606-56087-1-git-send-email-aubrey.li@xxxxxxxxx/
> > > is an option now that the scan is somewhat unified. A two-pass scan
> > > could be used to check potentially idle CPUs first and if there is
> > > sufficient search depth left, scan other CPUs. There were some questions
> >
> > I think it's the other way around:
> > a CPU is busy for sure if it is not set in the cpuidle_mask, so we
> > don't need to check it. But a CPU might not be idle even if it is set
> > in the idle mask, because the mask is only cleared during the tick.
> >
>
> Tick is a long time so scan depth may still be a problem.

Thinking a bit more about the cpuidle_mask patch, I'm not sure that we
really need to scale the depth of the scan according to avg_idle and
avg_scan_cost. A fixed value (which one? that might be the tricky point)
should be enough. The reasons for using a fixed number of iterations are:
- ideally cpuidle_mask contains only idle CPUs, which means that the
first iteration should return an idle CPU
- but we might have some false idle CPUs in the mask for a use case
with a lot of task wakeups/sleeps
- in that case, we will start to iterate more than once over the
cpuidle_mask because the CPUs are not idle. We can also consider that
if the first X CPUs are falsely idle, the probability that the next
ones are falsely idle too is significant, and it isn't worth looking
further.
- currently, we limit the number of iterations so as not to delay the
wakeup path too much and not to steal too much time from the task
running on the local CPU, so we could at least check whether the local
CPU is idle and pick a fixed iteration count that doesn't hurt the
wakeup latency too much, along the lines of the sketch below
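
A minimal user-space model of that idea, not kernel code: the mask
contents, the SCAN_BUDGET value and the helpers (cpu_really_idle,
scan_idle_mask) are made-up placeholders, only the shape of the loop
matters: skip CPUs that are busy for sure, and bail out after a fixed
number of falsely idle CPUs instead of a number derived from avg_idle.

/*
 * Sketch of a fixed-budget scan over a hypothetical cpuidle_mask.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		8
#define SCAN_BUDGET	4	/* the fixed value; choosing it is the tricky part */

/* bit i set => CPU i advertised itself as idle (possibly stale) */
static unsigned long cpuidle_mask = 0x96;	/* CPUs 1, 2, 4 and 7 */

/* ground truth for this simulation: is the CPU really idle right now? */
static bool cpu_really_idle(int cpu)
{
	return cpu == 7;	/* CPUs 1, 2 and 4 are false idles here */
}

static int scan_idle_mask(void)
{
	int budget = SCAN_BUDGET;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(cpuidle_mask & (1UL << cpu)))
			continue;	/* busy for sure, no need to check */
		if (cpu_really_idle(cpu))
			return cpu;	/* ideally found on the first hit */
		if (--budget <= 0)
			break;		/* too many false idles, stop early */
	}
	return -1;	/* caller falls back to the local/previous CPU */
}

int main(void)
{
	printf("picked CPU %d\n", scan_idle_mask());
	return 0;
}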

I'm going to play a bit with this
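
For comparison, here is a rough user-space model of the avg_idle /
avg_scan_cost scaling that currently decides the scan depth under
SIS_PROP. It is simplified from select_idle_cpu(): the real code also
feeds the measured scan time back into avg_scan_cost, and the exact
constants vary between trees, so treat it as a sketch rather than the
kernel's code.

#include <stdint.h>
#include <stdio.h>

/* rough model of the SIS_PROP depth computation, not the real kernel code */
static unsigned int sis_prop_depth(uint64_t avg_idle_ns, uint64_t avg_scan_cost_ns,
				   unsigned int span_weight)
{
	uint64_t avg_idle = avg_idle_ns / 512;		/* large fuzz factor */
	uint64_t avg_cost = avg_scan_cost_ns + 1;	/* avoid division by zero */
	uint64_t span_avg = (uint64_t)span_weight * avg_idle;

	if (span_avg > 4 * avg_cost)
		return (unsigned int)(span_avg / avg_cost);
	return 4;					/* minimum depth of 4 CPUs */
}

int main(void)
{
	/* e.g. 50us of recent idle time, ~600ns average scan cost, 64-CPU LLC */
	printf("scan depth: %u CPUs\n", sis_prop_depth(50000, 600, 64));
	return 0;
}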


>
> > > Selecting based on avg idle time could be interesting but hazardous. If
> > > for example, we prioritised selecting a CPU that is mostly idle, it'll
> > > also pick CPUs that are potentially in a deep idle state incurring a
> > > larger wakeup cost. Right now we are not much better because we just
> > > select an idle CPU and hope for the best but always targeting the most
> > > idle CPU could have problems. There would also be the cost of tracking
> > > idle CPUs in priority order. It would eliminate the scan depth cost
> > > calculations but the overall cost would be much worse.
> > >
> > > Hence, I still think we can improve the scan depth costs in the short
> > > term until a replacement is identified that works reasonably well.
> > >
> > > > Moreover, we can scan all CPUs, whatever the
> > > > avg idle time, if there is a chance that there is an idle core.
> > > >
> > >
> > > That is an important, but separate topic. It's known that the idle core
> > > detection can yield false positives. Putting core scanning under SIS_PROP
> > > had mixed results when we last tried but things change. Again, it doesn't
> > > help with scan depth calculations.
> >
> > my point was mainly to highlight that the path can take opposite
> > decisions for the same avg_idle value:
> > - scan all CPUs if has_idle_core is true, whatever avg_idle
> > - limit the depth if has_idle_core is false and avg_idle is short
> >
>
> I do understand the point but the idle core scan anomaly was not
> intended to be addressed in the patch because putting the idle scan
> under SIS_PROP potentially means using cpus with active idle siblings
> prematurely.
>
> > >
> > > > > tracking idle time for the domain will be cache-write intensive and
> > > > > potentially very expensive. I think this was discussed before but maybe
> > > > > it is my imagination.
> > > > >
> > > > > > Typically, a short average idle time on
> > > > > > the local CPU doesn't mean that there are fewer idle CPUs, and that's
> > > > > > why we have a mix of gains and losses
> > > > > >
> > > > >
> > > > > Can you evaluate if scanning proportional to cores helps if applied on
> > > > > top? The patch below is a bit of pick&mix and has only seen a basic build
> > > >
> > > > I will queue it for some tests later today
> > > >
> > >
> > > Thanks. The proposed patch has since passed a build and boot test;
> > > the performance evaluation is under way, but as it's x86 and SMT2,
> > > I'm mostly just checking that it's neutral.
> >
> > Results stay similar:
> > group   tip/sched/core      + this patch              + latest addon
> >     1   13.358 (+/- 1.82%)  12.850 (+/- 2.21%)   +4%  13.411 (+/- 2.47%)   -0%
> >     4    4.286 (+/- 2.77%)   4.114 (+/- 2.25%)   +4%   4.163 (+/- 1.88%)   +3%
> >    16    3.175 (+/- 0.55%)   3.559 (+/- 0.43%)  -12%   3.535 (+/- 0.52%)  -11%
> >    32    2.912 (+/- 0.79%)   3.165 (+/- 0.95%)   -8%   3.153 (+/- 0.76%)  -10%
> >    64    2.859 (+/- 1.12%)   2.937 (+/- 0.91%)   -3%   2.919 (+/- 0.73%)   -2%
> >   128    3.092 (+/- 4.75%)   3.003 (+/- 5.18%)   +3%   2.973 (+/- 0.90%)   +4%
> >   256    3.233 (+/- 3.03%)   2.973 (+/- 0.80%)   +8%   3.036 (+/- 1.05%)   +6%
> >
>
> Ok, accounting for SMT4 didn't help.
>
> --
> Mel Gorman
> SUSE Labs