Re: [PATCH v2] sched/fair: Age the average idle time

From: Vincent Guittot
Date: Thu Jun 17 2021 - 04:30:25 EST


On Thu, 17 Jun 2021 at 09:44, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Jun 16, 2021 at 05:52:25PM +0200, Vincent Guittot wrote:
> > On Tue, 15 Jun 2021 at 22:43, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 15, 2021 at 12:16:11PM +0100, Mel Gorman wrote:
> > > > From: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> > > >
> > > > This is a partial forward-port of Peter Zijlstra's work first posted
> > > > at https://lore.kernel.org/lkml/20180530142236.667774973@xxxxxxxxxxxxx/.
> > >
> > > It's patches 2 and 3 together, right?
> > >
> > > > His Signed-off-by has been removed because the patch is modified, but it
> > > > will be restored if he says it's still ok.
> > >
> > > I suppose the SoB will auto-magically re-appear if I apply it :-)
> > >
> > > > The patch potentially matters when a socket has multiple LLCs as the
> > > > maximum search depth is lower. However, some of the test results were
> > > > suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and
> > > > other results were not dramatically different to other machines.
> > > >
> > > > Given the nature of the patch, Peter's full series is not being forward
> > > > ported as each part should stand on its own. Preferably they would be
> > > > merged at different times to reduce the risk of false bisections.
> > >
> > > I'm tempted to give it a go.. anyone object?
> >
> > Just finished running some tests on my large arm64 system.
> > Tbench results are a mix of small gains and losses
> >
>
> Same for tbench on three x86 machines I reran tests for
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/network-tbench/bing2/index.html#tbench4
> Small gains and losses, gains at higher client counts where search depth
> should be reduced
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/network-tbench/hardy2/index.html#tbench4
> Mostly gains, one counter-example at 4 clients
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/network-tbench/marvin2/index.html#tbench4
> Worst by far, 1 client took a major hit for unknown reasons, otherwise
> mix of gains and losses. I'm not confident that the 1 client
> results are meaningful because for this machine, there should
> have been idle cores so the code the patch adjusts should not
> even be executed.
>
> > hackbench shows significant changes in both directions
> > hackbench -g $group
> >
> > group  tip/sched/core      + this patch
> > 1      13.358(+/- 1.82%)   12.850(+/- 2.21%)    +4%
> > 4       4.286(+/- 2.77%)    4.114(+/- 2.25%)    +4%
> > 16      3.175(+/- 0.55%)    3.559(+/- 0.43%)   -12%
> > 32      2.912(+/- 0.79%)    3.165(+/- 0.95%)    -8%
> > 64      2.859(+/- 1.12%)    2.937(+/- 0.91%)    -3%
> > 128     3.092(+/- 4.75%)    3.003(+/- 5.18%)    +3%
> > 256     3.233(+/- 3.03%)    2.973(+/- 0.80%)    +8%
>
> Think this is processes and sockets. Of the hackbench results I had,
> this one performed the worst
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/scheduler-unbound/bing2/index.html#hackbench-process-sockets
> Small gains and losses
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/scheduler-unbound/hardy2/index.html#hackbench-process-sockets
> Small gains and losses
>
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/scheduler-unbound/marvin2/index.html#hackbench-process-sockets
> Small gains and losses
>
> One of the better results for hackbench was processes and pipes
> https://beta.suse.com/private/mgorman/melt/v5.13-rc5/3-perf-test/sched/sched-avgidle-v1r6/html/scheduler-unbound/bing2/index.html#hackbench-process-pipes
> 1-12% gains
>
> For your arm machine, how many logical CPUs are online, what is the level
> of SMT if any and is the machine NUMA?

It's a SMT4 x 28 cores x 2 NUMA nodes = 224 CPUs

>
> Fundamentally though, as the changelog notes "due to the nature of the
> patch, this is a regression magnet". There are going to be examples
> where a deep search is better even if a machine is fully busy or
> overloaded and examples where cutting off the search is better. I think
> it's better to have an idle estimate that gets updated if CPUs are fully
> busy even if it's not a universal win.

Although I agree that using a stale average idle time value for the local
CPU is not good, I'm not sure this proposal is better. The main problem is
that we use the avg_idle of the local CPU to estimate how many times
we should loop while trying to find another idle CPU, but there is no
direct relation between the two. Typically, a short average idle time on
the local CPU doesn't mean that there are fewer idle CPUs, and that's
why we see a mix of gains and losses.
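
To make the concern concrete, here is a minimal user-space sketch of the
kind of proportional scan-depth estimate being discussed. It is loosely
modelled on the SIS_PROP heuristic in select_idle_cpu() as it looked around
v5.13, but it is an approximation for illustration only: the function
names scan_depth() and finds_idle_cpu(), their parameters, and the flat
array of per-CPU idle flags are all hypothetical and are not kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the scan-depth estimate. The inputs are the
 * *local* CPU's average idle time, the average cost of a scan, and the
 * number of CPUs in the LLC domain. Note that nothing here describes
 * how many other CPUs are actually idle.
 */
static unsigned int scan_depth(uint64_t local_avg_idle_ns,
			       uint64_t avg_scan_cost_ns,
			       unsigned int llc_weight)
{
	/* Large fuzz factor on the local idle estimate (assumed: 512). */
	uint64_t avg_idle = local_avg_idle_ns / 512;
	/* +1 avoids a division by zero when no cost has been recorded. */
	uint64_t avg_cost = avg_scan_cost_ns + 1;
	uint64_t span_avg = (uint64_t)llc_weight * avg_idle;

	if (span_avg > 4 * avg_cost)
		return (unsigned int)(span_avg / avg_cost);

	/* Floor: always allow a handful of probes. */
	return 4;
}

/*
 * Hypothetical linear scan: probe at most 'depth' CPUs, in order, and
 * report whether any of the probed CPUs is idle.
 */
static bool finds_idle_cpu(const bool *cpu_idle, size_t ncpus,
			   unsigned int depth)
{
	for (size_t i = 0; i < ncpus && i < depth; i++) {
		if (cpu_idle[i])
			return true;
	}
	return false;
}
```

With these assumed numbers, scan_depth(51200, 999, 8) falls back to the
floor of 4 probes because the local CPU has been idle only briefly. If the
only idle CPUs happen to sit at positions 6 and 7 of the scan order,
finds_idle_cpu() with that depth misses them, even though idle CPUs exist.
The depth was derived purely from local state, which is the lack of a
direct relation described above.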

>
> --
> Mel Gorman
> SUSE Labs