Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort

From: Vincent Guittot
Date: Fri Oct 29 2021 - 03:49:46 EST


On Fri, 29 Oct 2021 at 01:25, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi, Vincent, Peter,
>
> On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the field
> > has not been updated for a while. A recent update will not be decayed
> > immediately but only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > been removed as the 500us value is quite large and prevents opportunities to
> > pull tasks on the newly idle CPU for at least the 1st domain levels.
>
> It appears this series is not yet in Linus's upstream tree. What's the latest on it?
>

I sent an add-on yesterday to cover cases that Tim cares about.

> I often see on ARM64 devices that load balance is skipped due to the
> high sysctl_sched_migration_cost. I saw another thread as well where

Have you tested the patchset? Does it enable more load balancing on
your platform?

> someone complained the performance varies and the default might be too high:
> https://lkml.org/lkml/2021/9/14/150

Added Yicong and Barry to the list.

>
> Any other thoughts? Happy to help on any progress on this series as well. Thanks,
>
> - Joel
>
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> >           min    avg    max
> > SMT:      1us   33us  273us - this one includes the update of blocked load
> > MC:       7us   49us  398us
> > NUMA:    10us   45us  158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group :
> > group   tip/sched/core     + this patchset
> > 1       15.189(+/- 2%)     14.987(+/- 2%)    +1%
> > 4        4.336(+/- 3%)      4.322(+/- 5%)    +0%
> > 16       3.654(+/- 1%)      2.922(+/- 3%)   +20%
> > 32       3.209(+/- 1%)      2.919(+/- 3%)    +9%
> > 64       2.965(+/- 1%)      2.826(+/- 1%)    +4%
> > 128      2.954(+/- 1%)      2.993(+/- 8%)    -1%
> > 256      2.951(+/- 1%)      2.894(+/- 1%)    +2%
> >
> > tbench and reaim have not shown any difference
> >
> > Change since v2:
> > - Update and decay of sd->last_decay_max_lb_cost are gathered in
> > update_newidle_cost(). The behavior remains almost the same except that
> > the decay can happen during newidle_balance now.
> >
> > Test results haven't shown any differences.
> >
> > I haven't modified rq->max_idle_balance_cost. It acts as the max value
> > for avg_idle and prevents the latter from reaching a high value during a
> > long idle phase. Moving to an IIR filter instead could delay the convergence
> > of avg_idle to a reasonable value that reflects the current situation.
> >
> > - Added a minor cleanup of newidle_balance
> >
> > Change since v1:
> > - account the time spent in update_blocked_averages() in the 1st domain
> >
> > - reduce the number of calls to sched_clock_cpu()
> >
> > - change the way max_newidle_lb_cost is decayed. Peter suggested using an
> > IIR but keeping track of the current max value gave the best result
> >
> > - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> > as suggested by Peter
> >
> > Vincent Guittot (5):
> >   sched/fair: Account update_blocked_averages in newidle_balance cost
> >   sched/fair: Skip update_blocked_averages if we are defering load
> >     balance
> >   sched/fair: Wait before decaying max_newidle_lb_cost
> >   sched/fair: Remove sysctl_sched_migration_cost condition
> >   sched/fair: cleanup newidle_balance
> >
> >  include/linux/sched/topology.h |  2 +-
> >  kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
> >  kernel/sched/topology.c        |  2 +-
> >  3 files changed, 45 insertions(+), 24 deletions(-)
> >
> > --
> > 2.17.1
> >
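
For readers following the thread, the delayed decay of max_newidle_lb_cost and
the early abort against avg_idle described in the cover letter can be sketched
in plain userspace C as below. This is only an illustration under stated
assumptions, not the code from the patches: struct sd_cost, the *_sketch
function names, the tick-based clock, the 250-tick interval and the 253/256
decay factor are all assumed here, and counter wrap-around is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sched_domain fields named in the thread. */
struct sd_cost {
	uint64_t max_newidle_lb_cost;    /* highest newidle lb cost seen, in ns */
	uint64_t last_decay_max_lb_cost; /* tick of the last update or decay */
};

/* Assumed values for illustration only. */
#define DECAY_INTERVAL_TICKS 250u /* wait this long before decaying */
#define DECAY_NUM 253u            /* 253/256 is roughly a 1% decay */
#define DECAY_DEN 256u

/*
 * Record one cost sample measured at tick 'now'.
 * A larger sample replaces the max and restarts the wait; otherwise the
 * max is decayed only once it has gone unchanged for the whole interval.
 * Returns true when a decay happened.
 */
static inline bool update_newidle_cost_sketch(struct sd_cost *sd,
					      uint64_t now, uint64_t cost)
{
	if (cost > sd->max_newidle_lb_cost) {
		sd->max_newidle_lb_cost = cost;
		sd->last_decay_max_lb_cost = now;
		return false;
	}

	if (now > sd->last_decay_max_lb_cost + DECAY_INTERVAL_TICKS) {
		sd->max_newidle_lb_cost =
			sd->max_newidle_lb_cost * DECAY_NUM / DECAY_DEN;
		sd->last_decay_max_lb_cost = now;
		return true;
	}

	return false;
}

/*
 * Early abort check for one domain level: give up when the expected idle
 * time cannot cover the cost already spent plus this level's max cost.
 */
static inline bool should_abort_newidle_sketch(uint64_t avg_idle,
					       uint64_t curr_cost,
					       const struct sd_cost *sd)
{
	return avg_idle < curr_cost + sd->max_newidle_lb_cost;
}
```

With these assumed constants, a max of 256000ns recorded at tick 10 stays
untouched by smaller samples until tick 261, when it decays to 253000ns. Since
curr_cost accumulates across levels, time accounted at the 1st domain level
(such as update_blocked_averages) also counts against the levels above it.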