Re: [lkp-robot] [sched/fair] 6d46bd3d97: netperf.Throughput_tps -11.3% regression

From: Mike Galbraith
Date: Mon Sep 18 2017 - 01:31:24 EST

Next message: Leon Romanovsky: "Re: [PATCH] IB/mlx5: fix debugfs cleanup"
Previous message: kbuild test robot: "Re: [PATCH V5 3/3] scsi: Align queue to ARCH_DMA_MINALIGN in non-coherent DMA mode"
In reply to: Joel Fernandes: "Re: [lkp-robot] [sched/fair] 6d46bd3d97: netperf.Throughput_tps -11.3% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, 2017-09-17 at 14:41 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <efault@xxxxxx> wrote:
> > On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
> >>
> >> Yes I understand. However with my 'strong sync' patch, such a
> >> balancing check could be useful which is what I was trying to do in a
> >> different way in my patch - but it could be that my way is not good
> >> enough and potentially the old wake_affine check could help here
> >
> > Help how? The old wake_affine() check contained zero concurrency
> > information, it served to exclude excessive stacking, defeating the
> > purpose of SMP. A truly synchronous wakeup has absolutely nothing to
> > do with load balance in the general case: you can neither generate nor
> > cure an imbalance by replacing one (nice zero) task with another. The
> > mere existence of a load based braking mechanism speaks volumes.
>
> This is the part I didn't get.. you said "neither generate an
> imbalance", it is possible that a CPU with a high blocked load but
> just happen to be running 1 task at the time and did a sync wake up
> for another task. In this case dragging the task onto the waker's CPU
> might hurt it since it will face more competition than if it were
> woken up on its previous CPU which is possibly lighty loaded than the
> waker's CPU?

Sure, a preexisting imbalance, but what does that have to do with
generic pass the baton behavior, and knowing whether pull will result
in a performance win or loss? ÂIn the overloaded and imbalanced case, a
lower load number on this/that CPU means little in and of itself wrt
latency expectations... the wakee may or may not be next in line, may
or may not be able to cut the line (preempt).

> Also the other thing I didn't fully get is why is concurrency a
> discussion point here, in this case A wakes up B and goes to sleep,
> and then B wakes up A. They never run concurrently. Could you let me
> know what I am missing?

If waker does not IMMEDIATELY sleep, and you wake CPU affine, you toss
the time interval between wakeup and context switch out the window, the
wakee could have already been executing elsewhere when the waker gets
around to getting the hell out of the way.

The point I'm belaboring here is that we had that load check and more,
and it proved insufficient. ÂYeah, pipe-test absolutely loves running
on one CPU, but who cares, pipe-test is an overhead measurement tool,
NOT a benchmark, those try to emulate the real world.

Hell, take two large footprint tasks, and let them exchange data
briefly via pipe.. what would make it a wonderful idea to pull these
tasks together each and every time they did that? ÂBecause pipe-test
can thus play super fast ping-pong with itself? ÂObviously not.

> >> On systems with SMT, it may make more sense for
> >> > sync wakeups to look for idle threads of the same
> >> > core, than to have the woken task end up on the
> >> > same thread, and wait for the current task to stop
> >> > running.
> >>
> >> I am ok with additionally doing an select_idle_smt for the SMT cases.
> >> However Mike shows that it doesn't necessarily cause a performance
> >> improvement. But if there is consensus on checking for idle SMT
> >> threads, then I'm Ok with doing that.
> >
> > select_idle_sibling() used to check thread first, that was changed to
> > core first for performance reasons.
>
> Oh I see. Ok.
>
> >> > "Strong sync" wakeups like you propose would also
> >> > change the semantics of wake_wide() and potentially
> >> > other bits of code...
> >> >
> >>
> >> I understand, I am not very confident that wake_wide does the right
> >> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
> >> the most common usecase of display pipeline well. It seems that we
> >> have cases where the 'flip count' is really high and causes wake_wide
> >> all the time and sends us straight to the wake up slow path causing
> >> regressions in Android benchmarks.
> >
> > Hm. It didn't pull those counts out of the vacuum, it measured them.
> > It definitely does not force Android into the full balance path, that
> > is being done by Android developers, as SD_BALANCE_WAKE is off by
> > default. It was briefly on by default, but was quickly turned back off
> > because it... induced performance regressions.
> >
> > In any case, if you have cause to believe that wake_wide() is causing
> > you grief, why the heck are you bending up the sync hint?
>
> So its not just wake_wide causing the grief, even select_idle_sibling
> doesn't seem to be doing the right thing. We really don't want to wake
> up a task on an idle CPU if the current CPU is a better candidate.

That is the nut, defining better candidate is not so trivial. ÂLacking
concrete knowledge (omniscience?), you're always gambling, betting on
concurrency potential existing is much safer than betting against it if
your prime directive is to maximize utilization.

> Binder microbenchmarks should that (as you can see in the results in
> this patch) it performs better to wake up on the current CPU (no wake
> up from idle latency on a sibling etc). Perhaps we should fix
> select_idle_sibling to also consider latency of CPU to come out of
> idle?

That would add complexity. ÂIf you're gonna do that, you may as well go
the extra mile, nuke it, and unify the full balance path instead.

> Using the sync hint was/is a way on the product kernel to
> prevent both these paths. On current products, sync is ignored though
> if the system is in an "overutilized" state (any of the CPUs are more
> than 80% utilized) but otherwise the sync is used as a hard flag. This
> is all probably wrong though - considering the concurrency point you
> brought up..

My experience with this space was that it's remarkably easy to wreck
performance, why I was inspired to bang on podium with shoe. ÂI'm not
trying to discourage your explorations, merely passing along a well
meant "Danger Will Robinson".

-Mike

Next message: Leon Romanovsky: "Re: [PATCH] IB/mlx5: fix debugfs cleanup"
Previous message: kbuild test robot: "Re: [PATCH V5 3/3] scsi: Align queue to ARCH_DMA_MINALIGN in non-coherent DMA mode"
In reply to: Joel Fernandes: "Re: [lkp-robot] [sched/fair] 6d46bd3d97: netperf.Throughput_tps -11.3% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]