Re: sysbench throughput degradation in 4.13+

From: Rik van Riel
Date: Wed Oct 04 2017 - 14:02:55 EST

On Wed, 2017-10-04 at 18:18 +0200, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA
> > workloads
> > over the weekend.
> >
> > The trivial thing regresses a wee bit on the overloaded case, I've
> > not
> > yet tried to fix it.
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to
> lift
> the overloaded case (including hackbench).
> Its not an unconditional win, but I'm tempted to default enable
> WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

Enabling both makes sense to me.

We have four cases to deal with:
- mostly idle system, in that case we don't really care,
since select_idle_sibling will find an idle core anywhere
- partially loaded system (say 1/2 or 2/3 full), in that case
WA_IDLE will be a great policy to help locate an idle CPU
- fully loaded system, in this case either policy works well
- overloaded system, in this case WA_WEIGHT seems to do the
trick, assuming load balancing results in largely similar
loads between cores inside each LLC

The big danger is affine wakeups messing up the balance
the load balancer works on, with the two mechanisms
messing up each other's placement.

However, there seems to be very little we can actually
do about that, without the unacceptable overhead of
examining the instantaneous loads on every CPU in an
LLC - otherwise we end up either overshooting, or not
taking advantage of idle CPUs, due to the use of cached
load values.

All rights reversed

Attachment: signature.asc
Description: This is a digitally signed message part