Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
From: Peter Zijlstra
Date: Mon Feb 05 2018 - 07:49:15 EST
On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> > But then you get that atomic crud to contend on the cluster level, which
> > is even worse than it contending on the core level.
>
> True, but it can still be a net win if we make better scheduling decisions.
> A saving grace is that the atomic counter is only updated if the cpu
> makes a transition from idle to busy or vice versa.
Which can still be a very high rate for some workloads. I always forget
which, but there are plenty workloads that have very frequenct very
short idle times. Mike, do you remember what comes apart when we take
out the sysctl_sched_migration_cost test in idle_balance()?
> We need data for this type of system, showing improvements for normal
> workloads, and showing little downside for a high context switch rate
> torture test.
So core-wide atomics should, on architectures that can do atomics in L1,
be relatively fast. Once you leave L1, atomic contention goes up a fair
bit. And then there's architectures that simply don't do atomics in L1
(like Power).
Testing on my SKL desktop, atomics contending between SMT siblings is
basically free (weirdly enough my test says its cheaper), atomics
contending on the L3 is 10x as expensive, and this is with only 4 cores.
If I test it on my 10 core IVB, I'm up to 20x, and I can't imagine that
getting any better with bigger core count (my IVB does not show SMT
contention as lower, but not particularly more expensive either).
So while I see the point of tracking these numbers (for SMT>2), I don't
think its worth doing outside of the core, and then we still need some
powerpc (or any other architecture with abysmal atomics) tested.
So what we can do is make this counting crud conditional on SMT>2 and
possibly part of the topology flags such that an architecture can
opt-out.
Then select_idle_core can be augmented to remember the least-loaded core
it encounters in its traversal, and go with that.