Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
From: Mike Galbraith
Date: Sat Jan 05 2013 - 03:13:16 EST
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
> Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
> The function of select_idle_sibling() is to place the woken up task in the
> vicinity of the waking cpu or on the previous cpu depending on what wake_affine() says.
> This placement being only in an idle group. If an idle group is not found, the
> fallback cpu is either the waking cpu or the previous cpu accordingly.
> This results in the runqueue of the waking cpu or the previous cpu getting
> overloaded when the system is committed, which is a latency hit to these tasks.
> What is required is that the newly woken up tasks be placed close to the wake
> up cpu or the previous cpu, whichever is best, for reasons to avoid latency hit and cache
> coldness respectively. This is achieved with wake_affine() deciding which
> cache domain the task should be placed on.
> Once this is decided, instead of searching for a completely idle group, let us
> search for the idlest group. This will anyway return a completely idle group
> if it exists and its mechanism will fall back to what select_idle_sibling()
> was doing. But if this fails, find_idlest_group() continues the search for a
> relatively more idle group.
> The argument could be that, we wish to avoid migration of the newly woken up
> task to any other group unless it is completely idle. But in this case, to
> begin with we choose a sched domain, within which a migration could be less
> harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to co-operate
> with the same.
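To make the difference described above concrete, here is a rough userspace
sketch, not the patch itself. struct group, find_idle_group() and
find_least_loaded_group() are hypothetical stand-ins; the kernel's
find_idlest_group() also weighs the local group and an imbalance threshold,
which this ignores.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sched group: just an aggregate load figure. */
struct group {
    unsigned long load;
};

/* "Completely idle or nothing": index of the first idle group, -1 if none. */
static int find_idle_group(const struct group *g, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (g[i].load == 0)
            return (int)i;
    return -1;
}

/*
 * "Idlest": index of the least loaded group (n must be > 0).
 * If a completely idle group exists, this returns the same index as
 * find_idle_group(); otherwise it still returns a usable answer.
 */
static int find_least_loaded_group(const struct group *g, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (g[i].load < g[best].load)
            best = i;
    return (int)best;
}
```

With no idle group the first helper gives up and the caller falls back to the
waking or previous cpu, while the second still picks the least loaded group.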
Fast movers currently suffer from traversing a large package, mostly due
to traversal order walking 1:1 buddies hand in hand across the whole
package endlessly. With only one buddy pair running, it's horrific.
Even if you change the order to be friendlier, perturbation induces
bouncing. More spots to bounce to equals more bouncing. Ergo, I cross
coupled cpu pairs to eliminate that. If buddies are perturbed, having
one and only one buddy cpu pulls them back together, so can't induce a
bounce fest, only correct. That worked well, but had the downside that
some loads really REALLY want maximum spread, so suffer when you remove
migration options as I did. There's an in_interrupt() consideration I'm
not so sure of too; in that case, going the extra mile to find an idle
hole to plug _may_ be worth some extra cost too.. dunno.
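Roughly, the cross coupling boils down to something like the sketch below.
This is illustrative userspace C, not the actual change: PKG_CPUS, buddy_of()
and select_buddy_only() are made-up names, the cpu ^ 1 pairing is an
assumption, and the idle[] array stands in for a real idle check.

```c
#include <stdbool.h>

#define PKG_CPUS 8  /* hypothetical package size, for illustration only */

/*
 * Assumed pairing: 0<->1, 2<->3, ... Every cpu has one and only one
 * buddy, and the relation is symmetric (cross coupled).
 */
static int buddy_of(int cpu)
{
    return cpu ^ 1;
}

/*
 * Wakeup placement when the buddy is the ONLY migration option:
 * use the target if idle, else its buddy if idle, else stay put.
 * The rest of the package is never traversed.
 */
static int select_buddy_only(int target, const bool idle[PKG_CPUS])
{
    int buddy = buddy_of(target);

    if (idle[target])
        return target;
    if (idle[buddy])
        return buddy;
    return target;
}
```

Since each cpu has exactly one place to send a task, a perturbed pair can only
be pulled back together. The cost is the one mentioned above: loads that want
maximum spread have nowhere else to go.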
So wrt integration, what if a buddy cpu were made a FIRST choice of
generic wake balancing vs the ONLY choice of select_idle_sibling() as I
did? If buddy cpu is available, cool, perturbed pairs find each other
and pair back up, if not, and you were here too recently, you stay with
prev_cpu, avoid bounce and traversal at high frequency. All tasks can
try the cheap buddy cpu first, all can try full domain as well, just not
at insane rates. The heavier the short term load average (or such, with
instant decay on longish idle ala idle balance throttle so you ramp
well), the longer the 'forget eating full balance' interval becomes,
with cutoff affecting the cheap but also not free cross coupled buddy
cpu as well at some point. Looking for an idle cpu at hefty load is a
waste of cycles at best, plugging micro-holes does nothing good even if
you find one, forget wake balance entirely at some cutoff, let periodic
balancing do its thing in peace.
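As a rough sketch of that decision order (userspace C, nothing here is tested
kernel code): the names, the 1ms base interval, the linear scaling with load
and the cutoff value are all assumptions picked for illustration, and the
instant decay of the load figure after a longish idle is left out.

```c
#include <stdbool.h>
#include <stdint.h>

enum wake_choice { WAKE_STAY, WAKE_BUDDY, WAKE_FULL_SEARCH };

/* Hypothetical per-task state: when it last paid for a full domain search. */
struct wake_state {
    uint64_t last_full_search_ns;
};

#define BASE_INTERVAL_NS 1000000ULL /* assumed 1ms base throttle interval */
#define LOAD_CUTOFF      8U         /* assumed: at or above this, no wake balance */

/* The heavier the short term load, the longer full searches are skipped. */
static uint64_t throttle_interval(unsigned int load)
{
    return BASE_INTERVAL_NS * (load + 1);
}

static enum wake_choice wake_balance_choice(struct wake_state *ws,
                                            uint64_t now_ns,
                                            unsigned int load,
                                            bool buddy_idle)
{
    /* Hefty load: leave even the cheap buddy check to periodic balancing. */
    if (load >= LOAD_CUTOFF)
        return WAKE_STAY;

    /* Cheap first choice: perturbed pairs find each other again. */
    if (buddy_idle)
        return WAKE_BUDDY;

    /* Here too recently: stay on prev_cpu, avoid bounce and traversal. */
    if (now_ns - ws->last_full_search_ns < throttle_interval(load))
        return WAKE_STAY;

    /* Otherwise the full domain search is allowed, and rate limited. */
    ws->last_full_search_ns = now_ns;
    return WAKE_FULL_SEARCH;
}
```

The caller would map WAKE_STAY to prev_cpu, WAKE_BUDDY to the buddy cpu, and
WAKE_FULL_SEARCH to whatever generic wake balancing does over the domain.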
Hrmph, that's too many words, but basically, I think whacking
select_idle_sibling() integration into wake balance makes loads of
sense, but needs a bit more to not end up just moving the problems to a
I still have a 2.6-rt problem I need to find time to squabble with, but
maybe I'll soonish see if what you did plus what I did combined works
out on that 4x10 core box where current is _so_ unbelievably horrible.
Heck, it can't get any worse, and the restricted wake balance alone
kinda sorta worked.