Re: wake_wide mechanism clarification

From: Joel Fernandes
Date: Sat Jul 29 2017 - 16:19:09 EST


Hi Mike,

On Sat, Jul 29, 2017 at 8:07 AM, Mike Galbraith
<umgwanakikbuti@xxxxxxxxx> wrote:
> On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> I have spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in wake_wide() still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>> <umgwanakikbuti@xxxxxxxxx> wrote:
>> > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>> >>
>> >> > That makes sense that we multiply slave's flips by a factor because
>> >> > it's low, but I still didn't get why the factor is chosen to be
>> >> > llc_size instead of something else for the multiplication with slave
>> >> > (slave * factor).
>> >
>> >> Yeah I don't know why llc_size was chosen...
<snip>
>> >
>> > The goal of wake wide was to approximate when pulling would be a futile
>> > consolidation effort and counterproductive to scaling. 'course with
>> > ever increasing socket size, any 1:N waker is ever more likely to run
>> > out of CPU for its one and only self (slamming into the scaling wall)
>> > before it needs to turn its minions loose to conquer the world.
>>
>> Actually the original question was why we have the second condition
>> as "master < slave * factor" instead of "master < factor"; that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide()
>> if master < factor?
>>
>> In fact, as the factor is set to llc_size, the condition that makes
>> sense to me is:
>>
>> 	if ((master + slave) < llc_size)
>> 		return 0;
>
> That says to me turn affine wakeups off for nearly everything.

Ok, I see that now. Thanks.
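
(For reference, for anyone following along: the heuristic under
discussion looks roughly like this in kernel/sched/fair.c as of v4.12;
Michael Wang's original nr_wakee_switch counter has since been renamed
wakee_flips.)

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size);

	/* Make master the side with more flips. */
	if (master < slave)
		swap(master, slave);
	/*
	 * 0 => consider an affine wakeup, 1 => wake wide.  The second
	 * condition is the "master < slave * factor" we are debating.
	 */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}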

>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>
> Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is
> the one flipping partners at least N/s, and the other about N times as
> often? If so, the two may be part of a 1:N too big to wisely pull.
>
> If you have a better idea, by all means, pull it out. Nobody is

Sure, yeah. First I'm trying to understand the heuristic itself, and
I'm glad to be making progress thanks to your and others' help!

> attached to wake_wide(), in fact, I suspect Peter hates it. I'm not
> fond of it either, it having obvious holes. The only thing it has
> going for it is simplicity. Bend it up, replace it, fire away.
>

Ok, it makes much more sense to me now. Also, for the N:N case,
wouldn't excessive wake-affine increase latency, and wouldn't
spreading be better? Say both slave and master flips are much greater
than factor (llc_size): then "slave > factor && master < slave *
factor" would probably hold a lot of the time, and we would return 0,
causing an affine wakeup. That's probably a bad thing, right, as it
could quickly overload the waker's CPU? I guess the heuristic tries
to maximize cache hits more than reduce latency?
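
To make my worry concrete with made-up numbers (a standalone sketch,
not the kernel code), assuming factor = llc_size = 4:

/*
 * Standalone sketch of the wake_wide() decision with made-up flip
 * counts; factor stands in for llc_size.
 */
#include <stdio.h>

static int wake_wide_sketch(unsigned int waker_flips,
			    unsigned int wakee_flips,
			    unsigned int factor)
{
	unsigned int master = waker_flips, slave = wakee_flips;

	if (master < slave) {		/* same swap as the kernel's */
		unsigned int tmp = master;
		master = slave;
		slave = tmp;
	}
	/* 0 => try an affine wakeup, 1 => wake wide */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

int main(void)
{
	/* N:N-like: both flip heavily, ratio ~1:1 => prints 0 (affine) */
	printf("N:N -> %d\n", wake_wide_sketch(100, 90, 4));
	/* 1:N-like: master flips ~factor times as often => prints 1 (wide) */
	printf("1:N -> %d\n", wake_wide_sketch(400, 8, 4));
	return 0;
}

So with master = 100 and slave = 90, master (100) < slave * factor
(360) holds and we go affine, even though both tasks are flipping far
more often than factor.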

>> Again, I didn't follow why the second condition couldn't just be
>> "waker->nr_wakee_switch > factor", or "(waker->nr_wakee_switch +
>> wakee->nr_wakee_switch) > factor", based on the explanation from
>> Michael Wang that I quoted above, and why he's instead doing the
>> multiplication I mentioned earlier: "factor * wakee->nr_wakee_switch".
>>
>> Rephrasing my question another way: why are we taking the ratio of
>> master/slave instead of the sum when comparing against factor? I am
>> surely missing something here.
>
> Because the heuristic tries not to demolish 1:1 buddies. A big
> partner flip delta means the two are unlikely to be a communicating
> pair, perhaps one communicating at high frequency, where misses hurt
> like hell.

But it does seem to me to demolish N:N communicating pairs from a
latency/load-balancing standpoint. For the case of N readers and N
writers, the ratio (master/slave) comes down to 1:1 and we wake
affine. Hopefully I didn't miss something too obvious there.
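
One more data point from reading the code: the flips only increment
when a task wakes a different wakee than last time, and they decay by
half every second, so a 1:1 pair stays near zero while both sides of
an N:N workload can legitimately stay high. From kernel/sched/fair.c
(roughly, as of v4.12):

static void record_wakee(struct task_struct *p)
{
	/*
	 * Only decay a single time; tasks that have less than 1 wakeup
	 * per jiffy will not have built up many flips.
	 */
	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

	/* Only a change of wakeup partner counts as a flip. */
	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}
}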

thanks,

-Joel