Re: wake_wide mechanism clarification

From: Joel Fernandes
Date: Sat Jul 29 2017 - 04:13:59 EST


+Michael Wang at his current email address (the old one bounced). (My
reply was to Mike Galbraith, but I also meant to CC Michael Wang for
the discussion.) Thanks

On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
> Hi Mike,
>
> I have spent some time understanding the email thread and
> previous discussions. Unfortunately, the second condition we are
> checking for in wake_wide() still doesn't make sense to me (mentioned
> below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
> <umgwanakikbuti@xxxxxxxxx> wrote:
>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>
>>> > That makes sense that we multiply the slave's flips by a factor because
>>> > it's low, but I still didn't get why the factor is chosen to be
>>> > llc_size instead of something else for the multiplication with slave
>>> > (slave * factor).
>>
>>> Yeah I don't know why llc_size was chosen...
>>
>> static void update_top_cache_domain(int cpu)
>> {
>>         struct sched_domain_shared *sds = NULL;
>>         struct sched_domain *sd;
>>         int id = cpu;
>>         int size = 1;
>>
>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>         if (sd) {
>>                 id = cpumask_first(sched_domain_span(sd));
>>                 size = cpumask_weight(sched_domain_span(sd));
>>                 sds = sd->shared;
>>         }
>>
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>         per_cpu(sd_llc_size, cpu) = size;
>>
>> The goal of wake_wide() was to approximate when pulling would be a futile
>> consolidation effort and counterproductive to scaling. 'course with
>> ever-increasing socket size, any 1:N waker is ever more likely to run
>> out of CPU for its one and only self (slamming into the scaling wall)
>> before it needs to turn its minions loose to conquer the world.
>
> Actually the original question was why we have the second condition
> as "master < slave * factor" instead of "master < factor". That's
> what didn't make sense to me. Why don't we return 0 from wake_wide()
> if master < factor?
>
> In fact, as the factor is set to the llc_size, I think the condition
> that makes sense to me is:
>
> if ((master + slave) < llc_size)
>         return 0;
>
> In other words, if the master flips and the slave flips in total are
> higher than the llc_size, then we are most likely waking up too many
> tasks as affine and should switch to wide to prevent overloading.
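>
> To make the comparison concrete, here is roughly how I read the current
> logic (paraphrased from memory of kernel/sched/fair.c, so treat it as a
> sketch rather than a literal copy), followed by the variant I have in mind:
>
> static int wake_wide(struct task_struct *p)
> {
>         unsigned int master = current->wakee_flips;
>         unsigned int slave = p->wakee_flips;
>         int factor = this_cpu_read(sd_llc_size);
>
>         if (master < slave)
>                 swap(master, slave);
>         /* current: stay affine unless both flip counts are "big enough" */
>         if (slave < factor || master < slave * factor)
>                 return 0;
>         return 1;
> }
>
> whereas the check I would have expected is simply:
>
>         /* proposed: go wide once the combined flips exceed the LLC size */
>         if ((master + slave) < factor)
>                 return 0;
>         return 1;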
>
> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> To explain the second condition above, Michael Wang said the following in [1]:
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."
>
> Again I didn't follow why the second condition couldn't just be
> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Michael Wang that I quoted, and why he is instead doing the whole
> multiplication thing there that I was talking about earlier:
> "factor * wakee->nr_wakee_switch".
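>
> Written out as a predicate (just restating the two quoted lines together
> with the alternatives I mean, not actual patch code), the comparison is:
>
> /* original check, second condition together with the first */
> wide = wakee->nr_wakee_switch > factor &&
>        waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch);
>
> /* the simpler forms I would have expected from that explanation */
> wide = waker->nr_wakee_switch > factor;
> /* or */
> wide = (waker->nr_wakee_switch + wakee->nr_wakee_switch) > factor;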
>
> Rephrasing my question another way: why are we taking the ratio of
> master/slave instead of the sum when comparing it against the factor?
> I am surely missing something here.
>
> Just taking an example:
>
> Say we have llc_size = 3 and three masters M1, M2 and M3. M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
> between all 3 masters. Also, to make it a bit more interesting, let s8
> wake up some random task T0. A diagram to show the master/slave
> relationships could look like:
>
>                                       +-----+
>                                       |     |
>   +------------------------+   +------+ M2  |
>   |                        |   |      |     |
>   |           M1           |   |      +--+--+--+---+
>   |                        |   |         |  |  |   |
>   |                        |   |         v  v  v   v
>   +--+--+--+--+--+--+--+---+---+         s9 s10 s11 s15
>      |  |  |  |  |  |  |   |
>      v  v  v  v  v  v  v   v
>      s1 s2 s3 s4 s5 s6 s7  s8 ---> T0
>                             ^
>                             |
>                           +-+---+
>                           |     |
>                           | M3  |
>                           |     |
>                           +--+--+-----
>                           |  |  |    |
>                           v  v  v    v
>                         s12 s13 s14 s16
>
>
> Let's consider the case of M1 waking up s8. As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the (slave < factor) condition would return FALSE,
> so we would then turn to the (master < slave * factor) condition. This
> would be TRUE (8 < 4 * 3), so wake_wide() would return 0 and s8 would
> be woken up affine with respect to M1's core.
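>
> Plugging those numbers into the two checks (purely illustrative, with the
> flip counts hard-coded from the example above):
>
>         unsigned int master = 8;        /* M1's flips */
>         unsigned int slave  = 4;        /* s8's flips */
>         int factor = 3;                 /* llc_size   */
>
>         /* master > slave already, so no swap happens */
>         if (slave < factor)             /* 4 < 3  -> false, keep going  */
>                 return 0;
>         if (master < slave * factor)    /* 8 < 12 -> true,  wake affine */
>                 return 0;
>         return 1;                       /* not reached with these numbers */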
>
> So basically, it seems the heuristic is saying (with the help of the
> second condition, master < slave * factor) that it is a good idea for
> s8 to be affine-woken-up with respect to M1's core. Why is it a good
> idea to do that? It seems to me M1 already has several tasks that it
> wakes as affine, so causing s8 to be woken up affine could be harmful,
> and it may be a better choice to wake it up elsewhere.
>
> Thanks for your help!
>
> -Joel
>
> [1] https://lkml.org/lkml/2013/7/4/20
>
>
>>
>> Something else to consider: network interrupt waking multiple workers
>> at high frequency. If the waking CPU is idle, do you really want to
>> place a worker directly in front of a tattoo artist, or is it better
>> off nearly anywhere but there?
>>
>> If the box is virtual, with no topology exposed (or real but ancient)
>> to let select_idle_sibling() come to the rescue, two workers can even
>> get tattooed simultaneously (see sync wakeup).
>>
>> -Mike