Re: wake_wide mechanism clarification

From: Joel Fernandes
Date: Sat Jul 29 2017 - 04:01:52 EST

Next message: Masami Hiramatsu: "[PATCH] [BUGFIX] gpio: reject invalid gpio before getting gpio_desc"
Previous message: SZ Lin: "[PATCH 0/5] tpm: tpm_ibmvtpm: - style fix"
Next in thread: Joel Fernandes: "Re: wake_wide mechanism clarification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Mike,

I have take spent some time understanding the email thread and
previous discussions. Unfortunately the second condition we are
checking for in the wake_wide still didn't make sense to me (mentioned
below) :-(

On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
<umgwanakikbuti@xxxxxxxxx> wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>
>> > That makes sense that we multiply slave's flips by a factor because
>> > its low, but I still didn't get why the factor is chosen to be
>> > llc_size instead of something else for the multiplication with slave
>> > (slave * factor).
>
>> Yeah I don't know why llc_size was chosen...
>
> static void update_top_cache_domain(int cpu)
> {
> struct sched_domain_shared *sds = NULL;
> struct sched_domain *sd;
> int id = cpu;
> int size = 1;
>
> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> sds = sd->shared;
> }
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
>
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling. 'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into scaling wall)
> before it needing to turn its minions loose to conquer the world.

Actually the original question was why do we have the second condition
as "master < slave * factor", instead of "master < factor". that's
what didn't make sense to me. Why don't we return 0 from wake_wide if
master < factor ?

Infact, as the factor is set to the llc_size, I think the condition
that makes sense to me is:

if ((master + slave) < llc_size)
return 0;

In other words, if the master flips and the slave flips are totally
higher than the llc_size, then we are most likely waking up too many
tasks as affine and should then switch to wide to prevent overloading.

Digging further into the original patch from Michael Wang (I also CC'd
him), this was the code (before you had changed it to master/slave):

wakee->nr_wakee_switch > factor &&
waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

To explain the second condition above, Michael Wang said the following in [1]

"Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
tasks rely on it, then waker's higher latency will damage all of them, pull
wakee seems to be a bad deal."

Again I didn't follow why the second condition couldn't just be:
waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
wakee->nr_wakee_switch) > factor, based on the above explanation from
Micheal Wang that I quoted.
and why he's instead doing the whole multiplication thing there that I
was talking about earlier: "factor * wakee->nr_wakee_switch".

Rephrasing my question in another way, why are we talking the ratio of
master/slave instead of the sum when comparing if its > factor? I am
surely missing something here.

Just taking an example:

Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
between all 3 masters. Also to make it a bit more interesting, let s8
wake up some random task T0. A diagram to show the master/slave
relation ships could look like:

+-----+
| |
+------------------------+ +------+ M2 |
| | | | |
| M1 | | +--+--+----
| | | | | | |
| | | | | | s15
+--+--+--+--+--+--+---+--+---+ v v v
| | | | | | | | s9 s10 s11
v v v v v v v v
s1 s2 s3 s4 s5 s6 s7 s8 ---> T0
^
|
+-+---+
| |
| M3 |
| |
+--+--+-----
| | | |
v v v v
s12 s13 s14 s16

Lets consider the case of M1 waking up s8. As per the above diagram,
M1 has 8 flips and s8 has 4 flips.

With llc_size = 3, the condition

(slave < factor) would return FALSE, so then we would turn to the
(master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
so wake_wide would return 0 and would cause s8 to be woken up as
affine with relation to M1's core.

So basically, it seems the heuristic is saying (with help of the
second condition - master < slave * factor). that Its a good idea for
s8 to be affine-woken-up with respect to M1's core. Why is it a good
idea to do that? It seems to me M1 has already several tasks its
waking as affine so causing s8 to be woken up affine could be harmful
and it may be a better choice to wake it up elsewhere.

Thanks for your help!

-Joel

[1] https://lkml.org/lkml/2013/7/4/20

>
> Something else to consider: network interrupt waking multiple workers
> at high frequency. If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
>
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup).
>
> -Mike

Next message: Masami Hiramatsu: "[PATCH] [BUGFIX] gpio: reject invalid gpio before getting gpio_desc"
Previous message: SZ Lin: "[PATCH 0/5] tpm: tpm_ibmvtpm: - style fix"
Next in thread: Joel Fernandes: "Re: wake_wide mechanism clarification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]