Re: wake_wide mechanism clarification

From: Atish Patra
Date: Wed Aug 09 2017 - 17:23:27 EST

Next message: H.J. Lu: "Re: new ELF marking"
Previous message: Hisashi T Fujinaka: "Re: [Intel-wired-lan] [regression] wake on lan no longer works in 4.13-rc3. was Re: Linux 4.13: Reported regressions as of Sunday, 2017-08-06"
Next in thread: Brendan Jackman: "Re: wake_wide mechanism clarification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 08/03/2017 10:05 AM, Brendan Jackman wrote:

On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:

On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:

Hi,

On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:

On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:

On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:

On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:

That makes sense that we multiply slave's flips by a factor because
its low, but I still didn't get why the factor is chosen to be
llc_size instead of something else for the multiplication with slave
(slave * factor).

Yeah I don't know why llc_size was chosen...

static void update_top_cache_domain(int cpu)
{
struct sched_domain_shared *sds = NULL;
struct sched_domain *sd;
int id = cpu;
int size = 1;

sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
sds = sd->shared;
}

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;

The goal of wake wide was to approximate when pulling would be a futile
consolidation effort and counterproductive to scaling. 'course with
ever increasing socket size, any 1:N waker is ever more likely to run
out of CPU for its one and only self (slamming into scaling wall)
before it needing to turn its minions loose to conquer the world.

Something else to consider: network interrupt waking multiple workers
at high frequency. If the waking CPU is idle, do you really want to
place a worker directly in front of a tattoo artist, or is it better
off nearly anywhere but there?

If the box is virtual, with no topology exposed (or real but ancient)
to let select_idle_sibling() come to the rescue, two workers can even
get tattooed simultaneously (see sync wakeup).

Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
big of a hammer, we probably need a middle ground. I'm messing with it right
now so it's too early to say for sure, but i _suspect_ the bigger latencies we
see are not because we overload the cpu we're trying to pull to, but because
when we fail to do the wake_affine() we only look at siblings of the affine_sd
instead of doing the full "find the idlest cpu in the land!" thing.

This is the problem I've been hitting lately. My use case is 1 task per
CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
task per CPU, they all do X amount of work then pthread_barrier_wait
(i.e. sleep until the last task finishes its X and hits the barrier). On
big.LITTLE, the tasks which get a "big" CPU finish faster, and then
those CPUs pull over the tasks that are still running:

v CPU v ->time->

-------------
0 (big) 11111 /333
-------------
1 (big) 22222 /444|
-------------
2 (LITTLE) 333333/
-------------
3 (LITTLE) 444444/
-------------

Now when task 4 hits the barrier (at |) and wakes the others up, there
are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
regardless of wake_affine, want_affine means that we'll only look in
sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
bigs until the next load balance, something like this:

v CPU v ->time->

------------------------
0 (big) 11111 /333 31313\33333
------------------------
1 (big) 22222 /444|424\4444444
------------------------
2 (LITTLE) 333333/ \222222
------------------------
3 (LITTLE) 444444/ \1111
------------------------
^^^
underutilization

I _think_
the answer is to make select_idle_sibling() try less hard to find something
workable and only use obviously idle cpu's in the affine sd, and fall back to
the full load balance esque search.

So this idea of allowing select_idle_sibling to fail, and falling back
to the slow path, would help me too, I think.

Unfortunately this statement of mine was wrong, I had it in my head that we
would fall back to a find the idlest cpu thing provided we failed to wake
affine, but we just do select_idle_sibling() and expect the load balancer to
move things around as needed.

Ah yes, when wake_affine() returns false, we still do
select_idle_sibling (except in prev_cpu's sd_llc instead of
smp_processor_id()'s), and that is the problem faced by my workload. I
thought you were suggesting to change the flow so that
select_idle_sibling can say "I didn't find any idle siblings - go to the
find_idlest_group path".

This is also why I was playing with your
don't-affine-recently-balanced-tasks patch[1], which also helps my case
since it prevents want_affine for tasks 3 and 4 (which were recently
moved by an active balance).

[1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
(also linked elsewhere in this thread)

Would you try peter's sched/experimental branch and see how that affects your
workload? I'm still messing with my patches and I may drop this one as it now
appears to be too aggressive with the new set of patches. Thanks,

Sure, I'll take a look at those, thanks. I guess the idea of caching
values in LB and then using them in wakeup[2] is a lighter-handed way of
achieving the same thing as last_balance_ts? It won't solve my problem
directly since we'll still only look in sd_llc, but I think it could be
a basis for a way to say "go find_idlest_group path on these tasks" at
the beginning of select_task_rq_fair.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef

Thanks,
Brendan

Would it be better if it searches for idle cpus in next higher domain (if that is not NUMA) instead of doing find_idlest_group ?

The above approach [2] will still try to search a idle cpu in the llc domain of new_cpu. If it finds one great. Otherwise, let's search for an idle cpu in next higher domain(if that is not NUMA) excluding the current domain which we have already searched. I think it would help in cases where LLC < NUMA.

Regards,
Atish

Next message: H.J. Lu: "Re: new ELF marking"
Previous message: Hisashi T Fujinaka: "Re: [Intel-wired-lan] [regression] wake on lan no longer works in 4.13-rc3. was Re: Linux 4.13: Reported regressions as of Sunday, 2017-08-06"
Next in thread: Brendan Jackman: "Re: wake_wide mechanism clarification"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]