Re: wake_wide mechanism clarification
From: Brendan Jackman
Date: Thu Aug 10 2017 - 05:48:21 EST
On Wed, Aug 09 2017 at 21:22, Atish Patra wrote:
> On 08/03/2017 10:05 AM, Brendan Jackman wrote:
>>
>> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
>>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>>>
>>>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>>>> (slave * factor).
>>>>>>
>>>>>>> Yeah I don't know why llc_size was chosen...
>>>>>>
>>>>>> static void update_top_cache_domain(int cpu)
>>>>>> {
>>>>>> struct sched_domain_shared *sds = NULL;
>>>>>> struct sched_domain *sd;
>>>>>> int id = cpu;
>>>>>> int size = 1;
>>>>>>
>>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>>> if (sd) {
>>>>>> id = cpumask_first(sched_domain_span(sd));
>>>>>> size = cpumask_weight(sched_domain_span(sd));
>>>>>> sds = sd->shared;
>>>>>> }
>>>>>>
>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>>
>>>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>>>> consolidation effort and counterproductive to scaling. 'course with
>>>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>>>> before it needing to turn its minions loose to conquer the world.
>>>>>>
>>>>>> Something else to consider: network interrupt waking multiple workers
>>>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>>>> place a worker directly in front of a tattoo artist, or is it better
>>>>>> off nearly anywhere but there?
>>>>>>
>>>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>>>> get tattooed simultaneously (see sync wakeup).
>>>>>>
>>>>>
>>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
>>>>> big of a hammer, we probably need a middle ground. I'm messing with it right
>>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>>>>> see are not because we overload the cpu we're trying to pull to, but because
>>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd
>>>>> instead of doing the full "find the idlest cpu in the land!" thing.
>>>>
>>>> This is the problem I've been hitting lately. My use case is 1 task per
>>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>>>> task per CPU, they all do X amount of work then pthread_barrier_wait
>>>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>>>> those CPUs pull over the tasks that are still running:
>>>>
>>>> v CPU v ->time->
>>>>
>>>> -------------
>>>> 0 (big) 11111 /333
>>>> -------------
>>>> 1 (big) 22222 /444|
>>>> -------------
>>>> 2 (LITTLE) 333333/
>>>> -------------
>>>> 3 (LITTLE) 444444/
>>>> -------------
>>>>
>>>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>>>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>>>> regardless of wake_affine, want_affine means that we'll only look in
>>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>>>> bigs until the next load balance, something like this:
>>>>
>>>> v CPU v ->time->
>>>>
>>>> ------------------------
>>>> 0 (big) 11111 /333 31313\33333
>>>> ------------------------
>>>> 1 (big) 22222 /444|424\4444444
>>>> ------------------------
>>>> 2 (LITTLE) 333333/ \222222
>>>> ------------------------
>>>> 3 (LITTLE) 444444/ \1111
>>>> ------------------------
>>>> ^^^
>>>> underutilization
>>>>
>>>>> I _think_
>>>>> the answer is to make select_idle_sibling() try less hard to find something
>>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to
>>>>> the full load balance esque search.
>>>>
>>>> So this idea of allowing select_idle_sibling to fail, and falling back
>>>> to the slow path, would help me too, I think.
>>>
>>> Unfortunately this statement of mine was wrong, I had it in my head that we
>>> would fall back to a find the idlest cpu thing provided we failed to wake
>>> affine, but we just do select_idle_sibling() and expect the load balancer to
>>> move things around as needed.
>>
>> Ah yes, when wake_affine() returns false, we still do
>> select_idle_sibling (except in prev_cpu's sd_llc instead of
>> smp_processor_id()'s), and that is the problem faced by my workload. I
>> thought you were suggesting to change the flow so that
>> select_idle_sibling can say "I didn't find any idle siblings - go to the
>> find_idlest_group path".
>>
>>>> This is also why I was playing with your
>>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>>>> since it prevents want_affine for tasks 3 and 4 (which were recently
>>>> moved by an active balance).
>>>>
>>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>>> (also linked elsewhere in this thread)
>>>>
>>>
>>> Would you try peter's sched/experimental branch and see how that affects your
>>> workload? I'm still messing with my patches and I may drop this one as it now
>>> appears to be too aggressive with the new set of patches. Thanks,
>>
>> Sure, I'll take a look at those, thanks. I guess the idea of caching
>> values in LB and then using them in wakeup[2] is a lighter-handed way of
>> achieving the same thing as last_balance_ts? It won't solve my problem
>> directly since we'll still only look in sd_llc, but I think it could be
>> a basis for a way to say "go find_idlest_group path on these tasks" at
>> the beginning of select_task_rq_fair.
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
>>
>> Thanks,
>> Brendan
>>
>
> Would it be better if it searches for idle cpus in next higher domain
> (if that is not NUMA) instead of doing find_idlest_group ?
>
> The above approach [2] will still try to search a idle cpu in the llc
> domain of new_cpu. If it finds one great. Otherwise, let's search for an
> idle cpu in next higher domain(if that is not NUMA) excluding the
> current domain which we have already searched. I think it would help in
> cases where LLC < NUMA.
Well for my use case, "search for idle cpus in the next higher domain"
and "go to find_busiest_group & co" mean the same thing. But for NUMA
boxes (where I'm guessing there are >=3 sched_domains?) then yeah that
sounds like a good idea at least to my inexpert ears. Currently it's
either "only look at siblings of either prev_cpu or smp_processor_id()"
or "look through the whole system (or at least as far as sd_flag takes
us)", so maybe you want middle ground?
On the other hand, I'm guessing remote wakeups across NUMA nodes are
expensive, so that's why we only look at siblings? There's already a
choice between prev_cpu's and smp_processor_id()'s LLC siblings.