Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

From: Joel Fernandes
Date: Wed Sep 20 2017 - 00:40:22 EST


On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
<brendan.jackman@xxxxxxx> wrote:
> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>> Hi Brendan,
> Hi Joel,
>
> Thanks for taking a look :)
>
>> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
>> <brendan.jackman@xxxxxxx> wrote:
>>> This patch adds a parameter to select_task_rq, sibling_count_hint
>>> allowing the caller, where it has this information, to inform the
>>> sched_class the number of tasks that are being woken up as part of
>>> the same event.
>>>
>>> The wake_q mechanism is one case where this information is available.
>>>
>>> select_task_rq_fair can then use the information to detect that it
>>> needs to widen the search space for task placement in order to avoid
>>> overloading the last-level cache domain's CPUs.
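
(For reference, the interface change amounts to threading one extra
parameter from the wakeup path into the sched_class, roughly:

    static inline int
    select_task_rq(struct task_struct *p, int cpu, int sd_flags,
                   int wake_flags, int sibling_count_hint);

i.e. the existing select_task_rq() signature with the hint appended,
ordinary wakeups passing 1. This is a sketch of the RFC's intent, not
a verbatim quote of the patch.)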
>>>
>>> * * *
>>>
>>> The reason I am investigating this change is the following use case
>>> on ARM big.LITTLE (asymmetric CPU capacity): one task per CPU, all of
>>> which repeatedly do X amount of work and then call
>>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>>> finish faster, and then those CPUs pull over the tasks that are still
>>> running:
>>>
>>> v CPU v ->time->
>>>
>>> -------------
>>> 0 (big) 11111 /333
>>> -------------
>>> 1 (big) 22222 /444|
>>> -------------
>>> 2 (LITTLE) 333333/
>>> -------------
>>> 3 (LITTLE) 444444/
>>> -------------
>>
>> This is because of the newly idle balancing right?
>
> Probably not: before CPUs 0 and 1 can pull tasks, their utilization
> (tasks 1 and 2's blocked load) needs to decay enough that they have more
> spare capacity than CPUs 2 and 3 (modulo imbalance_pct etc.), so it's
> more likely to be a nohz idle balance. We also need need_active_balance
> to be satisfied (assuming tasks 3 and 4 are running constantly) - in
> Android this is hastened by a special case [1]; in mainline it takes
> longer.
>
> [1] https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/fair.c#8865

Ok, I agree.
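
(For anyone wanting to reproduce the workload from the quoted commit
message: a minimal user-space sketch is something like the following.
Thread count and work sizes here are illustrative, not taken from
Brendan's actual test.

    /* gcc -O2 barrier_test.c -o barrier_test -lpthread */
    #include <pthread.h>

    #define NR_TASKS   4            /* one task per CPU */
    #define ITERATIONS 100
    #define WORK       100000000UL  /* "X amount of work" */

    static pthread_barrier_t barrier;

    static void *task(void *arg)
    {
            volatile unsigned long sink = 0;
            unsigned long j;
            int i;

            (void)arg;
            for (i = 0; i < ITERATIONS; i++) {
                    for (j = 0; j < WORK; j++)
                            sink += j;              /* busy work */
                    /* Sleep until the last task hits the barrier. */
                    pthread_barrier_wait(&barrier);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NR_TASKS];
            int i;

            pthread_barrier_init(&barrier, NULL, NR_TASKS);
            for (i = 0; i < NR_TASKS; i++)
                    pthread_create(&threads[i], NULL, task, NULL);
            for (i = 0; i < NR_TASKS; i++)
                    pthread_join(threads[i], NULL);
            pthread_barrier_destroy(&barrier);
            return 0;
    }

The big CPUs finish the WORK loop first and sleep in
pthread_barrier_wait(); the placement of the wakees when the barrier
releases is what the patch is about.)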

>
>>>
>>> Now when task 4 hits the barrier (at |) and wakes the others up,
>>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>>> prev_cpu=<little>. want_affine therefore means that we'll only look
>>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>>> on the bigs until the next load balance, something like this:
>>>
>>> v CPU v ->time->
>>>
>>> ------------------------
>>> 0 (big) 11111 /333 31313\33333
>>> ------------------------
>>> 1 (big) 22222 /444|424\4444444
>>> ------------------------
>>> 2 (LITTLE) 333333/ \222222
>>> ------------------------
>>> 3 (LITTLE) 444444/ \1111
>>> ------------------------
>>> ^^^
>>> underutilization
>>
>> I wonder if this problem could be better solved by preventing the first
>> newly-idle-balance pull from happening (based on some condition), rather
>> than correcting it later with another balancing act?
>
> Well the fact that CPUs 0 and 1 pull 3 and 4 is desirable - we want to
> use the big CPUs as much as possible for these always-running
> tasks. Moving load back to the LITTLE CPUs is not "correcting" a
> previous mistake, it's responding to a change in conditions (i.e. we
> have more runnable tasks than big CPUs now) and this patch is about
> trying to do that response ASAP.

Agreed.

>>>
>>> So, I'm trying to get want_affine = 0 for these tasks.
>>>
>>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>>> us here because which task is waker and which tasks are wakees
>>> generally changes with each iteration.
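
(For reference, mainline wake_wide() around this time keys entirely
off wakee_flips - quoted from memory of v4.14, so treat as a sketch:

    static int wake_wide(struct task_struct *p)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int factor = this_cpu_read(sd_llc_size);

            if (master < slave)
                    swap(master, slave);
            if (slave < factor || master < slave * factor)
                    return 0;
            return 1;
    }

With the waker/wakee roles rotating every iteration, neither side
accumulates enough flips for this heuristic to fire.)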
>>>
>>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>>> nice property that we know exactly how many tasks are being woken, so
>>> we can cheat.
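
(Concretely, wake_up_q() already visits every wakee, so the count can
be gathered there and passed down towards select_task_rq(). A minimal
sketch - the RFC's actual plumbing may differ, and
wake_up_process_hint() is a hypothetical helper:

    void wake_up_q(struct wake_q_head *head)
    {
            struct wake_q_node *node;
            int count = 0;

            /* First pass: how many tasks are about to be woken? */
            for (node = head->first; node != WAKE_Q_TAIL; node = node->next)
                    count++;

            for (node = head->first; node != WAKE_Q_TAIL; ) {
                    struct task_struct *task;

                    task = container_of(node, struct task_struct, wake_q);
                    /* Grab ->next before allowing re-insertion. */
                    node = node->next;
                    task->wake_q.next = NULL;

                    /* Forward the count as sibling_count_hint. */
                    wake_up_process_hint(task, count);
                    put_task_struct(task);
            }
    }

Every wakee then gets the same hint, which is what makes the "widen
_every_ task" caveat below possible.)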
>>>
>>> It might be a disadvantage that we "widen" _every_ task that's woken in
>>> an event, while select_idle_sibling would work fine for the first
>>> sd_llc_size - 1 tasks.
>>>
>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>> necessary on SMP systems, so it might be best guarded by the presence
>>
>> Actually wake_affine doesn't check for balance if the previous and
>> waking CPUs are within the same shared cache domain. The difference is
>> that some time ago it would return true for the shared-cache case, but
>> as of 4.14-rc1 it returns false:
>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>
>> Since it would return false in the above wakeup cases for tasks 1 and
>> 2, we would then run select_idle_sibling on the previous CPU, which is
>> also within the big cluster, so I don't think it will make a
>> difference in this case... In fact, what it returns probably doesn't
>> matter.
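
(I.e. the shape of the v4.14-rc1 logic at that link is roughly the
following - paraphrased from memory, not quoted verbatim:

    affine = false;
    if (!cpus_share_cache(prev_cpu, this_cpu))
            affine = /* compare load across the two LLC domains */;

so for a shared-cache wakeup the return value is a foregone conclusion
and select_idle_sibling() does the real work.)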
>
> So my paragraph here was making a leap in reasoning, let me try to fill
> the gap: On SMP these tasks never need to move around. If by some chance
> they did get coscheduled, the first load balance would spread them out and
> then every time they wake up from then on, prev_cpu is the sensible
> choice. So it will look something like:
>
> v CPU v ->time->
>
> -------------
> { 0 (SAME) 11111111111
> cache { -------------
> { 1 (SAME) 222222222222|
> -------------
> { 2 (SAME) 33333333333
> cache { -------------
> { 3 (SAME) 44444444444
> -------------
>
> So here, task 2 wakes up the other guys, and when it's waking tasks 3
> and 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC it's
> basically wake_affine's job to decide between prev_cpu and
> smp_processor_id(). So "if wake_affine is behaving correctly" the
> problem that this patch aims to solve (i.e. the fact that we overload
> the waker's LLC domain because of bias towards prev_cpu) does not arise
> on SMP.
>

Yes, on SMP - but your patch is solving a problem for non-SMP
(asymmetric-capacity) systems, so I feel your original statement about
wake_affine solving any problem for SMP is not relevant :-P. I guess
you can just kill this paragraph from the commit message to prevent
confusion.

>>> of SD_ASYM_CPUCAPACITY?
>>>
>>> * * *
>>>
>>> Final note..
>>>
>>> In order to observe "perfect" behaviour for this use case, I also had
>>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>>> above we are working through the work queue and have placed tasks 3
>>> and 2, and are about to place task 1:
>>>
>>> v CPU v ->time->
>>>
>>> --------------
>>> 0 (big) 11111 /333 3
>>> --------------
>>> 1 (big) 22222 /444|4
>>> --------------
>>> 2 (LITTLE) 333333/ 2
>>> --------------
>>> 3 (LITTLE) 444444/ <- Task 1 should go here
>>> --------------
>>>
>>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>>> 2. So we are likely to also place task 1 on CPU 2. Disabling
>>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>>> contention, and I guess that this contention is less of an issue on
>>> big.LITTLE systems since they have relatively few CPUs, which
>>> suggests the trade-off makes sense here.
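
(For reference, the mechanism in question - from memory of the
v4.14-era kernel/sched/core.c, so treat as a sketch:

    static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
    {
            struct rq *rq = cpu_rq(cpu);
            struct rq_flags rf;

    #if defined(CONFIG_SMP)
            if (sched_feat(TTWU_QUEUE) &&
                !cpus_share_cache(smp_processor_id(), cpu)) {
                    sched_clock_cpu(cpu);   /* Sync clocks across CPUs */
                    /* Defer the enqueue to the remote CPU via an IPI. */
                    ttwu_queue_remote(p, cpu, wake_flags);
                    return;
            }
    #endif

            rq_lock(rq, &rf);
            update_rq_clock(rq);
            ttwu_do_activate(rq, p, wake_flags, &rf);
            rq_unlock(rq, &rf);
    }

With the feature enabled, a cross-LLC wakee's load is invisible on the
target rq until the IPI is handled - exactly the window described
above.)
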
>>
>> Yeah, OTOH, once those IPIs are sent, I feel the select path should be
>> smart enough to factor in those pending enqueues; not sure how hard or
>> meaningful it is to implement something like that...
>
> Yeah, that should be possible - in fact I've just noticed that
> find_idlest_cpu already checks idle_cpu(), which takes pending enqueues
> (rq->wake_list) into account, and I think find_idlest_cpu should be used
> in the case I'm talking about here... I must have been doing something
> wrong when testing.

Oh ok!
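
(For reference, idle_cpu() does consider the pending-wakeup list -
again from memory of the v4.14-era code, so treat as a sketch:

    int idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (rq->curr != rq->idle)
                    return 0;

            if (rq->nr_running)
                    return 0;

    #ifdef CONFIG_SMP
            /* Wakeups queued via TTWU_QUEUE but not yet enqueued. */
            if (!llist_empty(&rq->wake_list))
                    return 0;
    #endif

            return 1;
    }

so a placement path that goes through find_idlest_cpu() would skip a
CPU with a pending remote wakeup.)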

thanks,

- Joel

[..]