Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

From: Brendan Jackman
Date: Tue Sep 19 2017 - 06:05:24 EST


On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
> Hi Brendan,
Hi Joel,

Thanks for taking a look :)

> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
> <brendan.jackman@xxxxxxx> wrote:
>> This patch adds a parameter to select_task_rq, sibling_count_hint
>> allowing the caller, where it has this information, to inform the
>> sched_class the number of tasks that are being woken up as part of
>> the same event.
>>
>> The wake_q mechanism is one case where this information is available.
>>
>> select_task_rq_fair can then use the information to detect that it
>> needs to widen the search space for task placement in order to avoid
>> overloading the last-level cache domain's CPUs.
>>
>> * * *
>>
>> The reason I am investigating this change is the following use case
>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>> all repeatedly do X amount of work then
>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>> finish faster, and then those CPUs pull over the tasks that are still
>> running:
>>
>> v CPU v ->time->
>>
>> -------------
>> 0 (big) 11111 /333
>> -------------
>> 1 (big) 22222 /444|
>> -------------
>> 2 (LITTLE) 333333/
>> -------------
>> 3 (LITTLE) 444444/
>> -------------
>
> This is because of the newly idle balancing right?

Probably not: before CPUs 0 and 1 will pull tasks, their utilization
(i.e. tasks 1 and 2's blocked load) needs to decay enough that they have
more spare capacity than CPUs 2 and 3 (with imbalance_pct etc.), so it's
more likely to be a nohz idle balance. We also need need_active_balance()
to be satisfied (assuming tasks 3 and 4 are running constantly) - in
Android this is hastened by a special case [1]; in mainline it takes longer.

[1] https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/fair.c#8865
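
As a rough illustration of how long that decay takes (a standalone
sketch, not kernel code - the ~32ms PELT half-life is real, but the
utilization figures below are made-up assumptions):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double util = 900.0;       /* leftover blocked utilization on the big CPU */
	double target = 430.0;     /* point where the big starts to look emptier than a busy LITTLE */
	double halflife_ms = 32.0; /* PELT roughly halves blocked load every 32ms */

	/* util * 0.5^(t / halflife) == target  =>  t = halflife * log2(util / target) */
	printf("~%.0f ms of decay before the balancer sees enough spare capacity\n",
	       halflife_ms * log2(util / target));
	return 0;
}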

>>
>> Now when task 4 hits the barrier (at |) and wakes the others up,
>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. want_affine therefore means that we'll only look
>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>> on the bigs until the next load balance, something like this:
>>
>> v CPU v ->time->
>>
>> ------------------------
>> 0 (big) 11111 /333 31313\33333
>> ------------------------
>> 1 (big) 22222 /444|424\4444444
>> ------------------------
>> 2 (LITTLE) 333333/ \222222
>> ------------------------
>> 3 (LITTLE) 444444/ \1111
>> ------------------------
>> ^^^
>> underutilization
>
> I wonder if this problem can be better solved by not having the first
> newly-idle-balance pull happen based on some condition, only to be
> corrected later by another balancing act?

Well, the fact that CPUs 0 and 1 pull tasks 3 and 4 is desirable - we
want to use the big CPUs as much as possible for these always-running
tasks. Moving load back to the LITTLE CPUs is not "correcting" a
previous mistake, it's responding to a change in conditions (i.e. we now
have more runnable tasks than big CPUs), and this patch is about making
that response happen ASAP.

>>
>> So, I'm trying to get want_affine = 0 for these tasks.
>>
>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>> us here because which task is waker and which tasks are wakees
>> generally changes with each iteration.
>>
>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>> nice property that we know exactly how many tasks are being woken, so
>> we can cheat.
>>
>> It might be a disadvantage that we "widen" _every_ task that's woken in
>> an event, while select_idle_sibling would work fine for the first
>> sd_llc_size - 1 tasks.
>>
>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>> necessary on SMP systems, so it might be best guarded by the presence
>
> Actually wake_affine doesn't check for balance if previous/next cpu
> are within the same shared cache domain. The difference is some time
> ago it would return true for shared cache but now it returns false as
> of 4.14-rc1:
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>
> Since it would return false in the above wake up cases for task 1 and
> 2, it would then run select_idle_sibling on the previous CPU which is
> also within the big cluster, so I don't think it will make a
> difference in this case... Infact what it returns probably doesn't
> matter.

So my paragraph here was making a leap in reasoning, let me try to fill
the gap: on SMP these tasks never need to move around. If by some chance
they did get coscheduled, the first load balance would spread them out,
and from then on prev_cpu is the sensible choice every time they wake
up. So it will look something like:

v CPU v ->time->

        -------------
      { 0 (SAME)  11111111111
cache { -------------
      { 1 (SAME)  222222222222|
        -------------
      { 2 (SAME)  33333333333
cache { -------------
      { 3 (SAME)  44444444444
        -------------

So here, task 2 wakes up the other guys, and when it's waking tasks 3
and 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC it's
basically wake_affine's job to decide between prev_cpu and
smp_processor_id(). So "if wake_affine is behaving correctly", the
problem that this patch aims to solve (i.e. the fact that we overload
the waker's LLC domain because of the bias towards prev_cpu) does not
arise on SMP.
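
(For reference, since the fair.c hunk is snipped in the quoted patch,
the big.LITTLE side of the fix boils down to something like the sketch
below - take it as an illustration of the idea rather than the exact
hunk: if the wake_q tells us we're waking at least an LLC's worth of
siblings, wake_wide() returns 1, want_affine becomes 0, and placement
goes through the wider find_idlest_cpu() path.)

static int wake_wide(struct task_struct *p, int sibling_count_hint)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size);

	/*
	 * If this wakeup event is waking a whole LLC's worth of tasks
	 * (or more), skip the wakee_flips heuristic and go wide now.
	 */
	if (sibling_count_hint >= factor)
		return 1;

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}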

>> of SD_ASYM_CPUCAPACITY?
>>
>> * * *
>>
>> Final note..
>>
>> In order to observe "perfect" behaviour for this use case, I also had
>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>> above we are working through the work queue and have placed tasks 3
>> and 2, and are about to place task 1:
>>
>> v CPU v ->time->
>>
>> --------------
>> 0 (big) 11111 /333 3
>> --------------
>> 1 (big) 22222 /444|4
>> --------------
>> 2 (LITTLE) 333333/ 2
>> --------------
>> 3 (LITTLE) 444444/ <- Task 1 should go here
>> --------------
>>
>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>> 2. So we are likely to also place task 1 on cpu 2. Disabling
>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>> contention, and I guess that this contention is less of an issue on
>> big.LITTLE systems since they have relatively few CPUs, which
>> suggests the trade-off makes sense here.
>
> Yeah, OTOH probably once those IPIs are sent, the select path should I
> feel be smart enough to factor those pending enqueues, not sure how
> hard or meaningful it is to implement something like that..

Yeah, that should be possible - in fact I've just noticed that
find_idlest_cpu() already checks idle_cpu(), which takes pending
enqueues (rq->wake_list) into account, and I think find_idlest_cpu() is
what should be getting used in the case I'm talking about here... I must
have been doing something wrong when testing.
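
For reference, idle_cpu() (kernel/sched/core.c) looks roughly like this
- quoting from memory, so treat it as a sketch - and the llist_empty()
check on rq->wake_list is what covers the pending TTWU_QUEUE enqueues:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	/*
	 * A remote wakeup queued via TTWU_QUEUE but not yet enqueued
	 * still makes the CPU non-idle for placement purposes.
	 */
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}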

Cheers,
Brendan

>>
>> Signed-off-by: Brendan Jackman <brendan.jackman@xxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
>> Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
>> Cc: Mike Galbraith <efault@xxxxxx>
>> Cc: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
>> ---
>> include/linux/sched/wake_q.h | 2 ++
>> kernel/sched/core.c | 34 +++++++++++++++++++++++-----------
>> kernel/sched/deadline.c | 3 ++-
>> kernel/sched/fair.c | 17 +++++++++++------
>> kernel/sched/idle_task.c | 3 ++-
>> kernel/sched/rt.c | 3 ++-
>> kernel/sched/sched.h | 3 ++-
>> kernel/sched/stop_task.c | 3 ++-
>> 8 files changed, 46 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index d03d8a9047dc..607a888eb35b 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -33,6 +33,7 @@
>> struct wake_q_head {
>> struct wake_q_node *first;
>> struct wake_q_node **lastp;
>> + int count;
>> };
>>
>> #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
>> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>> {
>> head->first = WAKE_Q_TAIL;
>> head->lastp = &head->first;
>> + head->count = 0;
>> }
>>
>> extern void wake_q_add(struct wake_q_head *head,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0869b20fba81..ddf9257b1467 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>> if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>> return;
>>
>> + head->count++;
>> +
>> get_task_struct(task);
>>
>> /*
>> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>> head->lastp = &node->next;
>> }
>>
>> +static int
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> + int sibling_count_hint);
>> +
>> void wake_up_q(struct wake_q_head *head)
>> {
>> struct wake_q_node *node = head->first;
>> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>> task->wake_q.next = NULL;
>>
>> /*
>> - * wake_up_process() implies a wmb() to pair with the queueing
>> + * try_to_wake_up() implies a wmb() to pair with the queueing
>> * in wake_q_add() so as not to miss wakeups.
>> */
>> - wake_up_process(task);
>> + try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>> put_task_struct(task);
>
> Shouldn't you reset head->count after all the tasks have been woken up?
>
>> }
>> }
>> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>> * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>> */
>> static inline
>> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
>> + int sibling_count_hint)
>
> This variable seems a bit long to me, how about just sibling_count?
>
>> {
>> lockdep_assert_held(&p->pi_lock);
>>
>> if (p->nr_cpus_allowed > 1)
>> - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
>> + cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
>> + sibling_count_hint);
>
> same.
>
> <snip>
>
>>
>> static int
>> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
>> + int sibling_count_hint)
>> {
>> struct task_struct *curr;
>> struct rq *rq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index eeef1a3086d1..56ae525618e9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1419,7 +1419,8 @@ struct sched_class {
>> void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>>
>> #ifdef CONFIG_SMP
>> - int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>> + int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
>> + int subling_count_hint);
>
> s/subling/sibling/
>
>
> thanks,
>
> - Joel