Re: [RFC PATCH 1/1] sched: do active load balance in balance callback

From: Yafang Shao
Date: Mon Jul 19 2021 - 08:12:29 EST


On Wed, Jul 14, 2021 at 10:23 PM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> On 11/07/2021 09:40, Yafang Shao wrote:
> > Active load balance, which migrates the CFS task running on the
> > busiest CPU to the newly idle CPU, has a known issue[1][2]: there is
> > a race window between waking up the migration thread on the busiest
> > CPU and the point where it begins to preempt the currently running
> > CFS task. This race window may cause unexpected behavior: a
> > latency-sensitive RT task may be preempted by the migration thread,
> > because the migration thread has a higher priority.
> >
> > This RFC patch tries to improve this situation. Instead of waking up the
> > migration thread to do this work, this patch does it in the balance
> > callback, as follows:
> >
> >   The new idle CPUm                 The target CPUn
> >   find the target task A            CFS task A is running
> >   queue it into the target CPUn     A is scheduling out
> >                                     do balance callback and migrate A to CPUm
> >
> > It avoids two context switches - task A to migration/n and migration/n to
> > task B. And it avoids preempting the RT task if the RT task has already
> > preempted task A before we do the queueing.
> >
> > TODO:
> > - I haven't run any benchmarks yet to measure the impact on performance
> > - To avoid deadlock I have to unlock the busiest_rq->lock before
> > calling attach_one_task() and lock it again after executing
> > attach_one_task(). That may re-introduce the issue addressed by
> > commit 565790d28b1e ("sched: Fix balance_callback()")
> >
> > [1]. https://lore.kernel.org/lkml/CAKfTPtBygNcVewbb0GQOP5xxO96am3YeTZNP5dK9BxKHJJAL-g@xxxxxxxxxxxxxx/
> > [2]. https://lore.kernel.org/lkml/20210615121551.31138-1-laoar.shao@xxxxxxxxx/
>
> This didn't apply for me and I guess won't compile on tip/sched/core:
>
> raw_spin_{,un}lock(&busiest_rq->lock) -> raw_spin_rq_{,un}lock(busiest_rq)
>
> p->state == TASK_RUNNING -> p->__state or task_is_running(p)
>

I based this patch on Linus's tree. I will rebase it onto tip/sched/core.

> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > ---
> > kernel/sched/core.c | 1 +
> > kernel/sched/fair.c | 69 ++++++++++++++------------------------------
> > kernel/sched/sched.h | 6 +++-
> > 3 files changed, 28 insertions(+), 48 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ca80df205ce..a0a90a37e746 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8208,6 +8208,7 @@ void __init sched_init(void)
> > rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
> > rq->balance_callback = &balance_push_callback;
> > rq->active_balance = 0;
> > + rq->active_balance_target = NULL;
> > rq->next_balance = jiffies;
> > rq->push_cpu = 0;
> > rq->cpu = i;
>
> [...]
>
> > +DEFINE_PER_CPU(struct callback_head, active_balance_head);
> > +
> > /*
> > * Check this_cpu to ensure it is balanced within domain. Attempt to move
> > * tasks if there is an imbalance.
> > @@ -9845,15 +9817,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> > if (!busiest->active_balance) {
> > busiest->active_balance = 1;
> > busiest->push_cpu = this_cpu;
> > + busiest->active_balance_target = busiest->curr;
> > active_balance = 1;
> > }
> > - raw_spin_unlock_irqrestore(&busiest->lock, flags);
> >
> > - if (active_balance) {
> > - stop_one_cpu_nowait(cpu_of(busiest),
> > - active_load_balance_cpu_stop, busiest,
> > - &busiest->active_balance_work);
> > - }
> > + if (active_balance)
> > + queue_balance_callback(busiest, &per_cpu(active_balance_head, busiest->cpu), active_load_balance_cpu_stop);
>
>
> When you defer the active load balance of p into a balance_callback
> (from __schedule()) p has to stop running on busiest, right?

Right. But p doesn't have to stop running immediately.
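
To make the deferral concrete, here is a small userspace sketch. It is not
kernel code: every toy_* name is made up for this mail, and the model assumes
only that a queued balance callback runs at the end of the next pass through
the scheduler on that CPU. Until then, the running task simply keeps running.

```c
#include <stddef.h>

/* Toy userspace model; none of these types are the kernel's. */
struct toy_rq;

struct toy_callback {
    struct toy_callback *next;
    void (*func)(struct toy_rq *rq);
    int queued;               /* simplification: explicit "already queued" flag */
};

struct toy_rq {
    int cpu;
    int curr_task;            /* id of the task currently running here */
    int prev_task;            /* id of the task that was just switched out */
    int migrated_task;        /* task moved away by the callback, or -1 */
    struct toy_callback *balance_callback;
};

/* Queue a callback once; a second attempt while it is pending is ignored. */
void toy_queue_balance_callback(struct toy_rq *rq, struct toy_callback *head,
                                void (*func)(struct toy_rq *rq))
{
    if (head->queued)
        return;
    head->func = func;
    head->next = rq->balance_callback;
    head->queued = 1;
    rq->balance_callback = head;
}

/* Run and clear every pending callback on this runqueue. */
void toy_run_balance_callbacks(struct toy_rq *rq)
{
    struct toy_callback *head = rq->balance_callback;

    rq->balance_callback = NULL;
    while (head) {
        struct toy_callback *next = head->next;

        head->next = NULL;
        head->queued = 0;
        head->func(rq);
        head = next;
    }
}

/* Model of a scheduler pass: switch tasks, then run the pending callbacks. */
void toy_schedule(struct toy_rq *rq, int next_task)
{
    rq->prev_task = rq->curr_task;
    rq->curr_task = next_task;
    toy_run_balance_callbacks(rq);
}

/* The deferred work: record that the task just switched out was pulled. */
void toy_active_balance(struct toy_rq *rq)
{
    rq->migrated_task = rq->prev_task;
}
```

In this model, queueing toy_active_balance() changes nothing by itself. The
migration is recorded only when toy_schedule() is next called on that
runqueue, which is the delay being discussed here.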

> Deferring active load balance for too long might defeat the purpose
> of load balance, which has to happen now.
>

Maybe we need to run some benchmarks to measure whether it is acceptable to
defer the active load balance.
But I don't yet know which benchmark is suitable.

> Also, before balance_callback gets invoked, active balancing might try
> to migrate p again and again but fail because `busiest->active_balance`
> is still 1 (you kept this former synchronization meant for
> active_balance_work). In this case the likelihood increases that one of
> the error conditions in active_load_balance_cpu_stop() is hit when it's
> finally called.
>

That does seem to be a problem. I will think about it.
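
If I understand the concern correctly, it can be sketched with a small
userspace model like the one below. This is not kernel code: the toy_* names
and fields are made up for this mail, and the staleness check is only loosely
modelled on the "is there any task to move" (nr_running <= 1) test in
active_load_balance_cpu_stop().

```c
/* Toy userspace model of the guard flag; not kernel code. */
struct toy_busiest {
    int active_balance;   /* 1 while a request is pending */
    int push_cpu;         /* CPU that asked for the task */
    int curr;             /* task currently running */
    int target;           /* task recorded when the request was accepted */
    int nr_running;       /* runnable tasks on this runqueue */
    int rejected;         /* requests dropped while one was pending */
};

/* An idle CPU asks for the running task; returns 1 if accepted, 0 if not. */
int toy_request_active_balance(struct toy_busiest *rq, int this_cpu)
{
    if (rq->active_balance) {
        rq->rejected++;   /* flag still set: this attempt is wasted */
        return 0;
    }
    rq->active_balance = 1;
    rq->push_cpu = this_cpu;
    rq->target = rq->curr;
    return 1;
}

/*
 * The deferred callback. Returns the migrated task id, or -1 when the
 * request went stale because nothing is left to move.
 */
int toy_balance_callback(struct toy_busiest *rq)
{
    int moved = -1;

    if (rq->active_balance && rq->nr_running > 1) {
        moved = rq->target;
        rq->nr_running--;
    }
    rq->active_balance = 0;   /* only now can a new request be accepted */
    return moved;
}
```

Every call to toy_request_active_balance() made between the first request and
toy_balance_callback() is rejected, and the longer that gap is, the more
likely nr_running has changed by the time the callback finally runs.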

> What's wrong with the FIFO-1 "stopper" for CFS active lb?
>

We would have to introduce another per-CPU kernel thread, and I don't know
whether that is worth doing.


--
Thanks
Yafang