Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

From: Peter Zijlstra
Date: Mon Mar 30 2015 - 08:25:19 EST


On Mon, Mar 30, 2015 at 01:03:03PM +0100, Morten Rasmussen wrote:
> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
> >
> > > I agree that it is hard to predict how many additional cpus you need,
> > > but I don't think you necessarily need that information as long as you
> > > start by filling up the cpu that was kicked to do the
> > > nohz_idle_balance() first.
> >
> > > Reducing unnecessary wakeups is quite important for energy consumption
> > > and something a lot of effort is put into. You really don't want to wake
> > > up another cluster/package unnecessarily just because there was only one
> > > nohz-idle cpu left in the previous one which could have handled the
> > > additional load. It gets even worse if the other cluster is less
> > > energy-efficient (big.LITTLE).
> >
> > So the only way to get tasks to cross your cluster is by balancing that
> > domain. At this point we'll compute sg stats for either group
> > (=cluster).
> >
> > The only thing we need to ensure is that it doesn't view the small
> > cluster as overloaded (as long as it really isn't of course), as long as
> > its not viewed as overloaded it will not pull tasks from it into the big
> > cluster, no matter how many ILBs we run before the ILB duty cpu's
> > rebalance_domains() call.
> >
> > I'm really not seeing the problem here.
>
> I see. The group_classify() should take care of it in all cases of
> balancing across clusters. You would be iterating over all cpus in the
> other cluster running rebalance_domains() if the balancer cpu happens to
> be the last one in the little cluster though. However, within the
> cluster (in case you have 2 or more nohz-idle cpus) you still take a
> double hit. No?

It can yes, but typically not I think. This all could use some 'help'
for sure.

So the thing is, find_new_ilb() simply selects the first idle_cpus_mask
cpu, while at the same time, nohz_idle_balance() will iterate the
idle_cpus_mask with the first, being first (obviously).

So it is very like that if we migrate on the ILB it is in fact to the
current CPU.

In case we cannot, we have no choice but to wake up a second idle,
nothing really to be done about that.

To put it another way, for ILB purposes the rebalance_domains() call is
mostly superfluous. The only other case is if the selected ILB target
became non-idle between being selected and getting to run the softirq
handler. At which point we should wake another anyhow too.

Maybe something like the below helps -- albeit it could use a comment
too of course.

---
kernel/sched/fair.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..b879d4b3b599 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7624,11 +7624,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
*/
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
int this_cpu = this_rq->cpu;
- struct rq *rq;
int balance_cpu;
+ struct rq *rq;
+ bool done = false;

if (idle != CPU_IDLE ||
!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
@@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
break;

rq = cpu_rq(balance_cpu);
+ if (rq == this_rq)
+ done = true;

/*
* If time for next balance is due,
@@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
nohz.next_balance = this_rq->next_balance;
end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+ return done;
}

/*
@@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
return kick;
}
#else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; }
#endif

/*
@@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h)
* load balance only within the local sched_domain hierarchy
* and abort nohz_idle_balance altogether if we pull some load.
*/
- nohz_idle_balance(this_rq, idle);
- rebalance_domains(this_rq, idle);
+ if (!nohz_idle_balance(this_rq, idle))
+ rebalance_domains(this_rq, idle);
}

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/