Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

From: Vincent Guittot
Date: Wed Jan 24 2018 - 03:25:47 EST

Next message: Serge Semin: "Re: [PATCH 14/14] MIPS: memblock: Deactivate bootmem allocator"
Previous message: Peter Zijlstra: "Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu"
Next in thread: Dietmar Eggemann: "Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

Le Thursday 18 Jan 2018 à 10:38:07 (+0000), Morten Rasmussen a écrit :
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> > Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
> > > Hi Peter,
> > >
> > > On 22 December 2017 at 21:42, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > > >> is before we try and do funny things.
> > > >
> > > > So now it should have a 32ms tick for up to .5s when the system goes
> > > > completely idle.
> > > >
> > > > No idea how bad that is..
> > >
> > > I have tested your branch but the timer doesn't seem to fire correctly
> > > because i can still see blocked load in the use case i have run.
> > > I haven't found the reason yet
> >
> > Hi Peter,
> >
> > With the patch below on top of your branch, the blocked loads are updated and
> > decayed regularly. The main differences are:
> > - It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle.
> > The main drawback of this solution is that the load is blocked when the
> > system is fully idle with the advantage of not waking up a fully idle
> > system. We have to wait for the next tick or newly idle event for updating
> > blocked load when the system leaves idle stat which can be up to a tick long.
> > If this is too long, we can check for kicking ilb when task wakes up so the
> > blocked load will be updated as soon as the system leaves idle state.
> > The main advantage is that we don't wake up a fully idle system every 32ms to
> > update blocked load that will be not used.
> > - I'm working on one more improvement to use nohz_idle_balance in the newly
> > idle case when the system is not overloaded and
> > (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to
> > use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed
> > this_rq->avg_idle. This will remove some calls to kick_ilb and some wake up
> > of an idle cpus.
>
> This sound like what I meant in my other reply :-)
>
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy back other events to
> make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too much
time. This removes part of ilb that are kicked on an idle cpu for updating
the blocked load but the ratio really depends on when the tick happens compared
to a cpu becoming idle and the 32ms boundary. I have an additionnal patch that
enables to update the blocked loads when a cpu becomes idle 1 period before
kicking an ilb and there is far less ilb because we give more chance to the
newly idle case (time_after is replaced by time_after_eq in idle_balance()).

The patch also uses a function cfs_rq_has_blocked, which only checks the
util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too. This
reduce significantly the number of update of blocked load. the *_avg will be
fully decayed in around 300~400ms but it's far longer for the *_sum which have
a higher resolution and we can easily reach almost seconds. But only the *_avg
are used to make decision so keeping some blocked *_sum is acceptable.

---
kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
return true;
}

+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->avg.load_avg)
+ return true;
+
+ if (cfs_rq->avg.util_avg)
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED

static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
*/
if (cfs_rq_is_decayed(cfs_rq))
list_del_leaf_cfs_rq(cfs_rq);
- else
+
+ /* Don't need periodic decay once load/util_avg are null */
+ if (cfs_rq_has_blocked(cfs_rq))
done = false;
}

@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (cfs_rq_is_decayed(cfs_rq))
+ if (cfs_rq_has_blocked(cfs_rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
*next_balance = next;
}

+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
static void kick_ilb(unsigned int flags);

/*
@@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

- if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
+ /*
+ * Update blocked idle load if it has not been done for a
+ * while. Try to do it locally before entering idle but kick a
+ * ilb if it takes too much time and might delay next local
+ * wake up
+ */
+ if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
+ !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);

goto out;
@@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
return;

+ rq->has_blocked_load = 1;
if (rq->nohz_tick_stopped)
return;

@@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
return;

rq->nohz_tick_stopped = 1;
- rq->has_blocked_load = 1;

cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
atomic_inc(&nohz.nr_cpus);
@@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
* enable the periodic update of the load of idle cpus
*/
atomic_set(&nohz.stats_state, 1);
-
}
#else
static inline void nohz_balancer_kick(struct rq *rq) { }
@@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

#ifdef CONFIG_NO_HZ_COMMON
/*
- * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
- * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ * Internal function that runs load balance for all idle cpus. The load balance
+ * can be a simple update of blocked load or a complete load balance with
+ * tasks movement depending of flags.
+ * For newly idle mode, we abort the loop if it takes too much time and return
+ * false to notify that the loop has not be completed and a ilb shoud be kick.
*/
-static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
{
/* Earliest time when we have to do rebalance again */
unsigned long now = jiffies;
@@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
bool has_blocked_load = false;
int update_next_balance = 0;
int this_cpu = this_rq->cpu;
- unsigned int flags;
int balance_cpu;
+ int ret = false;
struct rq *rq;
-
- if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
- return false;
-
- if (idle != CPU_IDLE) {
- atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- return false;
- }
-
- /*
- * barrier, pairs with nohz_balance_enter_idle(), ensures ...
- */
- flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- if (!(flags & NOHZ_KICK_MASK))
- return false;
+ u64 curr_cost = 0;

SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);

@@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
atomic_set(&nohz.stats_state, 0);

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+ u64 t0, domain_cost;
+
+ t0 = sched_clock_cpu(this_cpu);
+
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;

@@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
*/
if (need_resched()) {
has_blocked_load = true;
- break;
+ goto abort;
+ }
+
+ /*
+ * If the update is done while CPU becomes idle, we abort
+ * the update when its cost is higher than the average idle
+ * time in orde to not delay a possible wake up.
+ */
+ if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
+ has_blocked_load = true;
+ goto abort;
}

rq = cpu_rq(balance_cpu);
@@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (time_after_eq(jiffies, rq->next_balance)) {
struct rq_flags rf;

- rq_lock_irq(rq, &rf);
+ rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
cpu_load_update_idle(rq);
- rq_unlock_irq(rq, &rf);
+ rq_unlock_irqrestore(rq, &rf);

if (flags & NOHZ_BALANCE_KICK)
rebalance_domains(rq, CPU_IDLE);
@@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
next_balance = rq->next_balance;
update_next_balance = 1;
}
+
+ domain_cost = sched_clock_cpu(this_cpu) - t0;
+ curr_cost += domain_cost;
+
}

- update_blocked_averages(this_cpu);
- has_blocked_load |= this_rq->has_blocked_load;
+ /* Newly idle CPU doesn't need an update */
+ if (idle != CPU_NEWLY_IDLE) {
+ update_blocked_averages(this_cpu);
+ has_blocked_load |= this_rq->has_blocked_load;
+ }

if (flags & NOHZ_BALANCE_KICK)
rebalance_domains(this_rq, CPU_IDLE);
@@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
WRITE_ONCE(nohz.next_stats,
now + msecs_to_jiffies(LOAD_AVG_PERIOD));

+ /* The full idle balance loop has been done */
+ ret = true;
+
+abort:
/* There is still blocked load, enable periodic update */
if (has_blocked_load)
atomic_set(&nohz.stats_state, 1);
@@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (likely(update_next_balance))
nohz.next_balance = next_balance;

+ return ret;
+}
+
+/*
+ * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
+ * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ */
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+{
+ int this_cpu = this_rq->cpu;
+ unsigned int flags;
+
+ if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+ return false;
+
+ if (idle != CPU_IDLE) {
+ atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+ return false;
+ }
+
+ /*
+ * barrier, pairs with nohz_balance_enter_idle(), ensures ...
+ */
+ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+ if (!(flags & NOHZ_KICK_MASK))
+ return false;
+
+ _nohz_idle_balance(this_rq, flags, idle);
+
return true;
}
#else
--
2.7.4

Next message: Serge Semin: "Re: [PATCH 14/14] MIPS: memblock: Deactivate bootmem allocator"
Previous message: Peter Zijlstra: "Re: [PATCH v2] perf/core: Fix installing cgroup event into cpu"
Next in thread: Dietmar Eggemann: "Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]