[PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

From: Matt Fleming
Date: Wed Feb 08 2017 - 08:38:35 EST

The calculation for the next sample window when exiting NO_HZ idle
does not handle the fact that we may not have reached the next sample
window yet, i.e. that we came out of idle between sample windows.

If we wake from NO_HZ idle after the this_rq->calc_load_update window
that was pending when we went idle, but before the next sample window,
we will add an unnecessary LOAD_FREQ delay to the load average
accounting, delaying any update for potentially ~9 seconds.
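
As a rough userspace illustration of the paragraph above (a sketch, not
kernel code): the two helpers below mirror the old and the patched
next-window calculation. HZ=1000 and the function names
next_window_old()/next_window_new() are assumptions made for this example;
the time_*() macros are userspace copies of the kernel's wrap-safe jiffies
comparisons.

```c
/* Assumed for illustration only: HZ=1000. LOAD_FREQ matches the kernel's 5*HZ+1. */
#define HZ        1000UL
#define LOAD_FREQ (5 * HZ + 1)

/* Userspace copies of the kernel's wrap-safe jiffies comparisons. */
#define time_after_eq(a, b)  ((long)((a) - (b)) >= 0)
#define time_before(a, b)    ((long)((a) - (b)) < 0)
#define time_in_range_open(a, b, c) \
	(time_after_eq(a, b) && time_before(a, c))

/*
 * Old logic (hypothetical helper name): skips a whole LOAD_FREQ period
 * whenever 'now' is earlier than 'update + 10', which also catches the
 * case where 'now' has not reached 'update' at all.
 */
static unsigned long next_window_old(unsigned long now, unsigned long update)
{
	unsigned long next = update;

	if (time_before(now, next + 10))
		next += LOAD_FREQ;
	return next;
}

/*
 * Patched logic (hypothetical helper name): skips a period only when
 * 'now' lies inside the current window, i.e. in [update, update + 10).
 */
static unsigned long next_window_new(unsigned long now, unsigned long update)
{
	unsigned long next = update;

	if (time_in_range_open(now, next, next + 10))
		next += LOAD_FREQ;
	return next;
}
```

With a global calc_load_update of 10001 and a wakeup at jiffies 8000
(between sample windows), the old helper yields 15002 while the patched one
yields 10001, so the old code defers the per-CPU update by one extra
LOAD_FREQ period. For a wakeup at 10005 (inside the window) both yield
15002.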

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce on kernels before

commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> # v3.5+
Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
---
kernel/sched/loadavg.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a2d6eb71f06b..a7a6f3646970 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -199,6 +199,7 @@ void calc_load_enter_idle(void)
void calc_load_exit_idle(void)
{
struct rq *this_rq = this_rq();
+ unsigned long next_window;

/*
* If we're still before the sample window, we're done.
@@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
* We woke inside or after the sample window, this means we're already
* accounted through the nohz accounting, so skip the entire deal and
* sync up for the next window.
+ *
+ * The next window is 'calc_load_update' if we haven't reached it yet,
+ * and 'calc_load_update + 10' if we're inside the current window.
*/
- this_rq->calc_load_update = calc_load_update;
- if (time_before(jiffies, this_rq->calc_load_update + 10))
- this_rq->calc_load_update += LOAD_FREQ;
+ next_window = calc_load_update;
+ if (time_in_range_open(jiffies, next_window, next_window + 10))
+ next_window += LOAD_FREQ;
+ this_rq->calc_load_update = next_window;
}

static long calc_load_fold_idle(void)