Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

From: Peter Zijlstra
Date: Thu Feb 09 2017 - 10:55:21 EST

Next message: Arnd Bergmann: "[PATCH] cpufreq: ti: move to new dev_pm_opp_set_supported_hw() API"
Previous message: Laura Abbott: "Re: [PATCH] w1: ds2490: USB transfer buffers need to be DMAable"
In reply to: kbuild test robot: "Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting"
Next in thread: Frederic Weisbecker: "Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> The calculation for the next sample window when exiting NOH_HZ idle
> does not handle the fact that we may not have reached the next sample
> window yet, i.e. that we came out of idle between sample windows.
>
> If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> window time when we want idle but before the next sample window, we
> will add an unnecessary LOAD_FREQ delay to the load average
> accounting, delaying any update for potentially ~9seconds.
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().

Right, so differently put; the problem is that we check against the
'stale' rq->calc_load_update, while the current and effective period
boundary is 'calc_load_update'.

So, when rq->calc_load_update < jiffies < calc_load_update, we end up
setting the next-update to calc_load_update+LOAD_FREQ, where it should
have been calc_load_update.

> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index a2d6eb71f06b..a7a6f3646970 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c

> @@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
> * We woke inside or after the sample window, this means we're already
> * accounted through the nohz accounting, so skip the entire deal and
> * sync up for the next window.
> + *
> + * The next window is 'calc_load_update' if we haven't reached it yet,
> + * and 'calc_load_update + 10' if we're inside the current window.
> */
> + next_window = calc_load_update;
> +
> + if (time_in_range_open(jiffies, next_window, next_window + 10)
> + next_window += LOAD_FREQ;
> +
> + this_rq->calc_load_update = next_window;
> }

So I don't much like the time_in_range_open() thing. The simpler patch
which you tested to also work was:

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..cfb47bd0ee50 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -201,6 +201,8 @@ void calc_load_exit_idle(void)
{
struct rq *this_rq = this_rq();

+ this_rq->calc_load_update = calc_load_update;
+
/*
* If we're still before the sample window, we're done.
*/
@@ -212,7 +214,6 @@ void calc_load_exit_idle(void)
* accounted through the nohz accounting, so skip the entire deal and
* sync up for the next window.
*/
- this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update + 10))
this_rq->calc_load_update += LOAD_FREQ;
}

But the problem there is that we unconditionally issue that store. Now
I've no idea how much of a problem that is, and it certainly is the
simplest form (+- comments that need updating), so maybe that makes
sense.

Alternatively, something like:

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..3dd4ce6fe151 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -207,12 +207,15 @@ void calc_load_exit_idle(void)
if (time_before(jiffies, this_rq->calc_load_update))
return;

+ this_rq->calc_load_update = calc_load_update;
+ if (time_before(jiffies, this_rq->calc_load_update))
+ return;
+
/*
* We woke inside or after the sample window, this means we're already
* accounted through the nohz accounting, so skip the entire deal and
* sync up for the next window.
*/
- this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update + 10))
this_rq->calc_load_update += LOAD_FREQ;
}

might be another solution.

Irrespective the above though; should we not make this:

+ this_rq->calc_load_update = READ_ONCE(calc_load_update);

because if for some reason we do a double load of calc_load_update and
see two different values, weird stuff could happen.

And because, on general principle, a READ_ONCE() should be paired with a
WRITE_ONCE(), that should be done too I suppose.

Next message: Arnd Bergmann: "[PATCH] cpufreq: ti: move to new dev_pm_opp_set_supported_hw() API"
Previous message: Laura Abbott: "Re: [PATCH] w1: ds2490: USB transfer buffers need to be DMAable"
In reply to: kbuild test robot: "Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting"
Next in thread: Frederic Weisbecker: "Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]