Re: [RFC PATCH] sched: Reduce overestimating avg_idle

From: Peter Zijlstra
Date: Wed Jul 31 2013 - 05:53:30 EST

On Wed, Jul 31, 2013 at 02:37:52AM -0700, Jason Low wrote:
> The avg_idle value may sometimes be overestimated, which may cause new idle
> load balance to be attempted more often than it should. Currently, when
> avg_idle gets updated, if the delta exceeds some max value (default 1000000 ns),
> the entire avg gets set to the max value, regardless of what the previous avg
> was. So if a CPU remains idle for 200,000 ns most of the time, and if the CPU
> goes idle for 1,200,000 ns, the average is then pushed up to 1,000,000 ns when
> it should be less.
> Additionally, once the avg_idle is at its max, it may take a while to pull the
> avg down to a value that it should be. In the above example, after the avg idle
> is set to the max value of 1000000 ns, the CPU's idle durations need to
> be 200000 ns for the next 8 occurrences before the avg falls below the migration
> cost value.
> This patch attempts to avoid these situations by always updating the avg_idle
> value first with the function call to update_avg(). Then, if the avg_idle
> exceeds the max avg value, the avg gets set to the max. Also, this patch lowers
> the max avg_idle value to migration_cost * 1.5 instead of migration_cost * 2 to
> reduce the time it takes to pull the avg idle to a lower value after long idles.

Indeed, this seems quite sensible.
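
For anyone who wants to check the numbers in the changelog, here is a
rough userspace model in Python (not kernel code). It assumes update_avg()
is a 1/8-weight running average and that migration_cost is 500000 ns, as
implied by the quoted max of 1000000 ns being migration_cost * 2. The
helper names old_update(), new_update() and idles_until_below() are made
up for this illustration.

```python
MIGRATION_COST_NS = 500_000  # assumed default, implied by max = 2 * migration_cost


def update_avg(avg, sample):
    """Model of a 1/8-weight running average: avg += (sample - avg) / 8."""
    return avg + ((sample - avg) >> 3)


def old_update(avg, delta, max_idle=2 * MIGRATION_COST_NS):
    """Behaviour described as current: a delta above max overwrites the avg."""
    if delta > max_idle:
        return max_idle
    return update_avg(avg, delta)


def new_update(avg, delta, max_idle=MIGRATION_COST_NS * 3 // 2):
    """Proposed behaviour: fold the sample in first, then clamp to the max."""
    return min(update_avg(avg, delta), max_idle)


def idles_until_below(avg, sample, threshold=MIGRATION_COST_NS):
    """Count how many idles of length `sample` it takes for avg to drop below threshold."""
    n = 0
    while avg >= threshold:
        avg = update_avg(avg, sample)
        n += 1
    return n


if __name__ == "__main__":
    print(old_update(200_000, 1_200_000))         # 1000000
    print(new_update(200_000, 1_200_000))         # 325000
    print(idles_until_below(1_000_000, 200_000))  # 8
```

Under these assumptions, a single 1,200,000 ns idle moves a 200,000 ns
average to 325,000 ns with the new ordering, rather than pinning it at the
max and needing 8 short idles to get back under migration_cost.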

> With this change, I got some decent performance boosts in AIM7 workloads on an
> 8 socket machine on the 3.10 kernel. In particular, it boosted the AIM7 fserver
> workload by about 20% when running it with a high # of users.

Nice :-)

> An avg_idle related question that I have is does migration_cost in idle balance
> need to be the same as the migration_cost in task_hot()? Can we keep
> migration_cost default value used in task_hot() the same, but have a different
> default value or increase migration_cost only when comparing it with avg_idle in
> idle balance?

No, they're quite unrelated. I think you can measure the max time we've
ever spent in newidle balance and use that to clip the values.
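
As a rough illustration of that clipping idea, here is a userspace Python
model, not kernel code. The IdleStats class, its field names and the
clip_factor of 2 are all assumptions made for the sketch; the only thing
taken from the discussion is "track the max newidle balance cost and use
it to clip avg_idle".

```python
class IdleStats:
    """Hypothetical per-CPU bookkeeping; names are illustrative, not kernel fields."""

    def __init__(self, clip_factor=2):
        self.clip_factor = clip_factor  # assumed multiple of the max cost
        self.max_cost_ns = 0            # longest newidle balance seen so far
        self.avg_idle_ns = 0            # running average of idle durations

    def note_balance_cost(self, cost_ns):
        """Remember the longest time ever spent in a newidle balance."""
        self.max_cost_ns = max(self.max_cost_ns, cost_ns)

    def note_idle(self, idle_ns):
        """Fold an idle duration into the average, then clip it."""
        avg = self.avg_idle_ns + ((idle_ns - self.avg_idle_ns) >> 3)
        self.avg_idle_ns = min(avg, self.clip_factor * self.max_cost_ns)


if __name__ == "__main__":
    stats = IdleStats()
    stats.note_balance_cost(100_000)
    stats.note_balance_cost(60_000)   # shorter run, max stays at 100000
    stats.note_idle(5_000_000)        # long idle, clipped to 2 * max cost
    print(stats.max_cost_ns, stats.avg_idle_ns)
```

The point of clipping against a measured cost rather than a fixed sysctl is
that one very long idle cannot push the average far above what a balance
pass has ever actually cost.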

Similarly, I've thought about how we updated the sd->avg_cost in the
previous patches and wondered if we should not track max_cost.

The 'only' down-side I could come up with is that it's all run from
SoftIRQ context, which means IRQ/NMI/SMI can all stretch/warp the time it
takes to actually do the idle balance.

The idea behind using the max is that we want to reduce the chance we
overrun the averages and consume time we should have spent doing useful