Re: [RFC PATCH] sched: Reduce overestimating avg_idle

From: Rik van Riel
Date: Wed Jul 31 2013

On 07/31/2013 05:37 AM, Jason Low wrote:
The avg_idle value may sometimes be overestimated, which may cause new idle
load balance to be attempted more often than it should. Currently, when
avg_idle gets updated, if the delta exceeds some max value (default 1000000 ns),
the entire avg gets set to the max value, regardless of what the previous avg
was. So if a CPU is normally idle for about 200,000 ns at a time and then goes
idle once for 1,200,000 ns, the average is pushed straight up to 1,000,000 ns
when it should be much less.

Additionally, once avg_idle is at its max, it may take a while to pull the
avg back down to where it should be. In the above example, after avg_idle
is set to the max value of 1000000 ns, the CPU's idle duration needs to
be 200000 ns for the next 8 occurrences before the avg falls below the
migration cost value.
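
The count of 8 can be checked with a small user-space sketch. It assumes
update_avg() moves the average one eighth of the way toward each new sample
(the kernel helper uses a shift by 3; the division here keeps the sketch
portable) and that the migration cost is 500000 ns, half of the 1000000 ns max
quoted above. wakeups_until_below() is a hypothetical helper written for this
illustration, not a kernel function.

```c
#include <stdint.h>

/* Assumed to mirror the scheduler's update_avg(): step 1/8 of the way to the sample. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)sample - (int64_t)*avg;

	*avg += diff / 8;
}

/*
 * Hypothetical helper: count how many consecutive idle periods of `sample` ns
 * it takes for an average starting at `start` ns to drop below `threshold` ns.
 * The iteration cap guards against inputs where the average can never get
 * below the threshold.
 */
static unsigned int wakeups_until_below(uint64_t start, uint64_t sample,
					 uint64_t threshold)
{
	uint64_t avg = start;
	unsigned int n = 0;

	while (avg >= threshold && n < 1000) {
		update_avg(&avg, sample);
		n++;
	}
	return n;
}
```

Under those assumptions, wakeups_until_below(1000000, 200000, 500000) returns 8.
Starting from 750000 ns instead (migration_cost * 1.5), it returns 5.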

This patch attempts to avoid these situations by always updating the avg_idle
value first with the function call to update_avg(). Then, if the avg_idle
exceeds the max avg value, the avg gets set to the max. Also, this patch lowers
the max avg_idle value to migration_cost * 1.5 instead of migration_cost * 2 to
reduce the time it takes to pull the avg idle to a lower value after long idles.
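
A rough side-by-side of the two orderings, as a standalone sketch rather than
the actual diff. MIGRATION_COST_NS, avg_idle_current() and avg_idle_proposed()
are names made up for this illustration; the 500000 ns value is assumed from
the 1000000 ns max (migration_cost * 2) mentioned above, and update_avg() is
assumed to step one eighth of the way toward the sample.

```c
#include <stdint.h>

/* Assumed default migration cost: half of the old 1000000 ns max. */
#define MIGRATION_COST_NS 500000ULL

/* Assumed to mirror the scheduler's update_avg(): step 1/8 of the way to the sample. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)sample - (int64_t)*avg;

	*avg += diff / 8;
}

/* Current behaviour as described: a delta over the max replaces the avg outright. */
static uint64_t avg_idle_current(uint64_t avg, uint64_t delta)
{
	uint64_t max = 2 * MIGRATION_COST_NS;

	if (delta > max)
		avg = max;
	else
		update_avg(&avg, delta);
	return avg;
}

/* Proposed behaviour: always fold the delta in first, then clamp to the lower max. */
static uint64_t avg_idle_proposed(uint64_t avg, uint64_t delta)
{
	uint64_t max = MIGRATION_COST_NS + MIGRATION_COST_NS / 2;

	update_avg(&avg, delta);
	if (avg > max)
		avg = max;
	return avg;
}
```

For the example above (avg of 200000 ns, one 1200000 ns idle period), the
first variant yields 1000000 ns and the second yields 325000 ns.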

With this change, I got some decent performance boosts in AIM7 workloads on an
8-socket machine on the 3.10 kernel. In particular, it boosted the AIM7 fserver
workload by about 20% when running it with a high number of users.

An avg_idle related question I have: does the migration_cost used in idle
balance need to be the same as the migration_cost used in task_hot()? Can we
keep the default migration_cost value used in task_hot() the same, but have a
different default value, or increase migration_cost, only when comparing it
with avg_idle in idle balance?

Signed-off-by: Jason Low <jason.low2@xxxxxx>

Reviewed-by: Rik van Riel <riel@xxxxxxxxxx>

