When running benchmarks on an 8-socket, 80-core machine with a 3.10 kernel,
there can be a lot of contention in idle_balance() and related functions.
On many AIM7 workloads in which CPUs go idle very often and idle balance
gets called frequently, it actually lowers performance.
Since idle balance often helps performance (when it is not overused), I
looked into skipping idle balance attempts only when they occur too
frequently.
This RFC patch keeps track of the approximate "average" time between
idle balance attempts per CPU. Each time idle_balance() is invoked, it
computes the duration since the last idle_balance() on the current CPU.
The average time between idle balance attempts is then updated using a
method very similar to how rq->avg_idle is computed.
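As a rough sketch of that update (not the patch itself; the rq fields
prev_idle_balance_time and avg_time_between_ib, and the choice of cap,
are hypothetical names for illustration), the same exponentially
weighted moving average used for rq->avg_idle via update_avg() in
kernel/sched/core.c could be applied to the inter-balance interval:

    /* EWMA helper, as in kernel/sched/core.c */
    static void update_avg(u64 *avg, u64 sample)
    {
            s64 diff = sample - *avg;

            /* Move the average 1/8th of the way toward the sample. */
            *avg += diff >> 3;
    }

    /* Called on every idle_balance() entry; field names hypothetical. */
    static void update_idle_balance_interval(struct rq *this_rq)
    {
            u64 now = this_rq->clock;
            u64 delta = now - this_rq->prev_idle_balance_time;

            /*
             * Cap the sample (hypothetical multiple of the limit) so
             * that one very long gap cannot inflate the average.
             */
            u64 max = 16 * sysctl_sched_idle_balance_limit;

            if (delta > max)
                    this_rq->avg_time_between_ib = max;
            else
                    update_avg(&this_rq->avg_time_between_ib, delta);

            this_rq->prev_idle_balance_time = now;
    }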
Once the average time between idle balance attempts drops below a certain
value (in this patch, sysctl_sched_idle_balance_limit), idle_balance()
for that CPU will be skipped. The average time between idle balances
continues to be updated even while idle balance is being skipped. The
initial/maximum value of the average is set a lot higher than the
threshold, though, both to make sure that the avg does not fall below
the threshold until the sample size is large and to prevent the avg from
being overestimated.
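Putting it together, a minimal sketch of the check at the top of
idle_balance(), assuming the hypothetical names above (the real patch
may structure this differently):

    void idle_balance(int this_cpu, struct rq *this_rq)
    {
            /* Keep the average current even when we end up skipping. */
            update_idle_balance_interval(this_rq);

            /*
             * Skip the balance if attempts on this CPU are, on
             * average, arriving faster than the sysctl limit allows.
             */
            if (this_rq->avg_time_between_ib <
                sysctl_sched_idle_balance_limit)
                    return;

            /* ... proceed with the usual idle load balancing ... */
    }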
This change improved the performance of many AIM7 workloads at 1, 2, 4,
and 8 sockets on the 3.10 kernel. The most significant differences were
at 8 sockets with HT enabled. The table below compares the average jobs
per minute at 1100-2000 users between the vanilla 3.10 kernel and the
3.10 kernel with this patch. I included data for both hyperthreading
disabled and enabled, and used numactl to restrict AIM7 to run on a
certain number of nodes. I only included data in which the % difference
was beyond the 2% noise range.