Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

From: Giovanni Gherdovich
Date: Wed Oct 31 2018 - 14:32:24 EST

On Fri, 2018-10-26 at 11:12 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> [... cut ...]
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.  First, it
> doesn't use "correction factors" for the time till the closest timer,
> but instead it tries to correlate the measured idle duration values
> with the available idle states and use that information to pick up
> the idle state that is most likely to "match" the upcoming CPU idle
> interval.  Second, it doesn't take the number of "I/O waiters" into
> account at all and the pattern detection code in it tries to avoid
> taking timer wakeups into account.  It also only uses idle duration
> values less than the current time till the closest timer (with the
> tick excluded) for that purpose.
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> ---
> The v2 is a re-write of major parts of the original patch.
> The approach the same in general, but the details have changed significantly
> with respect to the previous version.  In particular:
> * The decay of the idle state metrics is implemented differently.
> * There is a more "clever" pattern detection (sort of along the lines
>   of what the menu does, but simplified quite a bit and trying to avoid
>   including timer wakeups).
> * The "promotion" from the "polling" state is gone.
> * The "safety net" wakeups are treated as the CPU might have been idle
>   until the closest timer.
> I'm running this governor on all of my systems now without any
> visible adverse effects.
> Overall, it selects deeper idle states more often than menu on average, but
> that doesn't seem to make a significant difference in the majority of cases.
> In this preliminary revision it overtakes menu as the default governor
> for tickless systems (due to the higher rating), but that is likely
> to change going forward.  At this point I'm mostly asking for feedback
> and possibly testing with whatever workloads you can throw at it.
> The patch should apply on top of 4.19, although I'm running it on
> top of my linux-next branch.  This version hasn't been run through
> benchmarks yet and that likely will take some time as I will be
> traveling quite a bit during the next few weeks.
> ---
>  drivers/cpuidle/Kconfig            |   11
>  drivers/cpuidle/governors/Makefile |    1
>  drivers/cpuidle/governors/teo.c    |  491 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 503 insertions(+)
> [... cut ...]

Hello Rafael,

your new governor has a neutral impact on performance, as you expected. This is
a positive result, since the purpose of "teo" is to give improved
predictions on idle times without regressing on the performance side. There
are swings here and there but nothing looks extremely bad. v2 is largely
equivalent to v1 in my tests, except for sockperf and netperf on the
Haswell machine (v2 slightly worse) and tbench on the Skylake machine
(again v2 slightly worse).

I've tested your patches by applying them on top of v4.18 (plus the backport
necessary for v2, as Doug helpfully noted), simply because it was the latest
release when I started preparing this.

I've tested it on three machines, with different generations of Intel CPUs:

* single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
* two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
* two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)


These are the workloads where no noticeable difference is measured (on both
v1 and v2, all machines), together with the corresponding MMTests[1]
configuration file name:

* pgbench read-only on xfs, pgbench read/write on xfs
        * global-dhp__db-pgbench-timed-ro-small-xfs
        * global-dhp__db-pgbench-timed-rw-small-xfs
* siege
        * global-dhp__http-siege
* hackbench, pipetest
        * global-dhp__scheduler-unbound
* Linux kernel compilation
        * global-dhp__workload_kerndevel-xfs
* NASA Parallel Benchmarks, C-Class (linear algebra; run both with OpenMP
  and OpenMPI, over xfs)
        * global-dhp__nas-c-class-mpi-full-xfs
        * global-dhp__nas-c-class-omp-full
* FIO (Flexible IO) in several configurations
        * global-dhp__io-fio-randread-async-randwrite-xfs
        * global-dhp__io-fio-randread-async-seqwrite-xfs
        * global-dhp__io-fio-seqread-doublemem-32k-4t-xfs
        * global-dhp__io-fio-seqread-doublemem-4k-4t-xfs
* netperf on loopback over TCP
        * global-dhp__network-netperf-unbound

These are benchmarks which exhibit a variation in their performance;
you'll see the magnitude of the changes is moderate and it's highly variable
from machine to machine. All percentages refer to the v4.18 baseline. In
more than one case the Haswell machine seems to prefer v1 to v2.

* xfsrepair
        * global-dhp__io-xfsrepair-xfs

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      2% worse        2% worse
        80x-BROADWELL-NUMA  1% worse        1% worse
        48x-HASWELL-NUMA    1% worse        1% worse

* sqlite (insert operations on xfs)
        * global-dhp__db-sqlite-insert-medium-xfs

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      no change       no change
        80x-BROADWELL-NUMA  2% worse        3% worse
        48x-HASWELL-NUMA    no change       no change

* netperf on loopback over UDP
        * global-dhp__network-netperf-unbound

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      no change       6% worse
        80x-BROADWELL-NUMA  1% worse        4% worse
        48x-HASWELL-NUMA    3% better       5% worse

* sockperf on loopback over TCP, mode "under load"
        * global-dhp__network-sockperf-unbound

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      6% worse        no change
        80x-BROADWELL-NUMA  7% better       no change
        48x-HASWELL-NUMA    3% better       2% worse

* sockperf on loopback over UDP, mode "throughput"
        * global-dhp__network-sockperf-unbound

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      1% worse        1% worse
        80x-BROADWELL-NUMA  3% better       2% better
        48x-HASWELL-NUMA    4% better       12% worse

* sockperf on loopback over UDP, mode "under load"
        * global-dhp__network-sockperf-unbound

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      3% worse        1% worse
        80x-BROADWELL-NUMA  10% better      8% better
        48x-HASWELL-NUMA    1% better       no change

* dbench on xfs
        * global-dhp__io-dbench4-async-xfs

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      3% better       4% better
        80x-BROADWELL-NUMA  no change       no change
        48x-HASWELL-NUMA    6% worse        16% worse

* tbench on loopback
        * global-dhp__network-tbench

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      1% worse        10% worse
        80x-BROADWELL-NUMA  1% worse        1% worse
        48x-HASWELL-NUMA    1% worse        2% worse

* schbench
        * global-dhp__workload_schbench

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      1% better       no change
        80x-BROADWELL-NUMA  2% worse        1% worse
        48x-HASWELL-NUMA    2% worse        3% worse

* gitsource on xfs (git unit tests, shell intensive)
        * global-dhp__workload_shellscripts-xfs

                            teo-v1          teo-v2
        8x-SKYLAKE-UMA      no change       no change
        80x-BROADWELL-NUMA  no change       1% better
        48x-HASWELL-NUMA    no change       1% better


Now some more detail. Each benchmark is run in a variety of configurations
(e.g. number of threads, number of concurrent connections and so forth), each
of them giving a result. What you see above is the geometric mean of
"sub-results"; below is the detailed view where there was a regression
larger than 5% (either in v1 or v2, on any of the machines). That means
I'll exclude xfsrepair, sqlite, schbench and the git unit tests "gitsource",
which have negligible swings from the baseline.
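
To make the aggregation concrete, here is a minimal Python sketch of how a
geometric mean of sub-results can be turned into a single percentage. It uses
the tbench throughput figures for 8x-SKYLAKE-UMA from this report (baseline
and teo-v2). The variable names are illustrative only, and MMTests may compute
its summary differently in detail.

```python
from statistics import geometric_mean

# (baseline, teo-v2) throughput pairs from the tbench table for
# 8x-SKYLAKE-UMA in this report, one pair per client count.
tbench_v2 = [
    (620.52, 502.47),
    (1179.05, 820.57),
    (2072.29, 2036.11),
    (4238.96, 4124.59),
    (3515.96, 3500.02),
    (3452.92, 3428.08),
]

# Ratio of patched to baseline for every sub-result, then one geometric mean.
ratios = [patched / baseline for baseline, patched in tbench_v2]
overall = geometric_mean(ratios)
print(f"overall change: {100.0 * (overall - 1.0):+.1f}%")
```

The result is roughly 10% below the baseline, in line with the "10% worse"
summary entry for tbench on that machine. The geometric mean of the ratios
equals the ratio of the two geometric means, so the order of the two steps
does not matter.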

In all tables asterisks indicate a statement about statistical
significance: the difference with baseline has a p-value smaller than 0.1
(small p-values indicate that the difference is real and not just random
noise).
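
The report does not say which statistical test produces the asterisks. Purely
as an illustration of what a p-value below 0.1 means for a handful of
repeated runs, here is a sketch of an exact two-sided permutation test in
Python. The function name and the sample values are made up for the example,
and this is not necessarily the test that MMTests applies.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(baseline, patched):
    """Exact two-sided permutation test on the difference of the means."""
    pooled = list(baseline) + list(patched)
    n = len(baseline)
    m = len(pooled) - n
    observed = abs(mean(patched) - mean(baseline))
    total = sum(pooled)
    hits = 0
    splits = 0
    # Try every way of relabelling n of the pooled samples as "baseline".
    for idx in combinations(range(len(pooled)), n):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs(group_sum / n - (total - group_sum) / m)
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits

# Hypothetical throughput samples, five runs per kernel.
base_runs = [362.1, 362.4, 362.0, 362.6, 362.3]
teo_runs = [318.7, 319.0, 318.9, 318.6, 319.1]
p = permutation_p_value(base_runs, teo_runs)
print(f"p-value: {p:.4f}", "(significant at 0.1)" if p < 0.1 else "")
```

With five runs per side there are only 252 possible relabellings, so the test
can enumerate all of them instead of sampling.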

* netperf on loopback over UDP
        * global-dhp__network-netperf-unbound

NOTES: Test run in mode "stream" over UDP. The varying parameter is the
    message size in bytes. Each measurement is taken 5 times and the
    harmonic mean is reported.
MEASURES: Throughput in MBits/second, both on the sender and on the receiver end.
HIGHER is better

machine: 8x-SKYLAKE-UMA
                          v4.18 baseline         teo-v1                 teo-v2
Hmean     send-64         362.27 (   0.00%)      362.87 (   0.16%)      318.85 * -11.99%*
Hmean     send-128        723.17 (   0.00%)      723.66 (   0.07%)      660.96 *  -8.60%*
Hmean     send-256       1435.24 (   0.00%)     1427.08 (  -0.57%)     1346.22 *  -6.20%*
Hmean     send-1024      5563.78 (   0.00%)     5529.90 *  -0.61%*     5228.28 *  -6.03%*
Hmean     send-2048     10935.42 (   0.00%)    10809.66 *  -1.15%*    10521.14 *  -3.79%*
Hmean     send-3312     16898.66 (   0.00%)    16539.89 *  -2.12%*    16240.87 *  -3.89%*
Hmean     send-4096     19354.33 (   0.00%)    19185.43 (  -0.87%)    18600.52 *  -3.89%*
Hmean     send-8192     32238.80 (   0.00%)    32275.57 (   0.11%)    29850.62 *  -7.41%*
Hmean     send-16384    48146.75 (   0.00%)    49297.23 *   2.39%*    48295.51 (   0.31%)
Hmean     recv-64         362.16 (   0.00%)      362.87 (   0.19%)      318.82 * -11.97%*
Hmean     recv-128        723.01 (   0.00%)      723.66 (   0.09%)      660.89 *  -8.59%*
Hmean     recv-256       1435.06 (   0.00%)     1426.94 (  -0.57%)     1346.07 *  -6.20%*
Hmean     recv-1024      5562.68 (   0.00%)     5529.90 *  -0.59%*     5228.28 *  -6.01%*
Hmean     recv-2048     10934.36 (   0.00%)    10809.66 *  -1.14%*    10519.89 *  -3.79%*
Hmean     recv-3312     16898.65 (   0.00%)    16538.21 *  -2.13%*    16240.86 *  -3.89%*
Hmean     recv-4096     19351.99 (   0.00%)    19183.17 (  -0.87%)    18598.33 *  -3.89%*
Hmean     recv-8192     32238.74 (   0.00%)    32275.13 (   0.11%)    29850.39 *  -7.41%*
Hmean     recv-16384    48146.59 (   0.00%)    49296.23 *   2.39%*    48295.03 (   0.31%)
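
For readers less familiar with the harmonic mean used for the throughput
figures: it is the usual way to average rates, and a single slow run pulls it
down more than it pulls the arithmetic mean. A minimal Python sketch, with
five hypothetical samples that are not measurements from this report:

```python
from statistics import harmonic_mean, mean

# Hypothetical throughput samples (MBits/second) from five runs of one
# message size; the fourth run is deliberately slower than the others.
runs = [362.1, 361.8, 363.0, 340.2, 362.5]

hmean = harmonic_mean(runs)
amean = mean(runs)
print(f"harmonic mean:   {hmean:.2f}")
print(f"arithmetic mean: {amean:.2f}")
```

The harmonic mean is never larger than the arithmetic mean, and the gap grows
with the spread of the samples.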

* sockperf on loopback over TCP, mode "under load"
        * global-dhp__network-sockperf-unbound

NOTES: Test run in mode "under load" over TCP. Parameters are message size
    and transmission rate.
MEASURES: Round-trip time in microseconds
LOWER is better

machine: 8x-SKYLAKE-UMA
                                       v4.18 baseline         teo-v1                 teo-v2
Amean        size-14-rate-10000        36.43 (   0.00%)       36.86 (  -1.17%)       20.24 (  44.44%)
Amean        size-14-rate-24000        17.78 (   0.00%)       17.71 (   0.36%)       18.54 (  -4.29%)
Amean        size-14-rate-50000        20.53 (   0.00%)       22.29 (  -8.58%)       16.16 (  21.30%)
Amean        size-100-rate-10000       21.22 (   0.00%)       23.41 ( -10.35%)       33.04 ( -55.73%)
Amean        size-100-rate-24000       17.81 (   0.00%)       21.09 ( -18.40%)       14.39 (  19.18%)
Amean        size-100-rate-50000       12.31 (   0.00%)       19.65 ( -59.64%)       15.11 ( -22.77%)
Amean        size-300-rate-10000       34.21 (   0.00%)       35.30 (  -3.19%)       34.20 (   0.05%)
Amean        size-300-rate-24000       24.52 (   0.00%)       26.00 (  -6.04%)       27.42 ( -11.81%)
Amean        size-300-rate-50000       20.20 (   0.00%)       20.39 (  -0.95%)       17.83 (  11.73%)
Amean        size-500-rate-10000       21.56 (   0.00%)       21.31 (   1.15%)       29.32 ( -35.98%)
Amean        size-500-rate-24000       30.58 (   0.00%)       27.41 (  10.38%)       27.21 (  11.03%)
Amean        size-500-rate-50000       19.46 (   0.00%)       22.48 ( -15.55%)       16.29 (  16.30%)
Amean        size-850-rate-10000       35.89 (   0.00%)       35.56 (   0.91%)       23.84 (  33.57%)
Amean        size-850-rate-24000       29.11 (   0.00%)       28.18 (   3.20%)       17.44 (  40.08%)
Amean        size-850-rate-50000       13.55 (   0.00%)       18.05 ( -33.26%)       21.30 ( -57.20%)
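
A note on reading the percentages: in every table a positive value means
"better than baseline", whether the metric is a throughput (higher is better)
or a latency (lower is better). The small Python sketch below shows that
convention on two rows from the tables above; the helper name is made up for
this example. Since it works from the rounded figures shown here, the last
digit may not always match the reported percentage.

```python
def percent_change(baseline, value, higher_is_better):
    """Signed change versus baseline, where positive always means 'better'."""
    delta = value - baseline if higher_is_better else baseline - value
    return 100.0 * delta / baseline

# netperf UDP, send-64 on 8x-SKYLAKE-UMA: throughput, higher is better.
netperf_send_64 = percent_change(362.27, 318.85, higher_is_better=True)
# sockperf TCP "under load", size-14-rate-10000: latency, lower is better.
sockperf_14_10000 = percent_change(36.43, 20.24, higher_is_better=False)

print(f"{netperf_send_64:.2f}%")
print(f"{sockperf_14_10000:.2f}%")
```

These two come out at about -11.99% and 44.44%, matching the third column of
the corresponding rows.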

* sockperf on loopback over UDP, mode "throughput"
        * global-dhp__network-sockperf-unbound

NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
    message size.
MEASURES: Throughput, in MBits/second
HIGHER is better

machine: 48x-HASWELL-NUMA
                    v4.18 baseline         teo-v1                 teo-v2
Hmean     14        48.16 (   0.00%)       50.94 *   5.77%*       42.50 * -11.77%*
Hmean     100      346.77 (   0.00%)      358.74 *   3.45%*      303.31 * -12.53%*
Hmean     300     1018.06 (   0.00%)     1053.75 *   3.51%*      895.55 * -12.03%*
Hmean     500     1693.07 (   0.00%)     1754.62 *   3.64%*     1489.61 * -12.02%*
Hmean     850     2853.04 (   0.00%)     2948.73 *   3.35%*     2473.50 * -13.30%*

* dbench on xfs
        * global-dhp__io-dbench4-async-xfs

NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
MEASURES: latency (millisecs)
LOWER is better

machine: 48x-HASWELL-NUMA
                    v4.18 baseline         teo-v1                 teo-v2
Amean      1        37.15 (   0.00%)       50.10 ( -34.86%)       39.02 (  -5.03%)
Amean      2        43.75 (   0.00%)       45.50 (  -4.01%)       44.36 (  -1.39%)
Amean      4        54.42 (   0.00%)       58.85 (  -8.15%)       58.17 (  -6.89%)
Amean      8        75.72 (   0.00%)       74.25 (   1.94%)       82.76 (  -9.30%)
Amean      16      116.56 (   0.00%)      119.88 (  -2.85%)      164.14 ( -40.82%)
Amean      32      570.02 (   0.00%)      561.92 (   1.42%)      681.94 ( -19.63%)
Amean      64     3185.20 (   0.00%)     3291.80 (  -3.35%)     4337.43 ( -36.17%)

* tbench on loopback
        * global-dhp__network-tbench

NOTES: networking counterpart of dbench. Varies the number of clients up to NUMCPUS*4
MEASURES: Throughput, MB/sec
HIGHER is better

machine: 8x-SKYLAKE-UMA
                         v4.18 baseline         teo-v1                 teo-v2
Hmean     mb/sec-1       620.52 (   0.00%)      613.98 *  -1.05%*      502.47 * -19.03%*
Hmean     mb/sec-2      1179.05 (   0.00%)     1112.84 *  -5.62%*      820.57 * -30.40%*
Hmean     mb/sec-4      2072.29 (   0.00%)     2040.55 *  -1.53%*     2036.11 *  -1.75%*
Hmean     mb/sec-8      4238.96 (   0.00%)     4205.01 *  -0.80%*     4124.59 *  -2.70%*
Hmean     mb/sec-16     3515.96 (   0.00%)     3536.23 *   0.58%*     3500.02 *  -0.45%*
Hmean     mb/sec-32     3452.92 (   0.00%)     3448.94 *  -0.12%*     3428.08 *  -0.72%*


Happy to answer any questions on the benchmarks or the methods used to
collect/report data.

Something I'd like to do now is verify that "teo"'s predictions are better
than "menu"'s; I'll probably use systemtap to make some histograms of idle
times versus what idle state was chosen -- that'd be enough to compare the
two governors.
After that it would be nice to somehow know where timers came from; i.e. if
I see that residences in a given state are consistently shorter than
they're supposed to be, it would be interesting to see who set the timer
that causes the wakeup. But... I'm not sure I know how to do that :) Do
you have a strategy to track down the origin of timers/interrupts? Is there
any script you're using to evaluate teo that you can share?

Giovanni Gherdovich