Performance of low-cpu utilisation benchmark regressed severely since 4.6

From: Mel Gorman
Date: Mon Apr 10 2017 - 04:41:25 EST


Hi Rafael,

Since kernel 4.6, performance of low CPU intensity workloads has dropped
severely. netperf UDP_STREAM, which has about 15-20% CPU utilisation, has
regressed about 10% relative to 4.4, and TCP_STREAM has regressed about
6-9%. sockperf has similar utilisation figures but I won't go into these in
detail as they were running over loopback and are sensitive to a lot of
factors.

It's far more obvious when looking at the git test suite and the length
of time it takes to run. This is a shell script and git intensive workload
whose CPU utilisation is very low but which is less sensitive to multiple
factors than netperf and sockperf.

Bisection indicates that the regression started with commit ffb810563c0c
("intel_pstate: Avoid getting stuck in high P-states when idle"). However,
it's no longer the only relevant commit, as the following results show:


                               4.4.0               4.5.0               4.6.0          4.11.0-rc5          4.11.0-rc5
                             vanilla             vanilla             vanilla             vanilla         revert-v1r1
User    min       1786.44 (   0.00%)  1613.72 (   9.67%)  3302.19 ( -84.85%)  3487.46 ( -95.22%)  2701.84 ( -51.24%)
User    mean      1788.35 (   0.00%)  1616.47 (   9.61%)  3304.14 ( -84.76%)  3488.12 ( -95.05%)  2715.80 ( -51.86%)
User    stddev       1.43 (   0.00%)     1.75 ( -21.84%)     1.12 (  22.10%)     0.57 (  60.14%)     7.13 (-397.62%)
User    coeffvar     0.08 (   0.00%)     0.11 ( -34.80%)     0.03 (  57.83%)     0.02 (  79.56%)     0.26 (-227.68%)
User    max       1790.14 (   0.00%)  1618.73 (   9.58%)  3305.40 ( -84.64%)  3489.01 ( -94.90%)  2721.66 ( -52.04%)
System  min        218.44 (   0.00%)   202.58 (   7.26%)   407.51 ( -86.55%)   269.92 ( -23.57%)   196.85 (   9.88%)
System  mean       219.05 (   0.00%)   203.62 (   7.04%)   408.38 ( -86.43%)   270.83 ( -23.64%)   197.99 (   9.61%)
System  stddev       0.60 (   0.00%)     0.64 (  -6.30%)     0.77 ( -28.89%)     0.59 (   1.47%)     0.87 ( -44.72%)
System  coeffvar     0.27 (   0.00%)     0.31 ( -14.35%)     0.19 (  30.86%)     0.22 (  20.31%)     0.44 ( -60.11%)
System  max        219.92 (   0.00%)   204.36 (   7.08%)   409.81 ( -86.35%)   271.56 ( -23.48%)   199.07 (   9.48%)
Elapsed min       2017.05 (   0.00%)  1827.70 (   9.39%)  3701.00 ( -83.49%)  3749.00 ( -85.87%)  2904.36 ( -43.99%)
Elapsed mean      2018.83 (   0.00%)  1830.72 (   9.32%)  3703.20 ( -83.43%)  3750.20 ( -85.76%)  2919.33 ( -44.60%)
Elapsed stddev       1.79 (   0.00%)     2.18 ( -21.93%)     1.47 (  17.90%)     0.75 (  58.20%)     7.66 (-328.12%)
Elapsed coeffvar     0.09 (   0.00%)     0.12 ( -34.46%)     0.04 (  55.24%)     0.02 (  77.50%)     0.26 (-196.07%)
Elapsed max       2021.41 (   0.00%)  1833.91 (   9.28%)  3705.00 ( -83.29%)  3751.00 ( -85.56%)  2926.13 ( -44.76%)
CPU     min         99.00 (   0.00%)    99.00 (   0.00%)   100.00 (  -1.01%)   100.00 (  -1.01%)    99.00 (   0.00%)
CPU     mean        99.00 (   0.00%)    99.00 (   0.00%)   100.00 (  -1.01%)   100.00 (  -1.01%)    99.00 (   0.00%)
CPU     stddev       0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)
CPU     coeffvar     0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)
CPU     max         99.00 (   0.00%)    99.00 (   0.00%)   100.00 (  -1.01%)   100.00 (  -1.01%)    99.00 (   0.00%)

                 4.4.0         4.5.0         4.6.0    4.11.0-rc5    4.11.0-rc5
               vanilla       vanilla       vanilla       vanilla   revert-v1r1
User          10819.50       9790.02      19914.22      21021.12      16392.80
System         1327.78       1234.01       2465.45       1635.85       1197.03
Elapsed       12138.54      11008.49      22247.35      22528.79      17543.60

This shows the user and system CPU usage as well as the elapsed time
to run a single iteration of the git test suite, with total times in the
second table. The test does a warmup run whose results are discarded and
then 5 iterations. Overall, the run takes almost 3 hours longer moving from
4.4 to 4.11-rc5 (12139 versus 22529 seconds elapsed), and reverting the
commit does not fully address the problem.
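
For anyone cross-checking the numbers, here is a minimal sketch of how the
bracketed percentages in the first table can be reproduced. It assumes they
are the gain relative to the 4.4.0 column for a lower-is-better metric; the
helper name gain_pct is hypothetical. The results agree with the table to
within rounding, since the table was computed from unrounded means.

```python
# Sketch: reproduce the bracketed percentages for the "Elapsed mean" row.
# Assumption: the percentage is the gain relative to the 4.4.0 baseline,
# so a shorter time is positive and a longer time is negative.

def gain_pct(baseline, value):
    """Percentage improvement of value over baseline (lower is better)."""
    return (baseline - value) / baseline * 100.0

# Elapsed mean in seconds, copied from the table.
elapsed_mean = {
    "4.4.0": 2018.83,
    "4.5.0": 1830.72,
    "4.6.0": 3703.20,
    "4.11.0-rc5": 3750.20,
    "4.11.0-rc5-revert": 2919.33,
}

baseline = elapsed_mean["4.4.0"]
for kernel, seconds in elapsed_mean.items():
    print(f"{kernel:>18} {seconds:8.2f} ({gain_pct(baseline, seconds):7.2f}%)")
```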

The test shows it took 2018 seconds on average to complete a single iteration
on 4.4 and 3750 seconds to complete on 4.11-rc5. The major drop is between
4.5 and 4.6 where it went from 1830 seconds to 3703 seconds and has not
recovered. A bisection was clean and pointed to the commit mentioned above.

The results show that the commit is not the only source, as a revert (last
column) doesn't fully fix the damage, although elapsed time goes from 3750
seconds (4.11-rc5 vanilla) to 2919 seconds (with the revert).
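
To put a number on "doesn't fix the damage", the following sketch splits the
per-iteration slowdown into the part the revert recovers and the part that
remains. It only uses the Elapsed mean values from the table; the variable
names are illustrative.

```python
# Sketch: how much of the 4.4 -> 4.11-rc5 slowdown the revert recovers.
# Elapsed mean in seconds, copied from the table.
baseline_44 = 2018.83   # 4.4.0 vanilla
vanilla_411 = 3750.20   # 4.11.0-rc5 vanilla
revert_411 = 2919.33    # 4.11.0-rc5 with the commit reverted

slowdown = vanilla_411 - baseline_44    # total regression per iteration
recovered = vanilla_411 - revert_411    # part undone by the revert
remaining = revert_411 - baseline_44    # part still present after the revert

print(f"slowdown  {slowdown:8.2f}s")
print(f"recovered {recovered:8.2f}s ({recovered / slowdown:.0%})")
print(f"remaining {remaining:8.2f}s ({remaining / slowdown:.0%})")
```

On these figures the revert recovers a bit under half of the regression,
which is consistent with other changes since 4.6 contributing as well.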

The machine is a relatively old desktop-class machine with an i7-3770 CPU @
3.40GHz (IvyBridge). It is definitely using intel_pstate:

analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1.60 GHz - 3.90 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 1.60 GHz and 3.90 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 1.60 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    3700 MHz max turbo 4 active cores
    3800 MHz max turbo 3 active cores
    3900 MHz max turbo 2 active cores
    3900 MHz max turbo 1 active cores

No special boot parameters are specified.
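
The driver and governor can also be checked directly from sysfs when
reproducing this on another machine. This is a minimal sketch that assumes
the standard cpufreq layout under /sys/devices/system/cpu/cpuN/cpufreq; the
helper name read_cpufreq and its base parameter are hypothetical, and the
frequency attributes are reported in kHz.

```python
from pathlib import Path

# Standard cpufreq policy attributes; frequencies are reported in kHz.
FIELDS = (
    "scaling_driver",
    "scaling_governor",
    "cpuinfo_min_freq",
    "cpuinfo_max_freq",
    "scaling_cur_freq",
)

def read_cpufreq(cpu=0, base="/sys/devices/system/cpu"):
    """Return the cpufreq attributes for one CPU as a dict of strings.

    Attributes that are missing or unreadable are returned as None.
    """
    policy_dir = Path(base) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in FIELDS:
        try:
            info[name] = (policy_dir / name).read_text().strip()
        except OSError:
            info[name] = None
    return info

if __name__ == "__main__":
    for name, value in read_cpufreq(0).items():
        print(f"{name}: {value}")
```

On the machine described here, scaling_driver would be expected to read
intel_pstate and scaling_governor to read powersave.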

I didn't poke around too much as the last time I tried, there were too
many conflicting opinions and requirements so here are the observations.

CPU usage is roughly 10% for the full duration of the test.
Context switch and interrupt activity is not altered by the revert, although it has changed substantially since 4.4.
turbostat confirms that busy time is roughly 10% across the whole machine.
turbostat shows that average MHz is roughly halved in 4.11-rc5-vanilla versus 4.4.
turbostat shows that average MHz is slightly higher with the revert applied.
The benchmark in question is doing IO but not a lot: mostly below 100K/sec writes with small bursts of 6000K/sec.

CONFIG_CPU_FREQ_GOV_SCHEDUTIL is *NOT* set. This is deliberate as when
I evaluated schedutil shortly after it was merged, I found that at best
it performed comparably with the old code across a range of workloads
and machines while having higher system CPU usage. I know a lot of
the recent work has been schedutil-focused, so I could find no patch in
recent discussions that might be relevant to this problem. I've not looked
at schedutil recently, but not everyone will be switching to it so the old
setup is still relevant.

While I accept the logic that CPUs should not remain at the highest
frequency if completely idle for prolonged periods of time, it appears to
be too aggressive on older CPUs. Low utilisation tasks should still be able
to get to the higher frequencies for the short bursts they are active for.

I hope the data and the bisection are enough to give some ideas on how
it can be addressed without impacting Haswell and Jorg's setup that the
commit was originally intended for.

--
Mel Gorman
SUSE Labs