Re: [PATCH 0/4] Intel_pstate: HWP Dynamic performance boost
From: Rafael J. Wysocki
Date: Tue Jun 12 2018 - 11:05:17 EST
On Tuesday, June 5, 2018 11:42:38 PM CEST Srinivas Pandruvada wrote:
> v1 (Compared to RFC/RFT v3)
> - Address minor coding suggestions for intel_pstate
> - Add the SKL desktop model used in some Xeons
>
> Tested-by: Giovanni Gherdovich <ggherdovich@xxxxxxx>
>
> This series has an overall positive performance impact on IO, both on xfs and
> ext4, and I'd be very happy if it lands in v4.18. You dropped the migration
> optimization from v1 to v2 following the reviewers' suggestion; I'm looking
> forward to testing that part too, so please add me to CC when you resend it.
>
> I've tested your series on a single socket Xeon E3-1240 v5 (Skylake, 4 cores /
> 8 threads) with SSD storage. The platform is a Dell PowerEdge R230.
>
> The benchmarks used are a mix of I/O-intensive workloads on ext4 and xfs
> (dbench4, sqlite, pgbench in read/write and read-only configurations, Flexible
> IO aka FIO, etc) and scheduler stressors just to check that everything is okay
> in that department too (hackbench, pipetest, schbench, sockperf on localhost
> both in "throughput" and "under-load" mode, netperf on localhost, etc). There
> is also some HPC with the NAS Parallel Benchmark: when it uses openMPI as the
> IPC mechanism it ends up being write-intensive, which makes it a good
> experiment even if HPC people aren't exactly the target audience for a
> frequency governor.
>
> The large improvements are in areas you already highlighted in your cover
> letter (dbench4, sqlite, and pgbench read/write too; very impressive,
> honestly). Minor wins are also observed in sockperf and in running the git
> unit tests (gitsource below). The scheduler stressors end up, as expected, in
> the "neutral" category, where you'll also find FIO (which, given the other
> results, I'd have expected to improve at least a little). Also marked
> "neutral" are results where statistical significance wasn't reached (2
> standard deviations, roughly equivalent to a 0.05 p-value) even if they
> showed some difference in one direction or the other. In the "small losses"
> section I found hackbench run with processes (not threads) and pipes (not
> sockets), which I report for due diligence, although looking at the raw
> numbers it's more of a mixed bag than a real loss, and the NAS
> high-performance computing benchmark when it uses openMP (as opposed to
> openMPI) for IPC -- but then again, HPC people often run their machines at
> full speed all the time.
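>
> For reference, here is a minimal, self-contained sketch of the "2 standard
> deviations" criterion mentioned above; the helper name and the pooled-deviation
> choice are illustrative and not taken from MMTests (the sample figures come
> from the dbench4 table below):
>
> #include <math.h>
> #include <stdio.h>
>
> /*
>  * Toy significance check: treat a difference in means as significant
>  * only if it exceeds twice the pooled standard deviation (roughly a
>  * 0.05 p-value). Illustrative only, not the MMTests reporting code.
>  * Compile with -lm.
>  */
> static int significant(double mean_a, double sd_a, double mean_b, double sd_b)
> {
>     double pooled = sqrt((sd_a * sd_a + sd_b * sd_b) / 2.0);
>
>     return fabs(mean_a - mean_b) > 2.0 * pooled;
> }
>
> int main(void)
> {
>     /* dbench4/ext4, 1 client: vanilla vs. hwp-boost (see table below). */
>     printf("%s\n", significant(28.49, 6.70, 19.68, 3.24) ?
>            "significant" : "neutral");
>     return 0;
> }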
>
> At the bottom of this message you'll find some directions if you want to run
> some tests yourself using the same framework I used, MMTests from
> https://github.com/gormanm/mmtests (we store a fair amount of benchmark
> parametrizations up there).
>
> Large wins:
>
> - dbench4: +20% on ext4,
> +14% on xfs (always asynch IO)
> - sqlite (insert): +9% on both ext4 and xfs
> - pgbench (read/write): +9% on ext4,
> +10% on xfs
>
> Moderate wins:
>
> - sockperf (type: under-load, localhost): +1% with TCP,
> +5% with UDP
> - gitsource (git unit tests, shell intensive): +3% on ext4
> - NAS Parallel Benchmark (HPC, using openMPI, on xfs): +1%
> - tbench4 (network part of dbench4, localhost): +1%
>
> Neutral:
>
> - pgbench (read-only) on ext4 and xfs
> - siege
> - netperf (streaming and round-robin) with TCP and UDP
> - hackbench (sockets/process, sockets/thread and pipes/thread)
> - pipetest
> - Linux kernel build
> - schbench
> - sockperf (type: throughput) with TCP and UDP
> - git unit tests on xfs
> - FIO (both random and seq. read, both random and seq. write)
> on ext4 and xfs, async IO
>
> Moderate losses:
>
> - hackbench (pipes/process): -10%
> - NAS Parallel Benchmark with openMP: -1%
>
>
> Each benchmark is run with a variety of configuration parameters (e.g. number
> of threads, number of clients, etc); to reach a final "score" the geometric
> mean is used (with a few exceptions depending on the type of benchmark).
> Detailed results follow. Amean, Hmean and Gmean are respectively the
> arithmetic, harmonic and geometric means.
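>
> As a reference for how per-benchmark scores are aggregated, here is a minimal
> sketch of the three means; it is illustrative only and not the MMTests
> reporting code (the sample values are hypothetical):
>
> #include <math.h>
> #include <stdio.h>
>
> /* Arithmetic, harmonic and geometric means (Amean, Hmean, Gmean). */
> static double amean(const double *v, int n)
> {
>     double s = 0.0;
>
>     for (int i = 0; i < n; i++)
>         s += v[i];
>     return s / n;
> }
>
> static double hmean(const double *v, int n)
> {
>     double s = 0.0;
>
>     for (int i = 0; i < n; i++)
>         s += 1.0 / v[i];
>     return n / s;
> }
>
> static double gmean(const double *v, int n)
> {
>     double s = 0.0;
>
>     for (int i = 0; i < n; i++)
>         s += log(v[i]);
>     return exp(s / n);
> }
>
> int main(void)
> {
>     /* Hypothetical per-configuration scores; compile with -lm. */
>     const double scores[] = { 2692.19, 5218.93, 7332.68, 7462.03 };
>     int n = sizeof(scores) / sizeof(scores[0]);
>
>     printf("Amean %.2f Hmean %.2f Gmean %.2f\n",
>            amean(scores, n), hmean(scores, n), gmean(scores, n));
>     return 0;
> }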
>
> For brevity I won't report all tables, only those for "large wins" and
> "moderate losses". Note that I'm not overly worried about the hackbench-pipes
> situation, as we've studied it in the past and determined that this
> configuration is a particularly weak one: time is mostly spent on contention
> and the scheduler code path isn't exercised. See the comment in the file
> configs/config-global-dhp__scheduler-unbound in MMTests for a brief
> description of the issue.
>
> DBENCH4
> =======
>
> NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
> MMTESTS CONFIG: global-dhp__io-dbench4-async-{ext4, xfs}
> MEASURES: latency (millisecs)
> LOWER is better
>
> EXT4
> 4.16.0 4.16.0
> vanilla hwp-boost
> Amean 1 28.49 ( 0.00%) 19.68 ( 30.92%)
> Amean 2 26.70 ( 0.00%) 25.59 ( 4.14%)
> Amean 4 54.59 ( 0.00%) 43.56 ( 20.20%)
> Amean 8 91.19 ( 0.00%) 77.56 ( 14.96%)
> Amean 64 538.09 ( 0.00%) 438.67 ( 18.48%)
> Stddev 1 6.70 ( 0.00%) 3.24 ( 51.66%)
> Stddev 2 4.35 ( 0.00%) 3.57 ( 17.85%)
> Stddev 4 7.99 ( 0.00%) 7.24 ( 9.29%)
> Stddev 8 17.51 ( 0.00%) 15.80 ( 9.78%)
> Stddev 64 49.54 ( 0.00%) 46.98 ( 5.17%)
>
> XFS
> 4.16.0 4.16.0
> vanilla hwp-boost
> Amean 1 21.88 ( 0.00%) 16.03 ( 26.75%)
> Amean 2 19.72 ( 0.00%) 19.82 ( -0.50%)
> Amean 4 37.55 ( 0.00%) 29.52 ( 21.38%)
> Amean 8 56.73 ( 0.00%) 51.83 ( 8.63%)
> Amean 64 808.80 ( 0.00%) 698.12 ( 13.68%)
> Stddev 1 6.29 ( 0.00%) 2.33 ( 62.99%)
> Stddev 2 3.12 ( 0.00%) 2.26 ( 27.73%)
> Stddev 4 7.56 ( 0.00%) 5.88 ( 22.28%)
> Stddev 8 14.15 ( 0.00%) 12.49 ( 11.71%)
> Stddev 64 380.54 ( 0.00%) 367.88 ( 3.33%)
>
> SQLITE
> ======
>
> NOTES: SQL insert test on a table that will be 2M in size.
> MMTESTS CONFIG: global-dhp__db-sqlite-insert-medium-{ext4, xfs}
> MEASURES: transactions per second
> HIGHER is better
>
> EXT4
> 4.16.0 4.16.0
> vanilla hwp-boost
> Hmean Trans 2098.79 ( 0.00%) 2292.16 ( 9.21%)
> Stddev Trans 78.79 ( 0.00%) 95.73 ( -21.50%)
>
> XFS
> 4.16.0 4.16.0
> vanilla hwp-boost
> Hmean Trans 1890.27 ( 0.00%) 2058.62 ( 8.91%)
> Stddev Trans 52.54 ( 0.00%) 29.56 ( 43.73%)
>
> PGBENCH-RW
> ==========
>
> NOTES: packaged with Postgres. Varies the number of threads up to NUMCPUS. The
> workload is scaled so that its approximate size is 80% of the database
> shared buffer, which itself is 20% of RAM. The page cache is not flushed
> after the database is populated, so the test starts cache-hot.
> MMTESTS CONFIG: global-dhp__db-pgbench-timed-rw-small-{ext4, xfs}
> MEASURES: transactions per second
> HIGHER is better
>
> EXT4
> 4.16.0 4.16.0
> vanilla hwp-boost
> Hmean 1 2692.19 ( 0.00%) 2660.98 ( -1.16%)
> Hmean 4 5218.93 ( 0.00%) 5610.10 ( 7.50%)
> Hmean 7 7332.68 ( 0.00%) 8378.24 ( 14.26%)
> Hmean 8 7462.03 ( 0.00%) 8713.36 ( 16.77%)
> Stddev 1 231.85 ( 0.00%) 257.49 ( -11.06%)
> Stddev 4 681.11 ( 0.00%) 312.64 ( 54.10%)
> Stddev 7 1072.07 ( 0.00%) 730.29 ( 31.88%)
> Stddev 8 1472.77 ( 0.00%) 1057.34 ( 28.21%)
>
> XFS
> 4.16.0 4.16.0
> vanilla hwp-boost
> Hmean 1 2675.02 ( 0.00%) 2661.69 ( -0.50%)
> Hmean 4 5049.45 ( 0.00%) 5601.45 ( 10.93%)
> Hmean 7 7302.18 ( 0.00%) 8348.16 ( 14.32%)
> Hmean 8 7596.83 ( 0.00%) 8693.29 ( 14.43%)
> Stddev 1 225.41 ( 0.00%) 246.74 ( -9.46%)
> Stddev 4 761.33 ( 0.00%) 334.77 ( 56.03%)
> Stddev 7 1093.93 ( 0.00%) 811.30 ( 25.84%)
> Stddev 8 1465.06 ( 0.00%) 1118.81 ( 23.63%)
>
> HACKBENCH
> =========
>
> NOTES: Varies the number of groups between 1 and NUMCPUS*4
> MMTESTS CONFIG: global-dhp__scheduler-unbound
> MEASURES: time (seconds)
> LOWER is better
>
> 4.16.0 4.16.0
> vanilla hwp-boost
> Amean 1 0.8350 ( 0.00%) 1.1577 ( -38.64%)
> Amean 3 2.8367 ( 0.00%) 3.7457 ( -32.04%)
> Amean 5 6.7503 ( 0.00%) 5.7977 ( 14.11%)
> Amean 7 7.8290 ( 0.00%) 8.0343 ( -2.62%)
> Amean 12 11.0560 ( 0.00%) 11.9673 ( -8.24%)
> Amean 18 15.2603 ( 0.00%) 15.5247 ( -1.73%)
> Amean 24 17.0283 ( 0.00%) 17.9047 ( -5.15%)
> Amean 30 19.9193 ( 0.00%) 23.4670 ( -17.81%)
> Amean 32 21.4637 ( 0.00%) 23.4097 ( -9.07%)
> Stddev 1 0.0636 ( 0.00%) 0.0255 ( 59.93%)
> Stddev 3 0.1188 ( 0.00%) 0.0235 ( 80.22%)
> Stddev 5 0.0755 ( 0.00%) 0.1398 ( -85.13%)
> Stddev 7 0.2778 ( 0.00%) 0.1634 ( 41.17%)
> Stddev 12 0.5785 ( 0.00%) 0.1030 ( 82.19%)
> Stddev 18 1.2099 ( 0.00%) 0.7986 ( 33.99%)
> Stddev 24 0.2057 ( 0.00%) 0.7030 (-241.72%)
> Stddev 30 1.1303 ( 0.00%) 0.7654 ( 32.28%)
> Stddev 32 0.2032 ( 0.00%) 3.1626 (-1456.69%)
>
> NAS PARALLEL BENCHMARK, C-CLASS (w/ openMP)
> ===========================================
>
> NOTES: The various computational kernels are run separately; see
> https://www.nas.nasa.gov/publications/npb.html for the list of tasks (IS =
> Integer Sort, EP = Embarrassingly Parallel, etc)
> MMTESTS CONFIG: global-dhp__nas-c-class-omp-full
> MEASURES: time (seconds)
> LOWER is better
>
> 4.16.0 4.16.0
> vanilla hwp-boost
> Amean bt.C 169.82 ( 0.00%) 170.54 ( -0.42%)
> Stddev bt.C 1.07 ( 0.00%) 0.97 ( 9.34%)
> Amean cg.C 41.81 ( 0.00%) 42.08 ( -0.65%)
> Stddev cg.C 0.06 ( 0.00%) 0.03 ( 48.24%)
> Amean ep.C 26.63 ( 0.00%) 26.47 ( 0.61%)
> Stddev ep.C 0.37 ( 0.00%) 0.24 ( 35.35%)
> Amean ft.C 38.17 ( 0.00%) 38.41 ( -0.64%)
> Stddev ft.C 0.33 ( 0.00%) 0.32 ( 3.78%)
> Amean is.C 1.49 ( 0.00%) 1.40 ( 6.02%)
> Stddev is.C 0.20 ( 0.00%) 0.16 ( 19.40%)
> Amean lu.C 217.46 ( 0.00%) 220.21 ( -1.26%)
> Stddev lu.C 0.23 ( 0.00%) 0.22 ( 0.74%)
> Amean mg.C 18.56 ( 0.00%) 18.80 ( -1.31%)
> Stddev mg.C 0.01 ( 0.00%) 0.01 ( 22.54%)
> Amean sp.C 293.25 ( 0.00%) 296.73 ( -1.19%)
> Stddev sp.C 0.10 ( 0.00%) 0.06 ( 42.67%)
> Amean ua.C 170.74 ( 0.00%) 172.02 ( -0.75%)
> Stddev ua.C 0.28 ( 0.00%) 0.31 ( -12.89%)
>
> HOW TO REPRODUCE
> ================
>
> To install MMTests, clone the git repo at
> https://github.com/gormanm/mmtests.git
>
> To run a config (i.e. a set of benchmarks, such as
> config-global-dhp__nas-c-class-omp-full), use the command
> ./run-mmtests.sh --config configs/$CONFIG $MNEMONIC-NAME
> from the top-level directory; the benchmark source will be downloaded from its
> canonical internet location, compiled and run.
>
> To compare results from two runs, use
> ./bin/compare-mmtests.pl --directory ./work/log \
> --benchmark $BENCHMARK-NAME \
> --names $MNEMONIC-NAME-1,$MNEMONIC-NAME-2
> from the top-level directory.
>
> ==================
> From RFC Series:
> v3
> - Removed atomic bit operation as suggested.
> - Added description of contention with user space.
> - Removed the HWP cache and boost utility function patch and merged it with
> the util callback patch. This way any value that is set is used somewhere.
>
> Waiting for test results from Mel Gorman, who is the original reporter.
>
> v2
> This is a much simpler version than the previous one and only considers IO
> boost, using the existing mechanism. There is no change in this series
> beyond the intel_pstate driver.
>
> Once PeterZ finishes his work on frequency invariance, I will revisit
> the thread migration optimization in HWP mode.
>
> Other changes:
> - Gradual boost instead of a single step, as suggested by PeterZ.
> - Addressed cross-CPU synchronization concerns identified by Rafael.
> - Split the patch for HWP MSR value caching, as suggested by PeterZ.
>
> Not changed as suggested:
> There is no architectural way to identify platforms with per-core
> P-states, so the feature still has to be enabled based on the CPU model.
>
> -----------
> v1
>
> This series tries to address some performance concerns, particularly with IO
> workloads (reported by Mel Gorman), when HWP is used with the intel_pstate
> powersave policy.
>
> Background
> HWP performance can be controlled by user space using the sysfs interface for
> max/min frequency limits and energy performance preference (EPP) settings.
> Based on workload characteristics these can be adjusted from user space, but
> the limits are not changed dynamically by the kernel based on the workload.
>
> By default HWP uses an energy performance preference value of 0x80 on the
> majority of platforms (the scale is 0-255, where 0 is maximum performance and
> 255 is minimum). This value offers the best performance/watt, and for the
> majority of server workloads performance doesn't suffer. Users also always
> have the option to use the performance policy of intel_pstate to get the best
> performance, but users tend to run with the out-of-the-box configuration,
> which is the powersave policy on most distros.
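>
> For context, the EPP value in effect can be inspected from user space; below
> is a minimal sketch that assumes the usual intel_pstate active-mode (HWP)
> cpufreq sysfs layout -- the exact path may differ depending on kernel version
> and configuration:
>
> #include <stdio.h>
>
> /*
>  * Print the current energy_performance_preference for CPU0. Assumes
>  * intel_pstate in active (HWP) mode; the sysfs path is an assumption
>  * and may differ on other setups.
>  */
> int main(void)
> {
>     const char *path =
>         "/sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference";
>     char buf[64];
>     FILE *f = fopen(path, "r");
>
>     if (!f) {
>         perror("fopen");
>         return 1;
>     }
>     if (fgets(buf, sizeof(buf), f))
>         printf("cpu0 EPP: %s", buf);
>     fclose(f);
>     return 0;
> }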
>
> In some cases it is possible to dynamically adjust performance, for example
> when a CPU is woken up due to IO completion or a thread migrates to a new CPU.
> In these cases the HWP algorithm will take some time to build up utilization
> and ramp up P-states, which may result in lower performance for some IO
> workloads and for workloads that tend to migrate. The idea of this patch
> series is to temporarily boost performance dynamically in these cases. This
> applies only when the user is using the powersave policy, not the performance
> policy.
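>
> To make the boost/decay idea concrete, here is a small user-space model of the
> behaviour described above; it is a conceptual sketch only -- the structure,
> step sizes and decay rule are made up for illustration and are not the
> intel_pstate code from these patches:
>
> #include <stdio.h>
>
> /*
>  * Conceptual model of dynamic HWP boosting: on an IO wakeup the
>  * requested minimum performance is stepped up toward the maximum,
>  * and it decays back to the unboosted level once wakeups stop.
>  * All names and step sizes are illustrative, not from the driver.
>  */
> struct cpu_model {
>     int min_perf;   /* currently requested HWP minimum */
>     int base_perf;  /* normal (unboosted) minimum */
>     int max_perf;   /* boost ceiling */
> };
>
> static void io_wakeup(struct cpu_model *c)
> {
>     /* Gradual boost: close half of the remaining gap to max each time. */
>     c->min_perf += (c->max_perf - c->min_perf) / 2;
> }
>
> static void idle_tick(struct cpu_model *c)
> {
>     /* No recent IO wakeups: decay back toward the unboosted level. */
>     if (c->min_perf > c->base_perf)
>         c->min_perf -= (c->min_perf - c->base_perf) / 2 + 1;
>     if (c->min_perf < c->base_perf)
>         c->min_perf = c->base_perf;
> }
>
> int main(void)
> {
>     struct cpu_model c = { .min_perf = 10, .base_perf = 10, .max_perf = 40 };
>
>     for (int i = 0; i < 4; i++) {
>         io_wakeup(&c);
>         printf("after IO wakeup %d: min_perf=%d\n", i + 1, c.min_perf);
>     }
>     for (int i = 0; i < 4; i++) {
>         idle_tick(&c);
>         printf("after idle tick %d: min_perf=%d\n", i + 1, c.min_perf);
>     }
>     return 0;
> }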
>
> Results on a Skylake server:
>
> Benchmark Improvement %
> ----------------------------------------------------------------------
> dbench 50.36
> thread IO bench (tiobench) 10.35
> File IO 9.81
> sqlite 15.76
> X264 -104 cores 9.75
>
> Spec Power (Negligible impact 7382 Vs. 7378)
> Idle Power No change observed
> -----------------------------------------------------------------------
>
> HWP brings the best performance/watt at EPP=0x80. Since we are boosting
> EPP here to 0, performance/watt drops by up to 10%, so there is a power
> penalty to these changes.
>
> Mel Gorman also provided test results on a prior patchset, which show the
> benefits of this series.
>
> Srinivas Pandruvada (4):
> cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
> cpufreq: intel_pstate: HWP boost performance on IO wakeup
> cpufreq: intel_pstate: New sysfs entry to control HWP boost
> cpufreq: intel_pstate: enable boost for Skylake Xeon
>
> drivers/cpufreq/intel_pstate.c | 179 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 176 insertions(+), 3 deletions(-)
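>
> As a usage note for patch 3/4, here is a minimal sketch of toggling the new
> knob from user space; the attribute path is assumed from the patch title
> ("New sysfs entry to control HWP boost") and may not match the final
> interface exactly:
>
> #include <stdio.h>
>
> /*
>  * Enable or disable HWP dynamic boost via sysfs. The path below is an
>  * assumption based on the patch title and the existing intel_pstate
>  * sysfs directory; check the merged documentation for the actual name.
>  */
> static int set_hwp_boost(int enable)
> {
>     const char *path =
>         "/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost";
>     FILE *f = fopen(path, "w");
>
>     if (!f) {
>         perror("fopen");
>         return -1;
>     }
>     fprintf(f, "%d\n", enable ? 1 : 0);
>     fclose(f);
>     return 0;
> }
>
> int main(void)
> {
>     return set_hwp_boost(1) ? 1 : 0;
> }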
>
>
Applied, thanks!