RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6

From: Doug Smythies
Date: Thu Apr 20 2017 - 10:56:10 EST


On 2017.04.19 01:16 Mel Gorman wrote:
> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>> Hi Mel,
>>
>> Thanks for the "how to" information.
>> This is a very interesting use case.
>> From trace data, I see a lot of minimal durations with
>> virtually no load on the CPU, typically more consistent
>> with some type of light duty periodic (~100 Hz) work flow
>> (where we would prefer to not ramp up frequencies, or more
>> accurately, to not keep them ramped up).
>
> This broadly matches my expectations in terms of behaviour. It is a
> low duty workload but while I accept that a laptop may not want the
> frequencies to ramp up, it's not universally true.

Agreed.
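
Incidentally, the sort of trace data referred to above can be captured with
the power:pstate_sample tracepoint that intel_pstate emits in active mode.
A minimal sketch, assuming debugfs is mounted in the usual place:

cd /sys/kernel/debug/tracing
echo 1 > events/power/pstate_sample/enable    # per-sample busy/frequency data
echo 1 > tracing_on
sleep 60                                      # capture a window of the workload
echo 0 > tracing_on
cat trace > /tmp/pstate_samples.txt           # durations fall out of the sample spacing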

> Long periods at low
> frequency to complete a workload is not necessarily better than using a
> high frequency to race to idle.

Agreed, but it is processor dependent. For example, with my older
i7-2700k processor I get the following package energies for
one loop (after the throw away loop) of the test (method 1):

intel_cpufreq, powersave (lowest energy reference):   5876 Joules
intel_cpufreq, conservative:                          5927 Joules
intel_cpufreq, ondemand:                              6525 Joules
intel_cpufreq, schedutil:                             6049 Joules
, performance (highest energy reference):             8105 Joules
intel_pstate, powersave:                              7044 Joules
intel_pstate, forced load-based algorithm:            6390 Joules
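
For reference, one way to measure package energy for a loop of the test is
via the RAPL powercap interface, if present (a sketch; the counter is in
microjoules and wraps, so keep the readings close to the run):

E0=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
# ... run one loop of the test here ...
E1=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
echo "package energy: $(( (E1 - E0) / 1000000 )) Joules"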

> Effectively, a low utilisation test suite
> could be considered as a "foreground task of high priority" and not a
> "background task of little interest".

I wouldn't know how to make the distinction.

>> My results (further below) are different than yours, sometimes
>> dramatically, but the trends are similar.
>
> It's inevitable there would be some hardware based differences. The
> machine I have appears to show an extreme case.

Agreed.

>> I have nothing to add about the control algorithm over what
>> Rafael already said.
>>
>> On 2017.04.11 09:42 Mel Gorman wrote:
>>> On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
>>>> On 2017.04.11 03:03 Mel Gorman wrote:
>>>>>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
>>>>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
>>>>>>>
>>>>>>> It's far more obvious when looking at the git test suite and the length
>>>>>>> of time it takes to run. This is a shellscript and git intensive workload
>>>>>>> whose CPU utilisation is very low but is less sensitive to multiple
>>>>>>> factors than netperf and sockperf.
>>>>>>
>>>>
>>>> I would like to repeat your tests on my test computer (i7-2600K).
>>>> I am not familiar with, and have not been able to find,
>>>> "the git test suite" shellscript. Could you point me to it?
>>>>
>>>
>>> If you want to use git source directly do a checkout from
>>> https://github.com/git/git and build it. The core "benchmark" is make
>>> test and timing it.
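
In concrete terms, method 1 amounts to something like the following sketch
(the build step is kept outside the timed region; the parallelism level is
arbitrary):

git clone https://github.com/git/git
cd git
make -j"$(nproc)"            # build once, untimed
/usr/bin/time make test      # the timed "benchmark" pass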
>>
>> Because I had trouble with your method further below, I also did
>> this method. I did 5 runs, after a throw away run, similar to what
>> you do (and I could see the need for a throw away pass).
>>
>
> Yeah, at the very least IO effects should be eliminated.
>
>> Results (there is something wrong with user and system times and CPU%
>> in kernel 4.5, so I only calculated Elapsed differences):
>>
>
> In case it matters, the User and System CPU times are reported as standard
> for these classes of workload by mmtests even though it's not necessarily
> universally interesting. Generally, I consider the elapsed time to
> be the most important but often, a major change in system CPU time is
> interesting. That's not universally true as there have been changes in how
> system CPU is calculated in the past and it's sensitive to Kconfig options
> with VIRT_CPU_ACCOUNTING_GEN being a notable source of confusion in the past.
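
As an aside, which accounting mode a given kernel was built with can be
checked with something like:

grep VIRT_CPU_ACCOUNTING /boot/config-$(uname -r)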
>
>> Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
>> ... test_run: done ...
>>
>> Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>>
>> intel_pstate - powersave
>> ... test_run: start 5 runs ...
>> 1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
>> ... test_run: done ...
>>
>> intel_pstate - performance (fast reference)
>> ... test_run: start 5 runs ...
>> 1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
>> ... test_run: done ...
>>
>> intel_cpufreq - powersave (slow reference)
>> ... test_run: start 5 runs ...
>> 2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
>> ... test_run: done ...
>>
>> intel_cpufreq - ondemand
>> ... test_run: start 5 runs ...
>> 1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU
>>
>
> Nothing overly surprising there. It's been my observation that intel_pstate
> is generally better than acpi_cpufreq, which somewhat amuses me when I still
> see suggestions to disable intel_pstate entirely, advice that is based on
> much older kernels.
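
For anyone unfamiliar, that older advice amounts to booting with the driver
off so that acpi-cpufreq takes over, i.e. adding this to the kernel command
line:

intel_pstate=disable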
>
>> intel_cpufreq - schedutil
>> ... test_run: start 5 runs ...
>> 2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
>> ... test_run: done ...
>>
>
> I'm mildly surprised at this. I had observed that schedutil is not great
> but I don't recall seeing a result this bad.
>
>> Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
>> ... test_run: done ...
>>
>
> And the revert does help albeit not being an option for reasons Rafael
> covered.
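
For clarity, the bracketed figures in the quoted results above convert
time(1)'s MM:SS elapsed output to seconds and compare against the 4.5
reference; for the intel_pstate powersave run, for example:

awk 'BEGIN { r = 33*60 + 57.81; t = 39*60 + 24.45;
             printf "%.2f vs %.2f seconds: %.2f%%\n", r, t, (r - t)/r*100 }'
# prints: 2037.81 vs 2364.45 seconds: -16.03%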

New data point: kernel 4.11-rc7, intel_pstate powersave, forcing the
load-based algorithm: Elapsed 3178 seconds.

If I understand your data correctly, my load-based results are the opposite of yours:

Mel:  4.11-rc5 vanilla:            Elapsed mean: 3750.20 seconds
Mel:  4.11-rc5 load-based:         Elapsed mean: 2503.27 seconds
Or: +33.25%

Doug: 4.11-rc6 stock:              Elapsed total (5 runs): 2364.45 seconds
Doug: 4.11-rc7 forced load-based:  Elapsed total (5 runs): 3178 seconds
Or: -34.4%
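
(Derivation of those two percentages, for clarity:)

awk 'BEGIN { printf "%.2f%%\n", (3750.20 - 2503.27)/3750.20*100 }'   # 33.25%
awk 'BEGIN { printf "%.2f%%\n", (2364.45 - 3178.00)/2364.45*100 }'   # -34.41%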

>>> The way I'm doing it is via mmtests so
>>>
>>> git clone https://github.com/gormanm/mmtests
>>> cd mmtests
>>> ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
>>> cd work/log
>>> ../../compare-kernels.sh | less
>>>
>>> and it'll generate a similar report to what I posted in this email
>>> thread. If you do multiple tests with different kernels then change the
>>> name of "test-run-1" to preserve the old data. compare-kernel.sh will
>>> compare whatever results you have.
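
In other words, the multi-kernel workflow is something like this sketch
(run names are arbitrary; boot the kernel under test before each run):

./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts k4.5
# reboot into the next kernel under test, then:
./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts k4.11-rc6
cd work/log
../../compare-kernels.sh | less      # compares all runs found in work/log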
>>
>>                        E min             E mean            E stddev        E coeffvar      E max
>> k4.5                   388.71            389.74            0.85            0.22            390.90
>> k4.11-rc6              456.51 (-17.44%)  458.52 (-17.65%)  1.64 (-92.78%)  0.36 (-63.86%)  461.47 (-18.05%)
>> k4.11-rc6 performance  342.81 ( 11.81%)  343.81 ( 11.78%)  0.67 ( 20.83%)  0.20 ( 10.25%)  344.83 ( 11.79%)
>> k4.11-rc6 pass-ps      668.79 (-72.05%)  669.42 (-71.76%)  0.41 ( 52.25%)  0.06 ( 72.20%)  669.91 (-71.38%)
>> k4.11-rc6 pass-od      552.85 (-42.23%)  553.45 (-42.01%)  0.31 ( 64.00%)  0.06 ( 74.65%)  553.68 (-41.64%)
>> k4.11-rc6 pass-su      646.96 (-66.44%)  647.95 (-66.25%)  0.68 ( 20.35%)  0.10 ( 52.09%)  648.75 (-65.96%)
>> k4.11-rc6 revert       375.08 (  3.51%)  375.98 (  3.53%)  0.46 ( 46.00%)  0.12 ( 44.03%)  376.37 (  3.72%)
>>
>> E = Elapsed time in seconds (table transposed here to prevent line length wrapping when I send)
>>
>>                        User      System    Elapsed
>> k4.5                   347.26    139.01    2346.77
>> k4.11-rc6              1801.56   701.87    2761.20
>> k4.11-rc6 performance  1398.76   366.59    2062.12
>> k4.11-rc6 pass-ps      2540.67   1346.75   4017.47
>> k4.11-rc6 pass-od      2106.30   1026.67   3321.10
>> k4.11-rc6 pass-su      2434.06   1322.39   3887.19
>> k4.11-rc6 revert       1536.80   449.81    2268.90
>>
>> Legend:
>> (no suffix) = active mode: intel_pstate - powersave
>> performance = active mode: intel_pstate - performance (fast reference)
>> pass-ps = passive mode: intel_cpufreq - powersave (slow reference)
>> pass-od = passive mode: intel_cpufreq - ondemand
>> pass-su = passive mode: intel_cpufreq - schedutil
>> revert = active mode: intel_pstate - powersave with commit ffb810563c0c reverted.
>>
>> I deleted the user, system, and CPU rows, because they don't make any sense.
>>
>
> User is particularly misleading. System can be very misleading between
> kernel versions due to accounting differences so I'm ok with that.
>
>> I do not know why the tests run overall so much faster on my computer,
>
> Differences in CPU I imagine. I know the machine I'm reporting on is a
> particularly bad example. I've seen other machines where the effect is
> less severe.

No, I meant that my overall run time was on the order of 3/4 of an hour,
whereas your tests were on the order of 3 hours. As far as I could tell,
our CPUs had similar capabilities.

>
>> I can only assume I have something wrong in my installation of your mmtests.
>
> No, I've seen results broadly similar to yours on other machines so I
> don't think you have a methodology error.
>
>> I do see mmtests looking for some packages which it can not find.
>>
>
> That's not too unusual. The package names are based on opensuse naming
> and that doesn't translate to other distributions. If you open
> bin/install-depends, you'll see a hashmap near the top that maps some of
> the names for redhat-based distributions and debian. It's not actively
> maintained. You can either install the packages manually before the
> test or update the mappings.
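
As an illustration of the kind of mapping described (hypothetical package
names and variable names; the actual layout in bin/install-depends differs):

# map an opensuse package name to the local distribution's equivalent
case "$distro" in
  debian) pkg="libdw-dev" ;;        # opensuse calls this libdw-devel
  redhat) pkg="elfutils-devel" ;;
  *)      pkg="libdw-devel" ;;      # default: the opensuse name
esac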

>> Mel wrote:
>>> The results show that it's not the only source as a revert (last column)
>>> doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
>>> to 2919 seconds (with a revert).
>>
>> In my case, the reverted code ran faster than the kernel 4.5 code.
>>
>> The other big difference is between Kernel 4.5 and 4.11-rc5 you got
>> -102.28% elapsed time, whereas I got -16.03% with method 1 and
>> -17.65% with method 2 (well, between 4.5 and 4.11-rc6 in my case).
>> I only get -93.28% and -94.82% difference between my fast and slow reference
>> tests (albeit on the same kernel).
>>
>
> I have no reason to believe this is a methodology error and is due to a
> difference in CPU. Consider the following reports
>
>
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource
>
> The first one (delboy) shows a gain of 1.35%, and for 4.11 (the kernel
> shown is 4.11-rc1 with vmscan-related patches on top that do not affect
> this test case) a regression of -17.51%, which is very similar to yours.
> The CPU there is a Xeon E3-1230 v5.
>
> The second report (ivy) is from the machine I based the original complaint
> on, and it shows the large regression in elapsed time.
>
> So, different CPUs have different behaviours, which is no surprise at all
> considering that, at the very least, exit latencies will be different.
> While there may not be a universally correct answer to how to do this
> automatically, is it possible to tune intel_pstate such that it ramps up
> quickly regardless of recent utilisation and reduces relatively slowly?
> That would be better from a power consumption perspective than setting the
> "performance" governor.

As mentioned above, I don't know how to make the distinction in the use
cases.
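
The nearest existing knob I can think of is the intel_pstate minimum
performance floor, which keeps frequencies from dropping far without pinning
them at maximum the way the performance governor does. A sketch, assuming the
standard sysfs interface (it changes the floor, not the ramp rates):

# raise the minimum performance floor to 50% of max
echo 50 > /sys/devices/system/cpu/intel_pstate/min_perf_pct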

... Doug