Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
From: Francisco Jerez
Date: Tue Sep 04 2018 - 03:12:37 EST
Eero Tamminen <eero.t.tamminen@xxxxxxxxx> writes:
> Hi,
>
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
>> As per testing Eero Tamminen, the results are comparable to the patchset
>> https://patchwork.kernel.org/patch/10312259/
>> But he has to watch results for several days to check trend.
>
> It's close, but there is some gap compared to Francisco's version.
>
> (Because of the large variance on TDP-limited devices, and factors
> causing extra perf differences e.g. between boots, it's hard to give
> an exact number without having trends from several days / weeks. I would
> also need a new version of Francisco's patch set that applies to the
> latest kernel like yours does.)
>
>
>> Since boost here is limited to turbo and non-turbo, we need some
>> way to adjust the fractions corresponding to max non-turbo as well. It
>> is much easier to use the actual P-state limits for boost instead of
>> fractions. So here the P-state IO boost limit is applied on top of the
>> P-state limit calculated via the current algorithm, replacing the current
>> io_wait boost calculation that uses fractions.
>>
>> Since we prefer to use a common algorithm for all processor platforms, this
>> change was tested on other client and server platforms as well. All results
>> were within the margin of error. Results:
>> https://bugzilla.kernel.org/attachment.cgi?id=278149
>
> Good.
>
> Francisco, how well do the listed PTS tests cover the latency-bound cases
> you were concerned about? [1]
>
>
> - Eero
>
> [1] Francisco was concerned that the patch:
> * its trade-off might regress latency-bound cases (which I haven't tested;
> I tested only 3D throughput), and
> * it addressed only one of the sources of energy inefficiency
> he had identified (which could explain the slightly better 3D results from
> his more complex patch set).
This patch causes a number of statistically significant regressions
(at a significance level of 1%) on the two systems I've tested it on. On
my CHV N3050:
| phoronix/fs-mark/test=0: XXX ±7.25% x34 -> XXX ±7.00% x39 d=-36.85% ±5.91% p=0.00%
| phoronix/sqlite: XXX ±1.86% x34 -> XXX ±1.88% x39 d=-21.73% ±1.66% p=0.00%
| warsow/benchsow: XXX ±1.25% x34 -> XXX ±1.95% x39 d=-10.83% ±1.53% p=0.00%
| phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance: XXX ±1.70% x31 -> XXX ±1.02% x34 d=-7.39% ±1.36% p=0.00%
| phoronix/gtkperf/gtk-test=GtkComboBox: XXX ±1.15% x13 -> XXX ±1.59% x14 d=-5.37% ±1.35% p=0.00%
| lightsmark: XXX ±1.45% x34 -> XXX ±0.97% x41 d=-4.66% ±1.19% p=0.00%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±1.04% x31 -> XXX ±1.04% x39 d=-4.58% ±1.01% p=0.00%
| jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128: XXX ±0.12% x31 -> XXX ±0.19% x39 d=-3.60% ±0.16% p=0.00%
| dbench/client-count=1: XXX ±0.50% x34 -> XXX ±0.50% x39 d=-2.51% ±0.49% p=0.00%
On my BXT J3455:
| fs-mark/test=0: XXX ±3.04% x6 -> XXX ±3.05% x9 d=-15.96% ±2.76% p=0.00%
| sqlite: XXX ±2.54% x6 -> XXX ±2.72% x9 d=-12.42% ±2.44% p=0.00%
| dbench/client-count=1: XXX ±0.42% x6 -> XXX ±0.36% x9 d=-6.52% ±0.37% p=0.00%
| dbench/client-count=2: XXX ±0.26% x6 -> XXX ±0.33% x9 d=-5.22% ±0.29% p=0.00%
| dbench/client-count=3: XXX ±0.34% x6 -> XXX ±0.53% x9 d=-2.92% ±0.45% p=0.00%
| x11perf/test=500px Compositing From Pixmap To Window: XXX ±2.29% x16 -> XXX ±2.11% x19 d=-2.69% ±2.16% p=0.09%
| lightsmark: XXX ±0.44% x6 -> XXX ±0.33% x9 d=-1.76% ±0.37% p=0.00%
| j2dbench/rendering-test=Vector Graphics Rendering: XXX ±1.18% x16 -> XXX ±1.82% x19 d=-1.71% ±1.54% p=0.26%
| gtkperf/gtk-test=GtkComboBox: XXX ±0.37% x6 -> XXX ±0.45% x9 d=-0.95% ±0.42% p=0.08%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±0.21% x3 -> XXX ±0.23% x6 d=-0.87% ±0.22% p=0.08%
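The d= and p= columns above come from comparing two sets of benchmark runs. As an illustration only (the actual tooling behind these numbers isn't shown here), a minimal sketch of such a comparison using a two-sample test with a normal approximation (reasonable for the run counts above, mostly n >= 30) could look like:

```python
# Illustrative sketch, NOT the actual analysis tool used for these results:
# decide whether the mean difference between two benchmark run sets is
# statistically significant at the 1% level.
from statistics import NormalDist, mean, stdev

def compare_runs(old, new, alpha=0.01):
    """Return (relative delta, two-tailed p-value, significant?)."""
    m1, m2 = mean(old), mean(new)
    # Unpooled (Welch-style) standard error of the difference of means.
    se = (stdev(old) ** 2 / len(old) + stdev(new) ** 2 / len(new)) ** 0.5
    z = (m2 - m1) / se
    # Two-tailed p-value under the normal approximation.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return (m2 - m1) / m1, p, p < alpha

# Synthetic example data: a clear ~5% regression across 30 runs each.
old = [100.0 + 0.1 * i for i in range(30)]
new = [95.0 + 0.1 * i for i in range(30)]
delta, p, significant = compare_runs(old, new)
```

With real benchmark data the per-run variance is what drives the ± confidence bounds reported next to each d= value.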
This is not surprising, given that the patch makes a hard trade-off
between latency and energy efficiency without considering whether the
workload is IO- or latency-bound. That is the reason why the series I
submitted earlier [1] to address this problem implemented an IO
utilization statistic, used to determine whether the workload is
IO-bound, in which case the latency trade-off wouldn't impact
performance negatively.
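To make the idea concrete, here is a toy user-space model of such a statistic (hypothetical names and parameters; this is a sketch of the concept, not the code from the series [1]): an exponentially weighted moving average of the fraction of time spent in IO-wait, which a governor could consult before accepting the latency trade-off.

```python
# Toy model of an IO utilization statistic (illustrative only; the class
# name, decay factor and threshold are assumptions, not the kernel code).
class IOUtilStat:
    def __init__(self, decay=0.125, io_bound_threshold=0.5):
        self.decay = decay                   # EWMA smoothing factor
        self.threshold = io_bound_threshold  # above this => call it IO-bound
        self.io_util = 0.0                   # smoothed IO-wait fraction

    def update(self, iowait_time, total_time):
        """Fold one sampling interval into the running average."""
        sample = iowait_time / total_time if total_time else 0.0
        self.io_util += self.decay * (sample - self.io_util)

    def io_bound(self):
        return self.io_util > self.threshold

stat = IOUtilStat()
for _ in range(40):
    # Sustained intervals with ~80% of the time spent waiting on IO.
    stat.update(iowait_time=8.0, total_time=10.0)
# Only once the statistic reports IO-bound would the governor accept
# the energy-vs-latency trade-off.
```

The point of smoothing over many intervals is that a single IO-wait event (e.g. an occasional page fault) doesn't flip the governor into the low-latency-sacrificing mode.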
Aside from that, the improvement in graphics throughput is only a
small fraction of what the series [1] achieves while TDP-bound. E.g.
with this patch on my BXT J3455:
| unigine/heaven: XXX ±0.21% x3 -> XXX ±0.19% x6 d=1.18% ±0.19% p=0.01%
| unigine/valley: XXX ±0.52% x3 -> XXX ±0.28% x6 d=1.56% ±0.37% p=0.06%
| gfxbench/gl_manhattan31: XXX ±0.12% x3 -> XXX ±0.21% x6 d=1.64% ±0.19% p=0.00%
| gfxbench/gl_trex: XXX ±0.56% x3 -> XXX ±0.36% x6 d=7.07% ±0.44% p=0.00%
vs. my series on the same system:
| gfxbench/gl_manhattan31: XXX ±0.37% x3 -> XXX ±0.08% x3 d=7.30% ±0.27% p=0.00%
| unigine/heaven: XXX ±0.47% x3 -> XXX ±0.40% x3 d=7.99% ±0.45% p=0.00%
| unigine/valley: XXX ±0.35% x3 -> XXX ±0.50% x3 d=8.24% ±0.45% p=0.00%
| gfxbench/gl_trex: XXX ±0.15% x3 -> XXX ±0.26% x3 d=9.12% ±0.23% p=0.00%
That's not surprising either, considering that this patch addresses
only one of the two reasons the current non-HWP intel_pstate
governor behaves inefficiently (see [1] for more details on the other
reason). And even that one is addressed only partially, since the
heuristic this patch implements to decide the degree of IOWAIT
boosting to apply can, and frequently will, trigger in heavily GPU-bound
cases: such workloads cause the task to IOWAIT on the GPU frequently,
so the P-state controller wastes shared TDP for no benefit.
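The failure mode can be illustrated with a deliberately simplified sketch (hypothetical names; a caricature of boost-on-every-IO-wait-wake-up, not the patch's exact algorithm): a heuristic keyed only on IO-wait wake-ups cannot distinguish a disk wait from a wait on a GPU fence, so a GPU-bound task that blocks on the GPU every frame keeps the CPU boosted even though the CPU is not the bottleneck.

```python
# Caricature of an iowait-driven boost heuristic (illustrative, not the
# patch's code): boost toward max on any IO-wait wake-up, regardless of
# what the task was actually waiting on.
def naive_boost(pstate, max_pstate, woke_from_iowait):
    return max_pstate if woke_from_iowait else pstate

cpu_pstate = 8
for frame in range(100):
    # GPU-bound render loop: the task blocks on the GPU each frame, which
    # is accounted as IO-wait, so the boost fires on every iteration.
    cpu_pstate = naive_boost(cpu_pstate, max_pstate=32, woke_from_iowait=True)
# cpu_pstate stays pinned at the maximum, diverting shared TDP from the
# GPU that actually needs it.
```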
[1] https://lists.freedesktop.org/archives/intel-gfx/2018-March/160532.html