Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

From: Giovanni Gherdovich
Date: Sat Dec 08 2018 - 05:19:08 EST

Hello Doug,

sorry for the late reply, this week I was traveling.

First off, thank you for trying out MMTests; I admit the documentation is
somewhat incomplete. I'm going to give you an overview of how I run benchmarks
with MMTests and how I print comparisons, hoping this can address your

In the last report I posted the following two tables, for instance; I'll now
show the commands I used to produce them.

>  * sockperf on loopback over UDP, mode "throughput"
>     * global-dhp__network-sockperf-unbound
>     48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
>   8x-SKYLAKE-UMA        1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA    3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA      4% better   12% worse   no change   no change   no change
>   NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
>       message size.
>   MEASURES: Throughput, in MBits/second
>   HIGHER is better
>   machine: 8x-SKYLAKE-UMA
>                                vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
>   Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
>   Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
>   Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
>   Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
>   Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

The first table is a panoramic view of all machines, the second is a zoom into
the 8x-SKYLAKE-UMA machine where the overall benchmark score is broken down
into the various message sizes.

The first thing to do is, obviously, to gather data for each kernel. Once the
kernel is installed on the box, as you already figured out, you have to run:

 ./ --config configs/config-global-dhp__network-sockperf-unbound SOME-MNEMONIC-NAME

In my case, what I did is to run:

 # build, install and boot 4.18.0-vanilla kernel
 ./ --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-vanilla

 # build, install and boot 4.18.0-teo kernel
 ./ --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo

 # build, install and boot 4.18.0-teo-v2+backport kernel
 ./ --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v2+backport


 # build, install and boot 4.18.0-teo-v6+backport kernel
 ./ --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v6+backport

At this point in the work/log directory I've accumulated all the data I need
for a report. What's important to note here is that a single configuration
file (such as config-global-dhp__network-sockperf-unbound) often runs more than
a single benchmark, according to the value of the MMTESTS variable in that
config. The config we're using has:

 export MMTESTS="sockperf-tcp-throughput sockperf-tcp-under-load sockperf-udp-throughput sockperf-udp-under-load"

which means it's running 4 different flavors of sockperf. The two tables above
are from the "sockperf-udp-throughput" variant.

Now that we've run the benchmarks for each kernel (every run takes around 75
minutes on my machines) we're ready to extract some comparison tables.
Exploring the work/log directory shows what we've got:

 $ find . -type d -name sockperf\* | sort

Above you see a directory for each of the benchmarks in the MMTESTS variable
from before and for each kernel patch. The command to get the detailed table
at the top of this message is:

 $ ../../bin/ --directory . \
       --benchmark sockperf-udp-throughput \
       --names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,4.18.0-teo-v3+backport,4.18.0-teo-v5+backport,4.18.0-teo-v6+backport \
   | grep '^.mean'

 Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
 Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
 Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
 Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
 Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

As you can see I'm grepping for a specific field in the table, i.e. the mean
values; some benchmarks use the harmonic mean (Hmean) and some use the
arithmetic mean (Amean). I also usually take a peek at the beginning of the table
without grepping, to get nice headers to stick on top. See how the table above
only uses 1/4 of the data we collected, i.e. it only reads from the data


note that the "--benchmark sockperf-udp-throughput" option flag gives the
first part of the directory names (the specific benchmark) while the option
flag "--names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,..." completes
the directory names (the kernel patches). One can use the script too, but I like bin/ the most.
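
To make the `grep '^.mean'` filter concrete, here is a minimal Python sketch of
the same pattern. The first sample row is copied (truncated) from the table
above; the "Amean" and "Stddev" rows are made up purely for illustration and are
not real benchmark output.

```python
import re

# First row taken from the table above (truncated); the other two rows
# are invented for illustration only.
sample_rows = [
    "Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*",
    "Amean     14        71.00 (   0.00%)       70.50 *  -0.70%*",   # hypothetical
    "Stddev    14         0.35 (   0.00%)        0.40 * -14.29%*",   # hypothetical
]

# re.match anchors at the start of the string, like '^' in grep;
# '.' accepts any single character, so both Hmean and Amean pass.
mean_rows = [row for row in sample_rows if re.match(r".mean", row)]

for row in mean_rows:
    print(row)
```

The single-character wildcard is what lets one filter serve benchmarks that
report either the harmonic or the arithmetic mean.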

Now, the overview table. To get the overall score for, say, teo-v3, the
script computes the ratios between v3 and the baseline for all message
sizes, and then takes the geometric mean of the results. The geometric mean is
chosen because it has the nice property that the geometric mean of ratios is
equal to the ratio of geometric means, that is:

 gmean(ratio(v3, baseline))

is the same as

 gmean(v3) / gmean(baseline)

where v3 and baseline are lists of values (results for each message length) and
ratio() is a function that takes component-wise ratios of lists. It would be
awkward if our mean function didn't have that property: you would get
different results depending on the order of your operations (first ratio then
mean, or vice-versa).
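
As a quick sanity check of that property, here is a minimal Python sketch using
the vanilla and teo-v3+backport Hmean values from the 8x-SKYLAKE-UMA table
above. The helper names gmean() and amean() are mine, chosen for illustration;
this is not the code MMTests itself runs.

```python
import math

def gmean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def amean(values):
    """Plain arithmetic mean, for contrast."""
    return sum(values) / len(values)

# Hmean throughput for message sizes 14, 100, 300, 500, 850,
# copied from the 8x-SKYLAKE-UMA table above.
baseline = [70.34, 499.24, 1489.13, 2469.62, 4165.12]   # vanilla
v3 = [69.49, 494.90, 1477.74, 2454.15, 4111.82]         # teo-v3+backport

ratios = [new / old for new, old in zip(v3, baseline)]

# Geometric mean: the order of operations does not matter.
mean_of_ratios = gmean(ratios)
ratio_of_means = gmean(v3) / gmean(baseline)
print(f"gmean: {mean_of_ratios:.2f} {ratio_of_means:.2f}")   # gmean: 0.99 0.99

# Arithmetic mean: the two orders give slightly different numbers.
print(amean(ratios), amean(v3) / amean(baseline))
```

The equality holds because the product of the ratios equals the ratio of the
products, and the n-th root distributes over a quotient. The arithmetic mean has
no such property, so the last line prints two values that differ from the fourth
decimal place onwards.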

I get overview tables running the same command above, but
adding the --print-ratio option flag.

 $ ../../bin/ --print-ratio \
       --benchmark sockperf-udp-throughput \
       --names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,4.18.0-teo-v3+backport,4.18.0-teo-v5+backport,4.18.0-teo-v6+backport

I then look at the very last line of the output, which in this case is

 Gmean Higher    1.00    0.99    0.99    0.99    0.99    1.10

"Gmean" in there reminds me that I'm looking at geometric means; "Higher" is
to say that "higher ratio is better" (remember that sockperf-udp-throughput
measures throughput), then the first ratio is always 1.00 because that's the
reference baseline. Then I have v1, v2, v3, v5 and v6 (I omitted the
headers). All versions have a 1% regression except v6, which shows a 10% gain.
If it were a lower-is-better type of test, the interpretation of those numbers
would be reversed.
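
The reading of that Gmean line can be sketched in a few lines of Python. The
describe() helper and its higher_is_better parameter are hypothetical names I
made up for this illustration; they are not part of MMTests.

```python
def describe(ratio, higher_is_better=True):
    """Turn a ratio against the baseline into an 'N% better/worse' label."""
    percent = round((ratio - 1.0) * 100)
    if percent == 0:
        return "no change"
    # A ratio above 1 is good news only when higher values are better.
    improved = (percent > 0) == higher_is_better
    return f"{abs(percent)}% {'better' if improved else 'worse'}"

# The "Gmean Higher" line quoted above; the first entry is the baseline.
gmean_line = [1.00, 0.99, 0.99, 0.99, 0.99, 1.10]

labels = [describe(r) for r in gmean_line[1:]]
print(labels)
# ['1% worse', '1% worse', '1% worse', '1% worse', '10% better']

# The same ratio read as a lower-is-better metric flips the verdict.
print(describe(0.99, higher_is_better=False))   # 1% better
```

The labels match the 8x-SKYLAKE-UMA row of the overview table at the top of this
message.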

Some specific remarks you raise:

On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
> ...
> My issue is that I do not understand the output or how it
> might correlate with your tables.
> I get, for example:
>    3    1   1     0.13s     0.68s     0.80s  1003894.302 1003779.613
>    3    1   1     0.16s     0.64s     0.80s  1008900.053 1008215.336
>    3    1   1     0.14s     0.66s     0.80s  1009630.439 1008990.265
> ...
> But I don't know what that means, nor have I been able to find
> a description anywhere.
> In the README file, I did see that for reporting I am
> somehow supposed to use, but
> I couldn't figure that out.

I don't recognize this output. I hope the illustration above can clarify how
MMTests is used.

> By the way, I am running these tests as a regular user, but
> they seem to want to modify:
> /sys/kernel/mm/transparent_hugepage/enabled
> which requires root privilege. I don't really want to mess
> with that stuff for these tests.

As Mel said, in this case the warning is harmless, but I understand that
requesting root permissions is annoying. At times the framework has to modify
some settings and that may require root; on one hand we are careful to undo
all modifications applied to a machine for the purpose of testing after the
benchmark is completed, but it is also true that MMTests has evolved over time to
be used on lab machines that can be redeployed to a clean known state at the
push of a button. The convenience of assuming root far outweighs the
limitations of this approach.

Another less than ideal characteristic of MMTests is that it downloads the
sources of benchmarks from external URLs which you, as a user, may or may not
trust. We have vetted all those URLs and determined that they represent the
canonical source for a given benchmark, but are not ultimately responsible for
their content. Inside our organization we have a mirror for the content of all
of them (the external URL is accessed only as a fallback), but when people
outside of SUSE use MMTests that's what they get.

> I had the thought that I should be able to get similar
> results as your "8x-SKYLAKE-UMA" on my test computer,
> i7-2600K. Or that at least it was worth trying, just
> to see.

Uhm. That may or may not happen. It's true that your i7-2600K and
8x-SKYLAKE-UMA have the same number of cores and threads, and that the size of
the L3 cache is the same, but the website tells me that
they are from different microarchitecture generations: i7-2600K is a Sandy
Bridge and 8x-SKYLAKE-UMA is a Skylake. Given that the timeline of Intel
microarchitectures is Sandy Bridge (2011), Haswell (2013), Broadwell (2015)
and Skylake (2016), i7-2600K might have more in common with my 48x-HASWELL-NUMA
than it has with 8x-SKYLAKE-UMA.

It might seem odd to compare a 48-thread machine with an 8-thread one, but
if you look at MMTests configurations you'll notice that we try to "normalize"
over available resources such as amount of memory or number of cores. We
generally try to explore the space of parameters from "light workload" to
machine saturation; then we average the results, as you see in the
illustration above. This might offer you an alternate viewpoint if what you
get on the i7-2600K doesn't match your expectations.