RE: [PATCH 1/2] x86,sched: Add support for frequency invariance

From: Doug Smythies
Date: Thu Sep 19 2019 - 10:42:41 EST

Next message: Qais Yousef: "Re: Usecases for the per-task latency-nice attribute"
Previous message: James Morse: "Re: [UNVERIFIED SENDER] Re: [UNVERIFIED SENDER] Re: [PATCH v2 2/3] soc: amazon: al-pos: Introduce Amazon's Annapurna Labs POS driver"
In reply to: Giovanni Gherdovich: "Re: [PATCH 1/2] x86,sched: Add support for frequency invariance"
Next in thread: Giovanni Gherdovich: "Re: [PATCH 1/2] x86,sched: Add support for frequency invariance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]

>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>>
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>>
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> on one fully loaded CPU at the highest CPU frequency.
>>
>> I do not know if such is practical, and I didn't re-visit the issue.
>>
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it removes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high
average fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always practical, as you mention
> 2. it's an adversary, worst-case sort of test for some scheduler code paths

Agree.

>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>>
>> taskset -c 3 test_script.
>>
>> Mode Governor degradation Power Bzy_MHz
>> Reference perf 1 CPU 1.00 reference 3798
>> - performance 1.2 6% worse 3618
>> passive ondemand 2.3
>> active powersave 2.6
>> passive schedutil 2.7 1600
>> passive schedutil-4C 1.68 2515
>>
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in of the recent
> dev cycles.

??
intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA: 0.49
>> 8x-SKYLAKE-UMA: 0.55
>> 48x-HASWELL-NUMA: 0.69

> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multple tens of percent point in
> advantage.

Agreed. This frequency invariant addition is great. However, if
schedutil is "out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also? It performs
just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somehow more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
> the final score? I was less adventurous and took the average of 5 runs for my
> gitsource executions :) that might contribute to a slightly higher final mark.

No, I did the exact same as you for the gitsource "make test" method, except
that I do 6 runs and throw out the first one and average the next 5.

Yes, I said I picked the worse ratio, but that was for my version of this test,
with the disk I/O and its related non-repeatability eliminated, only to provide
something for readers that did not want to go to my web site to look at the
related graph [1]. I'll send you the graph in a separate e-mail, in case you didn't
go to the web site.

>>>>
>>>> Compare it to the update formula of intel_pstate/powersave:
>>>
>>> freq_next = 1.25 * freq_max * Busy%
>>>
>>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
>>> idling (calculated with delta_MPERF / delta_TSC);
>>
>> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
>> method of utilization does not (at least not last time I investigated, which was
>> awhile ago (and I can not find my notes)).
>
> I think that depends on whether or not TSC stops at idle. As understand from
> the Intel Software Developer manual (SDM) a TSC that stops at idle is called
> "invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
> two counters behaves exactly the same and the ratio is always 1, modulo the
> delays in actually reading the two values. But all I know comes from
> turbostat's man page and the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where TSC does not stop.

By the way, I have now done some tests with this patch set and multi-threaded
stuff. Nothing to report, it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug

Next message: Qais Yousef: "Re: Usecases for the per-task latency-nice attribute"
Previous message: James Morse: "Re: [UNVERIFIED SENDER] Re: [UNVERIFIED SENDER] Re: [PATCH v2 2/3] soc: amazon: al-pos: Introduce Amazon's Annapurna Labs POS driver"
In reply to: Giovanni Gherdovich: "Re: [PATCH 1/2] x86,sched: Add support for frequency invariance"
Next in thread: Giovanni Gherdovich: "Re: [PATCH 1/2] x86,sched: Add support for frequency invariance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]