RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

From: Doug Smythies
Date: Wed Oct 30 2019 - 12:01:18 EST


On 2019.10.29 13:00 Srinivas Pandruvada wrote:
> On Tue, 2019-10-29 at 12:34 -0700, Srinivas Pandruvada wrote:
>> On Fri, 2019-10-25 at 08:55 -0700, Doug Smythies wrote:
>>
>> [...]
>>
>>> Experiment method:
>>>
>>> enable only idle state 1
>>> Dountil stopped
>>> apply a 100% load (all CPUs)
>>> after awhile (about 50 seconds) remove the load.
>>> allow a short transient delay (1 second).
>>> measure the processor package joules used over the next 149
>>> seconds.
>>> Enduntil
>>>
>>> Kernel k5.4-rc2 + reversion (this method)
>>> Average processor package power: 9.148 watts (128 samples, > 7
>>> hours)
>>> Minimum: 9.02 watts
>>> Maximum: 9.29 watts
>>> Note: outlier data point group removed, as it was assumed the
>>> computer had something to do and wasn't actually "idle".
>>>
>>> Kernel 5.4-rc2:
>>> Average processor package power: 9.969 watts (150 samples, > 8
>>> hours)
>>> Or 9% more energy for the idle phases of the work load.
>>> Minimum: 9.15 watts
>>> Maximum: 13.79 watts (51% more power)

>> Hi Doug,

Hi Srinivas,

>>
>> Do you have intel_pstate_tracer output?

Yes, I have many, many runs of intel_pstate_tracer.py
and many plots of pstate and CPU frequency lingering high
for a very long time after the load is removed.
Here is one example (my reference: results/teo041):

The load is removed at test time 539.047 seconds,
and requested pstates do start to fall. For example,
CPU 4 at time 539.052: its pstate request goes from
38 (the max for an i7-2600K) to 32. The last CPU
to reduce its pstate request is CPU 0 at time
539.714, but only to 25.

Then, CPU 4 doesn't run the driver for another
5.9 seconds, and even then only reduces its request
to pstate 21.

CPU 4 remains the defining CPU, and doesn't run the
driver again until time 577.235 seconds, at which time
its pstate request drops to 18, even with 0 load.
So, 38 seconds, and still only at pstate 18.

>> I guess that when started
>> request to measure the measure joules, it started at higher P-state
>> without revert.

No, not really. The main difference is in the time it takes to fully
drop to the lowest pstate.

>> Other way is check by fixing the max and min scaling frequency to
>> some frequency, then we shouldn't see power difference.

Yes, I did that, to learn the numbers.

> I mean not significant power difference.

For idle state 1, at least for my processor (i7-2600K), the difference
is huge. Keep in mind that (at least for my processor) a CPU in idle
state 1 does not relinquish its vote into the CPU frequency PLL, thus
the highest request dictates the CPU frequency.

Here are the idle state 1 powers. (42 percent is the minimum for my
processor. For reference, with all idle states enabled, the idle
power is 3.68 watts and the processor package temperature is about
25 degrees, independent of the requested pstate.)

Min-percent   watts   temp (degrees)
    42          8.7     35
    50         10.0     36
    60         12.0     37
    70         14.4     38
    80         17.3     41
    90         21       43
   100         21       43

Note that the 90 (pstate 35) and 100 (pstate 38)
powers are the same due to this (I assume):

cpu5: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores
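
For anyone wanting to check that decode by hand, here is a minimal
sketch (Python, not part of turbostat or of my scripts). It assumes the
usual layout where the lowest byte of MSR_TURBO_RATIO_LIMIT is the
1-active-core limit, the next byte the 2-core limit, and so on, and a
100 MHz bus clock as in the output above. The function name is made up
for illustration.

```python
def decode_turbo_ratio_limit(msr_value, base_clock_mhz=100.0, max_cores=4):
    """Return {active_cores: (ratio, MHz)} from an MSR_TURBO_RATIO_LIMIT value.

    Assumed layout: bits 7:0 hold the limit for 1 active core,
    bits 15:8 for 2 active cores, and so on, one byte per core count.
    """
    limits = {}
    for cores in range(1, max_cores + 1):
        ratio = (msr_value >> (8 * (cores - 1))) & 0xFF
        limits[cores] = (ratio, ratio * base_clock_mhz)
    return limits


# Value taken from the turbostat line above.
limits = decode_turbo_ratio_limit(0x23242526)
for cores in sorted(limits, reverse=True):
    ratio, mhz = limits[cores]
    print(f"{ratio} * 100.0 = {mhz:.1f} MHz max turbo {cores} active cores")
```

With all 4 cores sitting in idle state 1 and still counted as active,
the 4-core entry (35) is the one that applies.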

The limiting to pstate 35 can be verified by looking at the
requested and granted MSRs directly:

At 100% min percent:

Requested:
doug@s15:~/temp-k-git/linux$ sudo rdmsr --bitfield 15:8 -d -a 0x199
38
38
38
38
38
38
38
38

Granted:
doug@s15:~/temp-k-git/linux$ sudo rdmsr --bitfield 15:8 -d -a 0x198
35
35
35
35
35
35
35
35
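
In case the rdmsr options are unfamiliar: "--bitfield 15:8 -d" just
pulls bits 15 through 8 out of the register and prints them in decimal.
A small sketch of the same extraction follows. The raw register values
in it are hypothetical, chosen only so that the ratio field matches the
numbers above. Real reads of 0x199 (IA32_PERF_CTL) and 0x198
(IA32_PERF_STATUS) will generally have other bits set as well.

```python
def extract_bitfield(value, high, low):
    """Return bits high..low (inclusive) of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Hypothetical raw contents, for illustration only.
requested_raw = 0x2600  # as if read from MSR 0x199
granted_raw = 0x2300    # as if read from MSR 0x198

print(extract_bitfield(requested_raw, 15, 8))  # requested ratio
print(extract_bitfield(granted_raw, 15, 8))    # granted ratio
```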

> Also to get real numbers, need
> to use some power meter measuring CPU power.

Well, I have one, but for the box AC only. It just didn't seem
worth the overhead. Yes, I used the joules MSR directly. I also
do a sanity check by checking that the processor package temperature
makes sense for the calculated processor package watts. I added a
temperature column above.

> If I can get your script,
> I may be able to measure that.

Hmmm... This was actually a saga all by itself, mainly my own
fault. I am running a bit of a mess here so that I could minimize
the time between the multiple load drops from 100% to 0% so as
to make it easier to follow via the intel_pstate_tracer data.
Load methods aside, the rest is pretty simple:

doug@s15:~/idle$ cat load-no-load-forever2
#!/bin/dash

#
# load-no-load-forever2. Smythies 2019.10.21
# Just trying to get some debug data.
# apply load (real, not via disabling all idle
# states) then no load. loop forever.
# load version 2.

echo "load-no-load-forever2. Start. Doug Smythies 2019.10.21"

while [ 1 ];
do
  ~/c/waiter 9 2 4 2000000000 0 1 > /dev/null
  sleep 1
  sudo ~/c/measure_energy 149
done

Where "measure_energy.c" just samples the
joules MSR over an interval, sometimes a longer
interval than turbostat will allow (with full
accuracy), because I know that the joules counter
did not wrap around.
In a separate e-mail I'll send you the C programs,
although I seem to recall that Intel strips out
attached C programs from e-mails.
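
Until those arrive, the interval arithmetic can be sketched as follows.
This is not measure_energy.c, only an illustration in Python of the
calculation, and the function names are made up. It assumes the package
energy counter (MSR_PKG_ENERGY_STATUS, 0x611) is a 32-bit running count
and that one count is 1/2^16 joules. The real exponent should be read
from bits 12:8 of MSR_RAPL_POWER_UNIT (0x606) on the machine under test.

```python
COUNTER_BITS = 32  # assumed width of the running energy count


def joules_per_count(esu):
    """Energy per counter increment, given the energy status unit exponent."""
    return 1.0 / (1 << esu)


def average_watts(start_raw, end_raw, seconds, esu=16):
    """Average power over an interval from two raw counter samples.

    The modulo tolerates at most one wrap of the counter between samples.
    """
    delta = (end_raw - start_raw) % (1 << COUNTER_BITS)
    return delta * joules_per_count(esu) / seconds


def seconds_until_wrap(watts, esu=16):
    """How long the counter takes to wrap at a constant power draw."""
    return (1 << COUNTER_BITS) * joules_per_count(esu) / watts


# Hypothetical samples straddling a wrap, 1 second apart.
print(average_watts(0xFFFFFF00, 0x00000100, 1.0))
print(seconds_until_wrap(10.0))
```

Under those assumptions the counter covers 65536 joules per wrap, so at
around 10 watts it takes roughly 6500 seconds to wrap, far longer than
the 149 second interval used here.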

Hope this helps.

... Doug