Re: [PATCH v7] sched: Consolidate cpufreq updates

From: Christian Loehle
Date: Thu Oct 10 2024 - 14:57:57 EST


On 10/8/24 10:56, Christian Loehle wrote:
> On 10/7/24 18:20, Anjali K wrote:
>> Hi, I tested this patch to see if it causes any regressions on bare-metal POWER9 systems with microbenchmarks.
>> The test system is a 2-NUMA-node, 128-CPU PowerNV POWER9 system. The conservative governor is enabled.
>> I took the baseline as the 6.10.0-rc1 tip sched/core kernel.
>> No regressions were found.
>>
>> +------------------------------------------------------+--------------------+----------+
>> |                     Benchmark                        |      Baseline      | Baseline |
>> |                                                      |  (6.10.0-rc1 tip   | + patch  |
>> |                                                      |  sched/core)       |          |
>> +------------------------------------------------------+--------------------+----------+
>> |Hackbench run duration (sec)                          |         1          |   1.01   |
>> |Lmbench simple fstat (usec)                           |         1          |   0.99   |
>> |Lmbench simple open/close (usec)                      |         1          |   1.02   |
>> |Lmbench simple read (usec)                            |         1          |   1      |
>> |Lmbench simple stat (usec)                            |         1          |   1.01   |
>> |Lmbench simple syscall (usec)                         |         1          |   1.01   |
>> |Lmbench simple write (usec)                           |         1          |   1      |
>> |stressng (bogo ops)                                   |         1          |   0.94   |
>> |Unixbench execl throughput (lps)                      |         1          |   0.97   |
>> |Unixbench Pipebased Context Switching throughput (lps)|         1          |   0.94   |
>> |Unixbench Process Creation (lps)                      |         1          |   1      |
>> |Unixbench Shell Scripts (1 concurrent) (lpm)          |         1          |   1      |
>> |Unixbench Shell Scripts (8 concurrent) (lpm)          |         1          |   1.01   |
>> +------------------------------------------------------+--------------------+----------+
>>
>> Thank you,
>> Anjali K
>>
>
> The default CPUFREQ_DBS_MIN_SAMPLING_INTERVAL still enforces a minimum of
> 2 ticks between cpufreq updates on conservative/ondemand.
> What is your sampling_rate setting? What's your HZ?
> Interestingly, the context-switch-heavy benchmarks still show -6%, don't they?
> Do you mind trying schedutil with a reasonable rate_limit_us, too?
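
For reference, that floor is simply 2 ticks expressed in microseconds, so
e.g. 8000us at HZ=250 and 2000us at HZ=1000. Assuming per-policy governor
tunables and the usual distro config location, both can be checked with
something like:

  # tick length follows from CONFIG_HZ
  grep 'CONFIG_HZ=' /boot/config-$(uname -r)
  # effective conservative sampling interval in microseconds
  cat /sys/devices/system/cpu/cpufreq/policy0/conservative/sampling_rate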


After playing with this a bit more I can see a ~6% regression on
workloads like hackbench too.
Around 80% of that is due to the update in check_preempt_wakeup_fair(),
the rest due to the update at context switch. Overall, the number of
cpufreq_update_util() calls for hackbench -pTl 20000 increased by
20-25x; removing the one in check_preempt_wakeup_fair() brings this
down to 10x. For other workloads the number of cpufreq_update_util()
calls is in the same ballpark as mainline.
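
In case anyone wants to reproduce the counting: a rough, system-wide
approximation is possible with bpftrace when running schedutil, assuming
the static sugov_* callbacks are visible in kallsyms and not inlined
(counting the rate-limit drops additionally needs a probe on
sugov_should_update_freq or a temporary counter in the kernel):

  # count the schedutil update callbacks (a proxy for cpufreq_update_util(),
  # which itself is inlined) and the actual frequency transitions
  bpftrace -e '
    kprobe:sugov_update_shared,
    kprobe:sugov_update_single_freq,
    kprobe:sugov_update_single_perf { @update_util = count(); }
    tracepoint:power:cpu_frequency  { @freq_changes = count(); }'
  # run the workload, then Ctrl-C to print the counts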

I also looked into the forced_update mechanism, because that still
bugged me, and I have to say I'd prefer removing rate_limit_us,
last_freq_update_time and freq_update_delay_ns altogether. The number
of updates blocked by the rate limit was already pretty low and has
become negligible by now for most workloads/platforms.
Commit 37c6dccd6837 ("cpufreq: Remove LATENCY_MULTIPLIER") already
brought the default rate_limit_us down into the microsecond range, but
even with rate_limit_us==2000 I get the following on an rk3588
([LLLL][bb][bb]), HZ=250:

mainline (rate_limit_us==2000):
(columns: update_util calls / update_util dropped by rate_limit_us / actual freq changes)

60s idle:
932 / 48 / 12

fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms
40274 / 129 / 36

hackbench -pTl 20000
319331 / 523 / 41

with $SUBJECT and rate_limit_us==93:
(same columns)

60s idle:
1031 / 5 / 11

fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms
40297 / 17 / 32

hackbench -pTl 20000
7252343 / 600 / 60

just to mention a few.
This obviously depends on the OPPs, workload, and HZ though.
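
The rate_limit_us values above refer to the schedutil tunable, which is
exposed per policy (or globally, depending on the driver) and can be
read and overridden via sysfs, e.g.:

  # current effective rate limit of policy0, in microseconds
  cat /sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us
  # force a specific value for comparison, e.g. 2000us
  echo 2000 > /sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us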

Overall I find the idea of the update (mostly) coming from within the
perf domain (and thus the sugov update_lock also mostly being contended
there) quite appealing, but given that we now update more frequently and
arguably have more code locations calling the update (the update at
enqueue is reintroduced), what exactly are we still consolidating here?

Regards,
Christian