Re: [linus:master] [timers] 7ee9887703: stress-ng.uprobe.ops_per_sec -17.1% regression

From: Lukasz Luba
Date: Mon Apr 29 2024 - 03:54:08 EST




On 4/26/24 17:03, Rafael J. Wysocki wrote:
Hi,

On Thu, Apr 25, 2024 at 10:23 AM Anna-Maria Behnsen
<anna-maria@xxxxxxxxxxxxx> wrote:

Hi,

(adding cpuidle/power people to cc-list)

Oliver Sang <oliver.sang@xxxxxxxxx> writes:

hi, Frederic Weisbecker,

On Tue, Apr 02, 2024 at 12:46:15AM +0200, Frederic Weisbecker wrote:
Le Wed, Mar 27, 2024 at 04:39:17PM +0800, kernel test robot a écrit :


Hello,


we reported
"[tip:timers/core] [timers] 7ee9887703: netperf.Throughput_Mbps -1.2% regression"
in
https://lore.kernel.org/all/202403011511.24defbbd-oliver.sang@xxxxxxxxx/

now we noticed this commit is in mainline and we captured further results.

We still include the netperf results for completeness; details below, FYI.


kernel test robot noticed a -17.1% regression of stress-ng.uprobe.ops_per_sec
on:

The good news is that I can reproduce.
It has made me spot something already:

https://lore.kernel.org/lkml/ZgsynV536q1L17IS@xxxxxxxxxxxxx/T/#m28c37a943fdbcbadf0332cf9c32c350c74c403b0

But that's not enough to fix the regression. Investigation continues...

Thanks a lot for the information! If you want us to test any patch, please let us know.

Oliver, I would be happy to see whether the patch at the end of this
message restores the original behaviour in your test setup as well. I
applied it on 6.9-rc4. This patch is not a fix - it is just a pointer to
the kernel path that might cause the regression. Note that a warning in
tick_sched will probably be triggered. This happens when the first timer
is already in the past; I didn't add an extra check when creating the
'de facto' timer thingy, but the existing code already handles this
problem properly, so the warning can be ignored here.

For the cpuidle people, let me explain what I observed, my resulting
assumption, and my request for help:

cpuidle governors use expected sleep length values (among other data) to
decide which idle state would be good to enter. The expected sleep
length takes the first queued timer of the CPU into account and is
provided by tick_nohz_get_sleep_length(). With the timer pull model in
place, non-pinned timers are not taken into account when other CPUs are
up and running and could handle those timers. This can lead to increased
sleep length values. On my system, during the stress-ng uprobes test the
maximum was in the range of 100us without the patch set, and with the
patch set the maximum was in the range of 200s. This is intended
behaviour: timers which could expire on any CPU should expire on a CPU
which is busy anyway, so that the non-busy CPU is able to go idle.
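To illustrate why a longer expected sleep length steers the governor
toward a deeper state, here is a deliberately simplified userspace model
of the selection step (this is not the actual menu/teo code; the state
table and all residency/latency values are invented for illustration):

```python
# Simplified model of cpuidle state selection: pick the deepest state
# whose target residency still fits within the expected sleep length.
# The states below are invented; real values come from the idle driver.

IDLE_STATES = [
    # (name, target_residency_us, exit_latency_us)
    ("WFI", 1,    1),
    ("C1",  50,   20),
    ("C2",  500,  150),
    ("C3",  5000, 1000),
]

def select_state(expected_sleep_us):
    """Return the deepest state whose target residency fits the sleep length."""
    chosen = IDLE_STATES[0]
    for state in IDLE_STATES:
        if state[1] <= expected_sleep_us:
            chosen = state
    return chosen[0]

# A ~100us expected sleep length keeps the CPU in a shallow state...
print(select_state(100))          # C1
# ...while a 200s sleep length (as seen with the timer pull model) makes
# the deepest state look attractive, and its large exit latency is then
# paid on every early wakeup.
print(select_state(200_000_000))  # C3
```

If wakeups actually arrive every ~100us, every C3 entry is cut short and
the exit latency shows up as the per-operation overhead the benchmark
measures.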

Those increased sleep length values were the only anomaly I could find
in the traces with the regression.

I created the patch below, which simply fakes the sleep length values so
that they take all timers of the CPU into account (including the
non-pinned ones). This patch roughly restores the behaviour of
tick_nohz_get_sleep_length() before the change, but with the timer pull
model still in place.

With the patch, the regression was gone, at least on my system (using
the menu cpuidle governor, but also teo).

So my assumption here is that the cpuidle governors conclude that a
deeper idle state can be chosen, and selecting the deeper idle state
incurs overhead when returning from idle. But I have to note that I'm
still not familiar with cpuidle internals... So I would be happy about
some hints on how to debug/trace cpuidle internals to verify or falsify
this assumption.

You can look at the "usage" and "time" numbers for idle states in

/sys/devices/system/cpu/cpu*/cpuidle/state*/

The "usage" value is the number of times the governor has selected the
given state, and the "time" value is the total idle time after
requesting the given state (i.e. the sum of the time intervals between
the governor selecting that state and the wakeup from it).

If "usage" decreases for deeper (higher number) idle states relative
to its value for shallower (lower number) idle states after applying
the test patch, that will indicate that the theory is valid.
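The per-state counters can be totalled across CPUs with a short script.
A sketch following the sysfs layout described above (on the target
machine you would point it at /sys/devices/system/cpu; here it is
demonstrated against a synthetic directory tree, since not every
environment exposes cpuidle):

```python
import glob
import os
import tempfile
from collections import defaultdict

def summarize_cpuidle(base="/sys/devices/system/cpu"):
    """Sum the 'usage' and 'time' counters per idle state across all CPUs."""
    usage = defaultdict(int)
    time_us = defaultdict(int)
    for d in sorted(glob.glob(os.path.join(base, "cpu*", "cpuidle", "state*"))):
        state = os.path.basename(d)
        with open(os.path.join(d, "usage")) as f:
            usage[state] += int(f.read())
        with open(os.path.join(d, "time")) as f:
            time_us[state] += int(f.read())
    return dict(usage), dict(time_us)

# Demonstrate on a synthetic tree mimicking the sysfs layout:
# two CPUs, two idle states each, with made-up counter values.
base = tempfile.mkdtemp()
for cpu in range(2):
    for st, (u, t) in enumerate([(100, 5000), (40, 90000)]):
        d = os.path.join(base, f"cpu{cpu}", "cpuidle", f"state{st}")
        os.makedirs(d)
        with open(os.path.join(d, "usage"), "w") as f:
            f.write(str(u))
        with open(os.path.join(d, "time"), "w") as f:
            f.write(str(t))

usage, time_us = summarize_cpuidle(base)
print(usage)    # {'state0': 200, 'state1': 80}
print(time_us)  # {'state0': 10000, 'state1': 180000}
```

Dividing "time" by "usage" per state also gives the average residency,
which should shrink for deep states if they are being cut short.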

I agree with Rafael here, those statistics are the first thing to check.
Then, when you see a difference in those stats between the baseline and
the patched version, we can analyze the internal governor decisions with
the help of tracing.
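For the baseline-vs-patched comparison it is enough to diff per-state
"usage" snapshots taken around the workload. A minimal sketch, with
invented counter values chosen to show the pattern Rafael describes
(deeper-state usage dropping once the test patch shortens the expected
sleep lengths):

```python
def usage_delta(before, after):
    """Per-state increase in the 'usage' counter across a workload run."""
    return {s: after[s] - before[s] for s in before}

# Invented counters sampled before/after a stress-ng run, state1 = deep.
baseline = usage_delta({"state0": 100, "state1": 500},
                       {"state0": 150, "state1": 2500})
patched  = usage_delta({"state0": 100, "state1": 500},
                       {"state0": 900, "state1": 700})

print(baseline)  # {'state0': 50, 'state1': 2000}: the deep state dominates
print(patched)   # {'state0': 800, 'state1': 200}: mostly shallow now
```

A shift like this between the two runs would support the theory that the
regression comes from over-optimistic deep-state selection.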

Please also share how many idle states are present on those testing platforms.

BTW, this stress-ng app looks like a good candidate for the OSPM
discussion that we (me & Rafael) are going to conduct this year.
We are going to talk about QoS for frequency and latency for apps.
The governors (in cpuidle, cpufreq, devfreq) try hard to 'recognize' the
best platform setup for particular workloads, but it's really tough to
get that right without user-space help.

Therefore, besides these proposed fixes for the new timer model, IMO we
need something 'newer' in our Linux, since the HW evolves (e.g. L3 cache
with DVFS in phones).

Regards,
Lukasz