Re: [linus:master] [timers] 7ee9887703: stress-ng.uprobe.ops_per_sec -17.1% regression

From: Rafael J. Wysocki
Date: Mon Apr 29 2024 - 13:02:27 EST


On Mon, Apr 29, 2024 at 12:40 PM Anna-Maria Behnsen
<anna-maria@xxxxxxxxxxxxx> wrote:
>
> Anna-Maria Behnsen <anna-maria@xxxxxxxxxxxxx> writes:
>
> > Hi,
> >
> > Lukasz Luba <lukasz.luba@xxxxxxx> writes:
> >> On 4/26/24 17:03, Rafael J. Wysocki wrote:
> >>> On Thu, Apr 25, 2024 at 10:23 AM Anna-Maria Behnsen
> >>> <anna-maria@xxxxxxxxxxxxx> wrote:
> >
> > [...]
> >
> >>>> So my assumption here is that the cpuidle governors assume a deeper
> >>>> idle state could be chosen, and selecting the deeper idle state adds
> >>>> overhead when returning from idle. But I have to note here that I'm
> >>>> still not familiar with cpuidle internals... So I would be happy about
> >>>> some hints on how I can debug/trace cpuidle internals to falsify or
> >>>> verify this assumption.
> >>>
> >>> You can look at the "usage" and "time" numbers for idle states in
> >>>
> >>> /sys/devices/system/cpu/cpu*/cpuidle/state*/
> >>>
> >>> The "usage" value is the number of times the governor has selected the
> >>> given state, and the "time" value is the total idle time after requesting
> >>> that state (i.e. the sum of the time intervals between the governor
> >>> selecting that state and the subsequent wakeup from it).
> >>>
> >>> If "usage" decreases for deeper (higher number) idle states relative
> >>> to its value for shallower (lower number) idle states after applying
> >>> the test patch, that will indicate that the theory is valid.
> >>
> >> I agree with Rafael here; those statistics are the first thing to check.
> >> Then, when you see a difference in those stats between the baseline and
> >> the patched version, we can analyze the internal governor decisions with
> >> the help of tracing.
> >>
> >> Please also share how many idle states those testing platforms have.
> >
> > Thanks Rafael and Lukasz, for the feedback here!
> >
> > So I simply summed the state usage values over all 112 CPUs and
> > calculated the diff between before and after the stress-ng call. The
> > values are from a single run.
> >
>
> Now here are the state usage values together with the times, because I
> forgot to also track the time in the first run:
>
> USAGE         good         bad   bad+patch
> ------      -------     -------   ---------
> state0          115         137         234
> state1       450680      354689      420904
> state2      3092092     2687410     3169438
>
>
> TIME [us]         good          bad    bad+patch
> ---------   ----------   ----------   ----------
> state0            9347         9683        18378
> state1       626029557    562678907    593350108
> state2      6130557768   6201518541   6150403441
>
>
> > good: 57e95a5c4117 ("timers: Introduce function to check timer base
> > is_idle flag")
> > bad: v6.9-rc4
> > bad+patch: v6.9-rc4 + patch
> >
> > I chose v6.9-rc4 for "bad" to make sure all the timer pull model fixes
> > are applied.
> >
> > If I got Rafael right, the values indicate that my theory is not
> > right...

It appears so.

However, the hardware may refuse to enter a deeper idle state in some cases.

It would be good to run the test under turbostat and see what happens
to hardware C-state residencies. I would also like to have a look at
the CPU frequencies in use in all of the cases above.
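
In case it helps, a quick-and-dirty way to get a rough picture of the
frequency side is to sample cpufreq's scaling_cur_freq around the
benchmark. A minimal sketch (assuming the cpufreq sysfs interface is
present on the test machine; turbostat will of course give more reliable
numbers):

import glob
import statistics
import time

def sample_freqs():
    # Read the current frequency (in kHz) reported by cpufreq for each CPU.
    pattern = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"
    freqs = []
    for path in glob.glob(pattern):
        with open(path) as f:
            freqs.append(int(f.read()))
    return freqs

samples = []
for _ in range(30):          # sample once per second while stress-ng runs
    samples += sample_freqs()
    time.sleep(1)

print("kHz min/avg/max: %d / %d / %d"
      % (min(samples), statistics.mean(samples), max(samples)))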

> ... but looking at the time values: CPUs enter state2 less often, but in
> total they stay in it longer.

I have divided the total residency numbers above by the corresponding
usage numbers and got the following:

state1: 1389.08 1586.40 1409.70
state2: 1982.66 2307.62 1940.53

for "good", "bad" and "bad+patch", respectively.
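
For reference, those figures are just the quoted "time" sums divided by
the corresponding "usage" sums, so they are average residencies per
idle-state entry in microseconds. A trivial sketch of the arithmetic,
reusing the values from the tables above:

usage = {
    "good":      [115, 450680, 3092092],
    "bad":       [137, 354689, 2687410],
    "bad+patch": [234, 420904, 3169438],
}
time_us = {
    "good":      [9347, 626029557, 6130557768],
    "bad":       [9683, 562678907, 6201518541],
    "bad+patch": [18378, 593350108, 6150403441],
}

# Average time spent per entry of state1 and state2 (microseconds).
for case in ("good", "bad", "bad+patch"):
    avgs = [time_us[case][i] / usage[case][i] for i in (1, 2)]
    print(case, ["%.2f" % a for a in avgs])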

This shows that, on average, after entering an idle state, a CPU spends
more time in it in the "bad" case than in the other two cases.

To me, this means that, on average, there are fewer wakeups from idle
states in the "bad" case (or, IOW, the wakeups occur less frequently),
and that seems to affect the benchmark in question adversely.
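
The usage counts point the same way. A quick cross-check, simply
re-adding the per-state "usage" numbers quoted above (each entry of an
idle state corresponds to one wakeup from idle over the run):

# Total idle-state entries per kernel, summed over state0..state2.
usage = {
    "good":      [115, 450680, 3092092],
    "bad":       [137, 354689, 2687410],
    "bad+patch": [234, 420904, 3169438],
}
for case, counts in usage.items():
    print(case, sum(counts))
# -> good: 3542887, bad: 3042236, bad+patch: 3590576, i.e. the "bad"
#    kernel wakes up from idle the least often.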

Thanks!