Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state

From: Rafael J. Wysocki
Date: Mon Jul 31 2023 - 07:00:44 EST


On Mon, Jul 31, 2023 at 12:35 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Mon, Jul 31, 2023 at 11:11 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Jul 31, 2023 at 10:01:53AM +0200, Rafael J. Wysocki wrote:
> > > On Sat, Jul 29, 2023 at 10:44 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 05:36:55PM +0200, Rafael J. Wysocki wrote:
> > > > > On Fri, Jul 28, 2023 at 5:01 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > In order to facilitate governors that track history in idle-state
> > > > > > buckets (TEO) making a useful decision about NOHZ, make sure we have a
> > > > > > bucket that counts tick-and-longer.
> > > > > >
> > > > > > In order to be inclusive of the tick itself -- after all, if we do not
> > > > > > disable NOHZ we'll sleep for a full tick, the actual boundary should
> > > > > > be just short of a full tick.
> > > > > >
> > > > > > IOW, when registering the idle-states, add one that is always
> > > > > > disabled, just to have a bucket.
> > > > >
> > > > > This extra bucket can be created in the governor itself, can't it?
> > > >
> > > > I couldn't find a nice spot for the governor to add idle-states.
> > >
> > > Well, I've thought this through and recalled a couple of things and my
> > > conclusion is that the decision whether or not to stop the tick really
> > > depends on the idle state choice.
> > >
> > > There are three cases:
> > >
> > > 1. The selected idle state is shallow (that is, its target residency
> > > is below the tick period length), but it is not the deepest one.
> > > 2. The selected idle state is shallow, but it is the deepest one (or
> > > at least the deepest enabled one).
> > > 3. The selected idle state is deep (that is, its target residency is
> > > above the tick length period).
> > >
> > > In case 1, the tick should not be stopped so as to prevent the CPU
> > > from spending too much time in a suboptimal idle state.
> > >
> > > In case 3, the tick needs to be stopped, because otherwise the target
> > > residency of the selected state would not be met.
> > >
> > > Case 2 is somewhat a mixed bag, but generally speaking stopping the
> > > tick doesn't hurt if the selected idle state is the deepest one,
> > > because in that case the governor kind of expects a significant exit
> > > latency anyway. If it is not the deepest one (which is disabled),
> > > it's better to let the tick tick.
> >
> > So I agree with 1.
> >
> > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > timer-pull thing, but even today. Simply disabling it because we picked
> > the deepest idle state, irrespective of the expected duration is wrong
> > as it will incur this significant cost.
> >
> > With 3 there is the question of how we get the expected sleep duration;
> > this is especially important with timer-pull, where we have this
> > chicken-and-egg thing.
> >
> > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > disabled
>
> Well, it shouldn't. Or at least it didn't before.
>
> It is expected to produce two values, one with the tick stopped (this
> is the return value of the function) and the other with the tick
> ticking (this is the one written under the address passed as the arg).
> This cannot depend on whether or not the tick will be stopped. Both
> are good to know.
>
> Now, I understand that getting these two values may be costly, so
> there is an incentive to avoid calling it, but then the governor needs
> to figure this out from its crystal ball and so care needs to be taken
> to limit the possible damage in case the crystal ball is not right.
>
> > and cpuilde wants to use tick_nohz_get_sleep_length() to
> > determine if to disable the tick. This cycle needs to be broken for
> > timer-pull.
> >
> > Hence my proposal to introduce the extra tick state, that allows fixing
> > both 2 and 3.
>
> I'm not sure about 3 TBH.
>
> Say there are 2 idle states, one shallow (say its target residency is
> 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
>
> Currently, there are 3 bins in this case, 0 (0 - 10 us), 1 (10 us - T)
> and 2 (T - infinity) and the governor will need to check bins 1 and 2
> unless it somehow knows that it will use state 0.
>
> Note that all of the events falling into bin 1 will cause the governor
> to select state 0 more often, so I don't see how adding one more bin
> between 1 and 2 changes this.

BTW, in teo, using the return value of tick_nohz_get_sleep_length()
for the initial state selection is an optimization. In principle, it
could consider all of the bins every time and use the
tick_nohz_get_sleep_length() return value to refine the state
selection in case it turns out to be small enough.

So if this is changed, it would avoid calling
tick_nohz_get_sleep_length() when it knows that, say, state 0 will be
selected anyway (or that the first non-polling state will be selected
anyway or some such).