Re: Stopping the tick on a fully loaded system

From: Rafael J. Wysocki
Date: Mon Jul 24 2023 - 04:23:29 EST


On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
<frederic@xxxxxxxxxx> wrote:
>
> (Adding Rafael in Cc)
>
> Le Thu, Jul 20, 2023 at 03:00:37PM +0200, Anna-Maria Behnsen a écrit :
> > I had also a look at teo. It makes things better but does not solve the
> > underlying problem that I see here - please correct me if I missed
> > something or if I'm simply wrong:
> >
> > Yes, the governors have to decide in the end, whether it makes sense to
> > stop the tick or not. For this decision, the governors require information
> > about the current state of the core and how long nothing has to be done
> > propably. At the moment the governors therefore call
> > tick_nohz_get_sleep_length(). This checks first whether the tick can be
> > stopped. Then it takes into account whether rcu, irq_work, arch_work needs
> > the CPU or a timer softirq is pending. If non of this is true, then the
> > timers are checked. So tick_nohz_get_sleep_length() isn't only based on
> > timers already.
>
> Right but those things (rcu/irq work, etc...) act kind of like timers here
> and they should be considered as exceptions.
>
> The timer infrastructure shouldn't take into account the idle activity,
> this is really a job for the cpuidle governors.
>
> > The information about the sleep length of the scheduler perspective is
> > completely missing in the current existing check for the probable sleep
> > length.

Well, not exactly.

>From the governor's perspective, tick_nohz_get_sleep_length() is
supposed to return a deterministic upper bound on the upcoming idle
duration (or at least it is used in the governors this way). IOW, it
is expected that the CPU will not be idle longer that the
tick_nohz_get_sleep_length() return value.

There are other factors that may cause the governor to predict a
shorter idle duration and the information coming from the scheduler
may be regarded as one of them, but they are not deterministic.

> > Sure, teo takes scheduler utilization into account directly in the
> > governor. But for me it is not comprehensible, why the CPU utilization
> > check is done after asking for the possible sleep length where timers are
> > taken into account. If the CPU is busy anyway, the information generated by
> > tick_nohz_next_event() is irrelevant.

Why isn't it?

The CPU is idle at that point and it has gone idle for a reason.
Surely, there was nothing to run on it at.

The scheduler only knows that the CPU has been busy recently, which
may not imply anything on whether or not and when it is going to be
busy again.

> > And when the CPU is not busy, then it
> > makes sense to ask for the sleep length also from a timer perspective.
> >
> > When this CPU utilization check is implemented directly inside the
> > governor, every governor has to implement it on it's own. So wouldn't it
> > make sense to implement a "how utilized is the CPU out of a scheduler

Yes, it would make sense to do that, but I thought that PELT was that
thing. Wasn't it?

> > perspective" in one place and use this as the first check in
> > tick_nohz_get_sleep_length()/tick_nohz_next_event()?

No, IMO it should be a separate check, because
tick_nohz_get_sleep_length() is expected to return a deterministic
upper bound on the idle duration whereas the scheduler information is
not deterministic. It is the governor's role to decide how to take it
into account.

> Well, beyond that, there might be other situations where the governor may
> decide not to stop the tick even if tick_nohz_next_event() says it's possible
> to do so. That's the purpose of having that next event as an input among many
> others for the cpuidle governors.

Exactly.

> As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> tick _stop_ time is always going to be problematic.
>
> Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> instead? This will change a bit the locking scenario because
> tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> In the worst case a remote tick happens and handles the earliest timer
> for the current CPU while it's between tick_nohz_next_event() and
> tick_nohz_stop_tick(), but then the current CPU would just propagate
> an earlier deadline than needed. No big deal.

FWIW, this sounds reasonable to me.