Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full CPUs
From: Christian Loehle
Date: Tue Aug 19 2025 - 05:12:15 EST
On 8/18/25 18:41, Rafael J. Wysocki wrote:
> On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> <christian.loehle@xxxxxxx> wrote:
>>
>> On 8/13/25 11:29, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>>
>>> When the menu governor runs on a nohz_full CPU and there are no user
>>> space timers in the workload on that CPU, it ends up selecting idle
>>> states with target residency values above TICK_NSEC all the time due to
>>> a tick_nohz_tick_stopped() check designed for a different use case.
>>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
>>> not actually mean anything in particular, whereas in the other case it
>>> indicates that previously the CPU was expected to be idle sufficiently
>>> long for the tick to be stopped, so it is not unreasonable to expect
>>> it to be idle beyond the tick period length again.
>>>
>>> In some cases, this behavior causes latency in the workload to grow
>>> undesirably. It may also cause the workload to consume more energy
>>> than necessary if the CPU does not spend enough time in the selected
>>> deep idle states.
>>>
>>> Address this by amending the tick_nohz_tick_stopped() check in question
>>> with a tick_nohz_full_cpu() one to avoid using the time till the next
>>> timer event as the predicted_ns value all the time on nohz_full CPUs.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>> ---
>>> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
>>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -293,8 +293,18 @@
>>> * in a shallow idle state for a long time as a result of it. In that
>>> * case, say we might mispredict and use the known time till the closest
>>> * timer event for the idle state selection.
>>> + *
>>> + * However, on nohz_full CPUs the tick does not run as a rule and the
>>> + * time till the closest timer event may always be effectively infinite,
>>> + * so using it as a replacement for the predicted idle duration would
>>> + * effectively always cause the prediction results to be discarded and
>>> + * deep idle states to be selected all the time. That might introduce
>>> + * unwanted latency into the workload and cause more energy than
>>> + * necessary to be consumed if the discarded prediction results are
>>> + * actually accurate, so skip nohz_full CPUs here.
>>> */
>>> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>>> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
>>> + predicted_ns < TICK_NSEC)
>>> predicted_ns = data->next_timer_ns;
>>>
>>> /*
>>>
>>>
>>>
>>
>> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
>> get_typical_interval() (which may suggest picking a shallow state based on
>> previous wakeup patterns) only then to never wake up again?
>
> Yes, there is this risk, but the current behavior is more damaging IMV
> because it (potentially) hurts both energy efficiency and performance.
>
> It is also arguably easier for the user to remedy getting stuck in a
> shallow idle state than to change governor's behavior (PM QoS is a bit
> too blunt for this).
>
> Moreover, configuring CPUs as nohz_full and leaving them in long idle
> may not be the most efficient use of them.
True, on the other hand the setup cost for nohz_full is so high, you'd expect
the additional idle states disabling depending on the workload isn't too much
to ask for...
Anyway feel free to go ahead.