Re: [PATCH v2 0/5] s390/idle: CPU idle driver

From: Mete Durlu

Date: Wed Jun 17 2026 - 16:08:40 EST

On 6/17/26 9:15 AM, Christian Loehle wrote:

On 6/10/26 21:23, Mete Durlu wrote:

On 6/9/26 5:47 PM, Christian Loehle wrote:

On 6/9/26 14:24, Mete Durlu wrote:

v1 -> v2:

* Add idle driver enteries to MAINTAINERS file (Christian Loehle)
* Remove extra line break left in drivers/cpuidle/Kconfig.s390
(Christian Loehle)

This patch series introduces a CPU idle driver for s390
architecture that leverages the existing cpu idle infrastructure and
TEO (Timer Events Oriented) governor to optimize idle state selection
based on timer events and interrupt patterns.

...

- Configuration
-----------------------------------------------------------------------

Idle state parameters are tuned per hypervisor type after benchmarks:

**LPAR:**
- Polling: 5us target residency, 0us exit latency
- EW: 5us target residency, 5us exit latency

**KVM/z/VM:**
- Polling: 1us target residency, 0us exit latency
- EW: 1us target residency, 1us exit latency

I think this would also be useful in cpuidle-s390.c in particular the
different residency+latency values for LPAR and KVM/z/VM and what they aim
to achieve for you.

We can put down a comment like below;

/*
* After various benchmark runs the tuneables for idle driver has shown
* the best performance with the following values;
* for LPAR:
* - Polling: 5us target residency, 0us exit latency
* - EW: 5us target residency, 5us exit latency
*
* for KVM/z/VM:
* - Polling: 1us target residency, 0us exit latency
* - EW: 1us target residency, 1us exit latency
*/

Is that what you are looking for or something more extensive to cover
what sort of behavior it causes and why it benefits the performance?
I wouldn't really like to put down lengthy comments here to be honest.

Why not?
At least to me the values are hard to follow and without knowing what they
represent.
For example is this actually the worst case exit latency for these systems?
Also from a cpuidle perspective residency == latency isn't outright wrong
but certainly always a little suspicious.

Since s390 is running virtualized machines, the values are not
deterministic. The values picked here are the best estimates backed by
testing. That is why I didn't want to put anything down initially but
of course the comment can also describe that. I'll add that in the
next version.

Additionally polling is initialised to 0/0 by poll_state.c, so I don't know
where you're taking these values from?

Having a look at the implementation of poll_idle() in poll_state.c.
The polling time limit (target_residency) is acquired from
cpuidle_poll_time(), which tries to find an enabled state deeper than
polling state and returns its target_residency. Since we only have two
states, it automatically means EW state's target_residency, or
IDLE_POLL_MAX if it is disabled.

I agree that cpuidle_poll_time() can derive the polling time limit from a
deeper enabled state, but it only does that for states with
target_residency_ns >= CPUIDLE_POLL_MIN (10us). With the proposed s390 EW
values of 1us/5us, that state is skipped and the poll limit falls
back to CPUIDLE_POLL_MAX (TICK_NSEC / 16).

Ah missed that CPUIDLE_POLL_MIN is 10us and had it wrong as less or
equal to 1us. For 1000Hz CPUIDLE_POLL_MAX calculates to 6250ns
which is close enough for LPAR (5us) but for KVM and ZVM case (1us)
it is a bit off.

Although, the benchmarks didn't reflect a problem. To verify I did
a quick test run with the following applied on a KVM and a ZVM guest;

--- a/drivers/cpuidle/cpuidle-s390.c
+++ b/drivers/cpuidle/cpuidle-s390.c
@@ -72,6 +72,20 @@ static int s390_cpuidle_cpu_dead(unsigned int cpu)
return 0;
}

+static void __init s390_cpuidle_poll_tune(void)
+{
+ struct cpuidle_device *dev;
+ int cpu;
+
+ for_each_present_cpu(cpu) {
+ dev = &per_cpu(cpuidle_dev, cpu);
+ if (machine_is_lpar())
+ dev->poll_limit_ns = 5000;
+ else
+ dev->poll_limit_ns = 1000;
+ }
+}
+

This causes cpuidle_poll_time() to exit early as poll_limit_ns is
already set for the polling cpu.

The results are quite similar to original benchmark results I collected,
therefore I don't think it is worth adding it to the patch series.

...

That said, one could argue that the value of CPUIDLE_POLL_MIN could be
reconsidered. Looking at the comments above and also the introducing
commit 7a25759eaa04 ("cpuidle: Select polling interval based on a c-state with a longer target residency")

```
/*
* Min polling interval of 10usec is a guess. It is assuming that
* for most users, the time for a single ping-pong workload like
* perf bench pipe would generally complete within 10usec but
* this is hardware dependent. Actual time can be estimated with
*
* perf bench sched pipe -l 10000
*
...
#define CPUIDLE_POLL_MIN 10000
```

- On s390 "perf bench sched pipe -l 10000" results in usecs/op values
in between 0.9-3.5 usecs.

- On my x86 laptop this range is between 1-2 usecs/op.

Based on that observation and reading the commit message I'd say
there should be some improvements possible there but, this should
be tackled with a different patch series. In fact I'd like to give
it a go after this patch series.

Thanks for having a look. I'll send a new version shortly.