Re: [PATCH v8 00/11] Enable haltpoll on arm64

From: zhenglifeng (A)
Date: Tue Oct 08 2024 - 22:37:35 EST




On 2024/9/26 7:24, Ankur Arora wrote:
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is particularly interesting for KVM guests,
> where it reduces IPC latencies.
>
> Comparing idle switching latencies on an arm64 KVM guest with
> perf bench sched pipe:
>
>                             usecs/op    %stdev
>
>   no haltpoll (baseline)     13.48     +- 5.19%
>   with haltpoll               6.84     +- 22.07%
>
>
> No change in performance for a similar test on x86:
>
>                                         usecs/op    %stdev
>
>   haltpoll w/ cpu_relax() (baseline)      4.75     +- 1.76%
>   haltpoll w/ smp_cond_load_relaxed()     4.78     +- 2.31%
>
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.
>
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
>
> The patch series is organized in three parts:
>
> - patch 1, reorganizes the poll_idle() loop, switching to
> smp_cond_load_relaxed() in the polling loop.
> Relatedly, patches 2 and 3 rename the config option ARCH_HAS_CPU_RELAX
> to ARCH_HAS_OPTIMIZED_POLL.
>
> - patches 4-6 reorganize the haltpoll selection and init logic
> to allow architecture code to select it.
>
> - and finally, patches 7-11 add the bits for arm64 support.
>
> What is still missing: this series largely completes the haltpoll side
> of functionality for arm64. There are, however, a few related areas
> that still need to be thrashed out:
>
> - WFET support: WFE on arm64 does not guarantee that poll_idle()
> will terminate within halt_poll_ns. Using WFET would address this.
> - KVM_NO_POLL support on arm64
> - KVM TWED support on arm64: allow the host to limit time spent in
> WFE.
>
>
> Changelog:
>
> v8: No logic changes. Largely a respin of v7, with changes
> noted below:
>
> - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
> own patch.
> (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>
> - address comments simplifying arm64 support (Will Deacon)
> (patch-11 "arm64: support cpuidle-haltpoll")
>
> v7: No significant logic changes. Mostly a respin of v6.
>
> - minor cleanup in poll_idle() (Christoph Lameter)
> - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
> (Tomohiro Misono)
>
> v6:
>
> - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
> changes together (comment from Christoph Lameter)
> - fleshes out the commit messages a bit more (comments from Christoph
> Lameter, Sudeep Holla)
> - also rework selection of cpuidle-haltpoll. Now selected based
> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
> - moved back to arch_haltpoll_want() (comment from Joao Martins)
> Also, arch_haltpoll_want() now takes the force parameter and is
> now responsible for the complete selection (or not) of haltpoll.
> - fixes the build breakage on i386
> - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
> Tomohiro Misono, Haris Okanovic)
>
>
> v5:
> - rework the poll_idle() loop around smp_cond_load_relaxed() (review
> comment from Tomohiro Misono.)
> - also rework selection of cpuidle-haltpoll. Now selected based
> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
> - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
> arm64 now depends on the event-stream being enabled.
> - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
> - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>
> v4 changes from v3:
> - change 7/8 per Rafael's input: drop the parens and use ret for the final check
> - add 8/8 which renames the guard for building poll_state
>
> v3 changes from v2:
> - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
> - add Ack-by from Rafael Wysocki on 2/7
>
> v2 changes from v1:
> - added patch 7, which changes cpu_relax() to smp_cond_load_relaxed() per PeterZ
> (this reduces the CPU cycles consumed in the tests above from
> 14,503,014,257 to 10,716,881,137)
> - removed the ifdef from patch 1 per RafaelW
>
> Please review.
>
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@xxxxxxxxxx/
>
> Ankur Arora (6):
> cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
> cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
> arm64: idle: export arch_cpu_idle
> arm64: select ARCH_HAS_OPTIMIZED_POLL
> cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
> arm64: support cpuidle-haltpoll
>
> Joao Martins (4):
> Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
> cpuidle-haltpoll: define arch_haltpoll_want()
> governors/haltpoll: drop kvm_para_available() check
> arm64: define TIF_POLLING_NRFLAG
>
> Mihai Carabas (1):
> cpuidle/poll_state: poll via smp_cond_load_relaxed()
>
> arch/Kconfig                               |  3 +++
> arch/arm64/Kconfig                         |  7 +++++++
> arch/arm64/include/asm/cpuidle_haltpoll.h  | 24 ++++++++++++++++++++++++
> arch/arm64/include/asm/thread_info.h       |  2 ++
> arch/arm64/kernel/idle.c                   |  1 +
> arch/x86/Kconfig                           |  5 ++---
> arch/x86/include/asm/cpuidle_haltpoll.h    |  1 +
> arch/x86/kernel/kvm.c                      | 13 +++++++++++++
> drivers/acpi/processor_idle.c              |  4 ++--
> drivers/cpuidle/Kconfig                    |  5 ++---
> drivers/cpuidle/Makefile                   |  2 +-
> drivers/cpuidle/cpuidle-haltpoll.c         | 12 +-----------
> drivers/cpuidle/governors/haltpoll.c       |  6 +-----
> drivers/cpuidle/poll_state.c               | 22 ++++++++++++++++------
> drivers/idle/Kconfig                       |  1 +
> include/linux/cpuidle.h                    |  2 +-
> include/linux/cpuidle_haltpoll.h           |  5 +++++
> 17 files changed, 83 insertions(+), 32 deletions(-)
> create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>

Hi Ankur,

Thanks for the patches!

We have tested these patches on our machine, with an adaptation for ACPI LPI
states rather than C-states.

Including the polling state, there are three states to enter. Comparing the
idle switching latencies of the different states with perf bench sched pipe:

                           usecs/op    %stdev

state0 (polling state)       7.36     +- 0.35%
state1                       8.78     +- 0.46%
state2                      77.32     +- 5.50%

It turns out that haltpoll works well on our machine.

Tested-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
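
For reference, state0 above is the generic polling state that
cpuidle_poll_state_init() installs, i.e. the poll_idle() loop that patch 1
reworks around smp_cond_load_relaxed(). Below is a minimal sketch of the
reworked loop, paraphrased from the cover letter and mainline poll_state.c
rather than copied from the patch (details such as the time-check cadence
may differ):

static int __cpuidle poll_idle(struct cpuidle_device *dev,
			       struct cpuidle_driver *drv, int index)
{
	u64 time_start = local_clock_noinstr();

	dev->poll_time_limit = false;

	raw_local_irq_enable();
	if (!current_set_polling_and_test()) {
		u64 limit = cpuidle_poll_time(drv, dev);

		while (!need_resched()) {
			unsigned int loop_count = 0;

			/*
			 * Wait until the thread flags change or the relax
			 * budget is spent: a cpu_relax() spin on x86,
			 * LDXR/WFE on arm64.
			 */
			smp_cond_load_relaxed(&current_thread_info()->flags,
					      VAL & _TIF_NEED_RESCHED ||
					      loop_count++ >= POLL_IDLE_RELAX_COUNT);

			if (local_clock_noinstr() - time_start > limit) {
				dev->poll_time_limit = true;
				break;
			}
		}
	}
	raw_local_irq_disable();

	current_clr_polling();

	return index;
}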

The adaptation for ACPI LPI states is shown below as a patch. Feel free to
include it as part of your series, or I can send it out separately after
your series is merged.
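
One structural note on the patch below: cpuidle_poll_state_init() is a no-op
stub when the architecture does not provide an optimized poll, so the
IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL) branch leaves such builds
unchanged. After the rename in your series, the guard in
include/linux/cpuidle.h should look roughly like this (a sketch of the
renamed mainline ARCH_HAS_CPU_RELAX guard, not copied from your patches):

#ifdef CONFIG_ARCH_HAS_OPTIMIZED_POLL
void cpuidle_poll_state_init(struct cpuidle_driver *drv);
#else
static inline void cpuidle_poll_state_init(struct cpuidle_driver *drv) {}
#endif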

From: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>

ACPI: processor_idle: Support polling state for LPI

Initialize an optional polling state in addition to the LPI states.

Wrap the enter method in a new one that correctly reflects the actual
entered state when the polling state is enabled, since LPI state i then
sits at driver index i + 1.

Signed-off-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
Reviewed-by: Jie Zhan <zhanjie9@xxxxxxxxxxxxx>
---
drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 44096406d65d..d154b5d77328 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
 	return -EINVAL;
 }
 
+/* To correctly reflect the entered state if the poll state is enabled. */
+static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
+					       struct cpuidle_driver *drv, int index)
+{
+	int entered_state;
+
+	if (unlikely(index < 1))
+		return -EINVAL;
+
+	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
+	if (entered_state < 0)
+		return entered_state;
+
+	return entered_state + 1;
+}
+
 static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 {
-	int i;
+	int i, count;
 	struct acpi_lpi_state *lpi;
 	struct cpuidle_state *state;
 	struct cpuidle_driver *drv = &acpi_idle_driver;
+	typeof(state->enter) enter_method;
 
 	if (!pr->flags.has_lpi)
 		return -EOPNOTSUPP;
 
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
+		cpuidle_poll_state_init(drv);
+		count = 1;
+		enter_method = acpi_idle_lpi_enter_with_poll_state;
+	} else {
+		count = 0;
+		enter_method = acpi_idle_lpi_enter;
+	}
+
 	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
 		lpi = &pr->power.lpi_states[i];
 
-		state = &drv->states[i];
+		state = &drv->states[count];
 		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
 		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
 		state->exit_latency = lpi->wake_latency;
@@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
 		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
 			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
-		state->enter = acpi_idle_lpi_enter;
-		drv->safe_state_index = i;
+		state->enter = enter_method;
+		drv->safe_state_index = count;
+		count++;
+		if (count == CPUIDLE_STATE_MAX)
+			break;
 	}
 
-	drv->state_count = i;
+	drv->state_count = count;
 
 	return 0;
 }
--
2.33.0