Re: [PATCH v8 00/11] Enable haltpoll on arm64

From: Ankur Arora
Date: Mon Oct 14 2024 - 21:54:47 EST



zhenglifeng (A) <zhenglifeng1@xxxxxxxxxx> writes:

> On 2024/9/26 7:24, Ankur Arora wrote:
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is of particular interest for KVM guests,
>> where it reduces IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                               usecs/op      %stdev
>>
>>   no haltpoll (baseline)        13.48    +-  5.19%
>>   with haltpoll                  6.84    +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>>                                           usecs/op     %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)         4.75   +- 1.76%
>>   haltpoll w/ smp_cond_load_relaxed()        4.78   +- 2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>>
>> Tomohiro Misono and Haris Okanovic also report similar latency
>> improvements on Grace and Graviton systems (for v7) [1] [2].
>>
>> The patch series is organized in three parts:
>>
>> - patch 1, reorganizes the poll_idle() loop, switching to
>> smp_cond_load_relaxed() in the polling loop.
>> Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>> renaming it to ARCH_HAS_OPTIMIZED_POLL.
>>
>> - patches 4-6 reorganize the haltpoll selection and init logic
>> to allow architecture code to select it.
>>
>> - and finally, patches 7-11 add the bits for arm64 support.
>>
>> What is still missing: this series largely completes the haltpoll side
>> of the functionality for arm64. There are, however, a few related areas
>> that still need to be fleshed out:
>>
>> - WFET support: WFE on arm64 does not guarantee that poll_idle()
>> would terminate within halt_poll_ns. Using WFET would address this.
>> - KVM_NO_POLL support on arm64
>> - KVM TWED support on arm64: allow the host to limit time spent in
>> WFE.
>>
>>
>> Changelog:
>>
>> v8: No logic changes. Largely respin of v7, with changes
>> noted below:
>>
>> - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>> own patch.
>> (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>>
>> - address comments simplifying arm64 support (Will Deacon)
>> (patch-11 "arm64: support cpuidle-haltpoll")
>>
>> v7: No significant logic changes. Mostly a respin of v6.
>>
>> - minor cleanup in poll_idle() (Christoph Lameter)
>> - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>> (Tomohiro Misono)
>>
>> v6:
>>
>> - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>> changes together (comment from Christoph Lameter)
>> - fleshes out the commit messages a bit more (comments from Christoph
>> Lameter, Sudeep Holla)
>> - also rework selection of cpuidle-haltpoll. Now selected based
>> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>> - moved back to arch_haltpoll_want() (comment from Joao Martins)
>> Also, arch_haltpoll_want() now takes the force parameter and is
>> now responsible for the complete selection (or not) of haltpoll.
>> - fixes the build breakage on i386
>> - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>> Tomohiro Misono, Haris Okanovic)
>>
>>
>> v5:
>> - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>> comment from Tomohiro Misono.)
>> - also rework selection of cpuidle-haltpoll. Now selected based
>> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>> - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>> arm64 now depends on the event-stream being enabled.
>> - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>> - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>>
>> v4 changes from v3:
>> - change 7/8 per Rafael input: drop the parens and use ret for the final check
>> - add 8/8 which renames the guard for building poll_state
>>
>> v3 changes from v2:
>> - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>> - add Ack-by from Rafael Wysocki on 2/7
>>
>> v2 changes from v1:
>> - added patch 7, which replaces cpu_relax() with smp_cond_load_relaxed() per PeterZ
>> (this reduces the CPU cycles consumed in the tests above by at least 50%:
>> 10,716,881,137 now vs 14,503,014,257 before)
>> - removed the ifdef from patch 1 per RafaelW
>>
>> Please review.
>>
>> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@xxxxxxxxxx/
>>
>> Ankur Arora (6):
>> cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>> cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>> arm64: idle: export arch_cpu_idle
>> arm64: select ARCH_HAS_OPTIMIZED_POLL
>> cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>> arm64: support cpuidle-haltpoll
>>
>> Joao Martins (4):
>> Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>> cpuidle-haltpoll: define arch_haltpoll_want()
>> governors/haltpoll: drop kvm_para_available() check
>> arm64: define TIF_POLLING_NRFLAG
>>
>> Mihai Carabas (1):
>> cpuidle/poll_state: poll via smp_cond_load_relaxed()
>>
>> arch/Kconfig | 3 +++
>> arch/arm64/Kconfig | 7 +++++++
>> arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>> arch/arm64/include/asm/thread_info.h | 2 ++
>> arch/arm64/kernel/idle.c | 1 +
>> arch/x86/Kconfig | 5 ++---
>> arch/x86/include/asm/cpuidle_haltpoll.h | 1 +
>> arch/x86/kernel/kvm.c | 13 ++++++++++++
>> drivers/acpi/processor_idle.c | 4 ++--
>> drivers/cpuidle/Kconfig | 5 ++---
>> drivers/cpuidle/Makefile | 2 +-
>> drivers/cpuidle/cpuidle-haltpoll.c | 12 +-----------
>> drivers/cpuidle/governors/haltpoll.c | 6 +-----
>> drivers/cpuidle/poll_state.c | 22 +++++++++++++++------
>> drivers/idle/Kconfig | 1 +
>> include/linux/cpuidle.h | 2 +-
>> include/linux/cpuidle_haltpoll.h | 5 +++++
>> 17 files changed, 83 insertions(+), 32 deletions(-)
>> create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>
> Hi Ankur,
>
> Thanks for the patches!
>
> We have tested these patches on our machine, with an adaptation of ACPI LPI
> states rather than c-states.
>
> Including the polling state, there are three states to enter. Comparing
> the idle switching latencies of the different states with perf bench
> sched pipe:
>
>                             usecs/op     %stdev
>
>   state0 (polling state)       7.36   +- 0.35%
>   state1                       8.78   +- 0.46%
>   state2                      77.32   +- 5.50%
>
> It turns out that this works well on our machine.
>
> Tested-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>

Great. Thanks Lifeng.

> The adaptation for ACPI LPI states is shown below as a patch. Feel free
> to include it as part of your series, or I can send it out separately
> after your series is merged.

Ah, so polling for the regular ACPI driver. From a quick look the
patch looks good, but since this series is mostly focused on haltpoll
I think that patch can go in afterwards.

Please Cc me when you send it.

Thanks
Ankur

> From: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
>
> ACPI: processor_idle: Support polling state for LPI
>
> Initialize an optional polling state besides LPI states.
>
> Wrap up a new enter method to correctly reflect the actual entered state
> when the polling state is enabled.
>
> Signed-off-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
> Reviewed-by: Jie Zhan <zhanjie9@xxxxxxxxxxxxx>
> ---
> drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
> 1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 44096406d65d..d154b5d77328 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
> return -EINVAL;
> }
>
> +/* To correctly reflect the entered state if the poll state is enabled. */
> +static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int index)
> +{
> + int entered_state;
> +
> + if (unlikely(index < 1))
> + return -EINVAL;
> +
> + entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
> + if (entered_state < 0)
> + return entered_state;
> +
> + return entered_state + 1;
> +}
> +
> static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
> {
> - int i;
> + int i, count;
> struct acpi_lpi_state *lpi;
> struct cpuidle_state *state;
> struct cpuidle_driver *drv = &acpi_idle_driver;
> + typeof(state->enter) enter_method;
>
> if (!pr->flags.has_lpi)
> return -EOPNOTSUPP;
>
> + if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
> + cpuidle_poll_state_init(drv);
> + count = 1;
> + enter_method = acpi_idle_lpi_enter_with_poll_state;
> + } else {
> + count = 0;
> + enter_method = acpi_idle_lpi_enter;
> + }
> +
> for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
> lpi = &pr->power.lpi_states[i];
>
> - state = &drv->states[i];
> + state = &drv->states[count];
> snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
> strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
> state->exit_latency = lpi->wake_latency;
> @@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
> state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
> if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
> state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> - state->enter = acpi_idle_lpi_enter;
> - drv->safe_state_index = i;
> + state->enter = enter_method;
> + drv->safe_state_index = count;
> + count++;
> + if (count == CPUIDLE_STATE_MAX)
> + break;
> }
>
> - drv->state_count = i;
> + drv->state_count = count;
>
> return 0;
> }


--
ankur