Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

From: Ankur Arora
Date: Fri Apr 05 2024 - 16:22:04 EST



Okanovic, Haris <harisokn@xxxxxxxxxx> writes:

> On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote:
>>
>> Mark Rutland <mark.rutland@xxxxxxx> writes:
>>
>> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
>> > > An arm64 cpuidle driver with two states: (1) First polls for new
>> > > runnable
>> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken
>> > > by
>> > > interrupt (the current arm64 behavior). It allows CPUs to return
>> > > from
>> > > idle more quickly by avoiding the longer interrupt wakeup path,
>> > > which
>> > > may require EL1/EL2 transition in certain VM scenarios.
>> >
>> > Please start off with an explanation of the problem you're trying
>> > to solve
>> > (which IIUC is to wake up more quickly in certain cases), before
>> > describing the
>> > solution. That makes it *significantly* easier for people to review
>> > this, since
>> > once you have the problem statement in mind it's much easier to
>> > understand how
>> > the solution space follows from that.
>> >
>> > > Poll duration is optionally configured at load time via the
>> > > poll_limit
>> > > module parameter.
>> >
>> > Why should this be a configurable parameter?
>> >
>> > (note, at this point you haven't introduced any of the data below,
>> > so the
>> > trade-off isn't clear to anyone).
>> >
>> > > The default 100 us duration was experimentally chosen, by
>> > > measuring QPS
>> > > (queries per sec) of the MLPerf bert inference benchmark, which
>> > > seems
>> > > particularly susceptible to this change; see procedure below. 100
>> > > us is
>> > > the inflection point where QPS stopped growing in a range of
>> > > tested
>> > > values. All results are from AWS m7g.16xlarge instances
>> > > (Graviton3 SoC)
>> > > with dedicated tenancy (dedicated hardware).
>> > >
>> > > | before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
>> > > | 5.87   | 5.91 | 5.96 | 6.01 | 6.06  | 6.07  | 6.06  | 6.06  | 6.06  |
>> > >
>> > > Perf's scheduler benchmarks also improve with a range of
>> > > poll_limit
>> > > values >= 10 us. Higher limits produce near identical results
>> > > within a
>> > > 3% noise margin. The following tables are `perf bench sched`
>> > > results,
>> > > run times in seconds.
>> > >
>> > > `perf bench sched messaging -l 80000`
>> > > | AWS instance  | SoC       | Before | After  | % Change |
>> > > | c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none     |
>> > > | c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none     |
>> > > | c6g.metal     | Graviton2 | 17.621 | 16.744 | none     |
>> > > | c7g.metal     | Graviton3 | 13.430 | 13.404 | none     |
>> > >
>> > > `perf bench sched pipe -l 2500000`
>> > > | AWS instance  | SoC       | Before | After  | % Change |
>> > > | c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50%     |
>> > > | c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34%     |
>> > > | c6g.metal     | Graviton2 | 17.609 | 15.170 | -14%     |
>> > > | c7g.metal     | Graviton3 | 14.103 | 12.304 | -13%     |
>> > >
>> > > `perf bench sched seccomp-notify -l 2500000`
>> > > | AWS instance  | SoC       | Before | After  | % Change |
>> > > | c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52%     |
>> > > | c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33%     |
>> > > | c6g.metal     | Graviton2 | 15.717 | 13.536 | -14%     |
>> > > | c7g.metal     | Graviton3 | 13.301 | 11.491 | -14%     |
>> >
>> > Ok, so perf numbers for a busy workload go up.
>> >
>> > What happens for idle state residency on a mostly idle system?
>> >
>> > > Steps to run MLPerf bert inference on Ubuntu 22.04:
>> > > sudo apt install build-essential python3 python3-pip
>> > > pip install "pybind11[global]" tensorflow transformers
>> > > export TF_ENABLE_ONEDNN_OPTS=1
>> > > export DNNL_DEFAULT_FPMATH_MODE=BF16
>> > > git clone https://github.com/mlcommons/inference.git --recursive
>> > > cd inference
>> > > git checkout v2.0
>> > > cd loadgen
>> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
>> > > pip install dist/*.whl
>> > > cd ../language/bert
>> > > make setup
>> > > python3 run.py --backend=tf --scenario=SingleStream
>> > >
>> > > Suggested-by: Ali Saidi <alisaidi@xxxxxxxxxx>
>> > > Reviewed-by: Ali Saidi <alisaidi@xxxxxxxxxx>
>> > > Reviewed-by: Geoff Blake <blakgeof@xxxxxxxxxx>
>> > > Cc: Brian Silver <silverbr@xxxxxxxxxx>
>> > > Signed-off-by: Haris Okanovic <harisokn@xxxxxxxxxx>
>> > > ---
>> > > drivers/cpuidle/Kconfig.arm | 13 ++
>> > > drivers/cpuidle/Makefile | 1 +
>> > > drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
>> > > 3 files changed, 185 insertions(+)
>> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
>> > >
>> > > diff --git a/drivers/cpuidle/Kconfig.arm
>> > > b/drivers/cpuidle/Kconfig.arm
>> > > index a1ee475d180d..484666dda38d 100644
>> > > --- a/drivers/cpuidle/Kconfig.arm
>> > > +++ b/drivers/cpuidle/Kconfig.arm
>> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
>> > > initialized by calling the CPU operations init idle hook
>> > > provided by architecture code.
>> > >
>> > > +config ARM_POLL_CPUIDLE
>> > > + bool "ARM64 CPU idle Driver with polling"
>> > > + depends on ARM64
>> > > + depends on ARM_ARCH_TIMER_EVTSTREAM
>> > > + select CPU_IDLE_MULTIPLE_DRIVERS
>> > > + help
>> > > + Select this to enable a polling cpuidle driver for ARM64:
>> > > + The first state polls TIF_NEED_RESCHED for best latency on
>> > > short
>> > > + sleep intervals. The second state falls back to
>> > > arch_cpu_idle() to
>> > > + wait for interrupt. This can be helpful in workloads
>> > > that
>> > > + frequently block/wake at short intervals or VMs where
>> > > wakeup IPIs
>> > > + are more expensive.
>> >
>> > Why is this a separate driver rather than an optional feature in
>> > the existing
>> > driver?
>> >
>> > The fact that this duplicates a bunch of code indicates to me that
>> > this should
>> > not be a separate driver.
>>
>> Also, the cpuidle-haltpoll driver is meant to do something quite
>> similar.
>> That driver polls adaptively based on the haltpoll governor's tuning
>> of
>> the polling period.
>>
>> However, cpuidle-haltpoll is currently x86 only. Mihai (also from
>> Oracle)
>> posted patches [1] adding support for ARM64.
>>
>> Haris, could you take a look at it and see if it does what you are
>> looking for? The polling path in the linked version also uses
>> smp_cond_load_relaxed() so even the mechanisms for both of these
>> are fairly similar.
>
> Hi Ankur,
>
> I agree, except for that small bug in exit condition, your haltpoll
> changes fundamentally do the same thing:

Yup. Will address that bug and a few other things in the next version.

>> @ int __cpuidle poll_idle(...
>> - if (!(ret & _TIF_NEED_RESCHED))
>> + if (ret & _TIF_NEED_RESCHE
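
For anyone following along, here is a minimal user-space sketch of the
polarity being discussed. It assumes the intent of the one-line change
quoted above is that polling should stop once the reschedule bit is
observed set, and otherwise give up after a time limit. This is not the
haltpoll or cpuidle code: sketch_poll(), sketch_now_ns() and
SKETCH_NEED_RESCHED are hypothetical names, and a plain relaxed atomic
load stands in for smp_cond_load_relaxed()/WFE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for _TIF_NEED_RESCHED; not the kernel's value. */
#define SKETCH_NEED_RESCHED (1u << 1)

/* Monotonic time in nanoseconds. */
static uint64_t sketch_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Poll *flags until SKETCH_NEED_RESCHED is set or limit_ns has elapsed.
 * Returns true if the bit was observed set, false on timeout.
 */
static bool sketch_poll(_Atomic unsigned int *flags, uint64_t limit_ns)
{
	uint64_t start = sketch_now_ns();

	assert(flags != NULL);

	do {
		unsigned int val =
			atomic_load_explicit(flags, memory_order_relaxed);

		/*
		 * Leave when the bit IS set. With the test inverted, the
		 * loop would instead be left on the first pass while
		 * there is still nothing to run.
		 */
		if (val & SKETCH_NEED_RESCHED)
			return true;
	} while (sketch_now_ns() - start < limit_ns);

	return false;
}
```

A caller would fall through to the deeper (WFI-like) state only when
sketch_poll() returns false, i.e. when the limit expired without a
wakeup request.
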
>
> I'll follow up with another patch for AWS Graviton when your team is
> finished.
>
> Do you have a rough ETA of when your changes will land in master?

That I guess would be determined by the maintainers, but I should be
able to send it out the coming week.

Thanks
Ankur

>>
>> (I'll be sending out the next version shortly. Happy to Cc you if you
>> would like to try that out.)
>
> Yes, please do!
>
> Thanks,
> Haris Okanovic
>
>>
>> Thanks
>> Ankur
>>
>> [1]
>> https://lore.kernel.org/lkml/1707982910-27680-1-git-send-email-mihai.carabas@xxxxxxxxxx/
>>
>> >
>> > > +
>> > > config ARM_PSCI_CPUIDLE
>> > > bool "PSCI CPU idle Driver"
>> > > depends on ARM_PSCI_FW
>> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
>> > > index d103342b7cfc..23c21422792d 100644
>> > > --- a/drivers/cpuidle/Makefile
>> > > +++ b/drivers/cpuidle/Makefile
>> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += cpuidle-ux500.o
>> > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
>> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
>> > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
>> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-polling.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-domain.o
>> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
>> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
>> > > b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > new file mode 100644
>> > > index 000000000000..bca128568114
>> > > --- /dev/null
>> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > @@ -0,0 +1,171 @@
>> > > +// SPDX-License-Identifier: GPL-2.0
>> > > +/*
>> > > + * ARM64 CPU idle driver using wfe polling
>> > > + *
>> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
>> > > reserved.
>> > > + *
>> > > + * Authors:
>> > > + * Haris Okanovic <harisokn@xxxxxxxxxx>
>> > > + * Brian Silver <silverbr@xxxxxxxxxx>
>> > > + *
>> > > + * Based on cpuidle-arm.c
>> > > + * Copyright (C) 2014 ARM Ltd.
>> > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
>> > > + */
>> > > +
>> > > +#include <linux/cpu.h>
>> > > +#include <linux/cpu_cooling.h>
>> > > +#include <linux/cpuidle.h>
>> > > +#include <linux/sched/clock.h>
>> > > +
>> > > +#include <asm/cpuidle.h>
>> > > +#include <asm/readex.h>
>> > > +
>> > > +#include "dt_idle_states.h"
>> > > +
>> > > +/* Max duration of the wfe() poll loop in us, before
>> > > transitioning to
>> > > + * arch_cpu_idle()/wfi() sleep.
>> > > + */
>> >
>> > /*
>> > * Comments should have the leading '/*' on a separate line.
>> > * See
>> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
>> > */
>> >
>> > > +#define DEFAULT_POLL_LIMIT_US 100
>> > > +static unsigned int poll_limit __read_mostly =
>> > > DEFAULT_POLL_LIMIT_US;
>> > > +
>> > > +/*
>> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule
>> > > is
>> > > + * needed or timeout
>> > > + */
>> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device
>> > > *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + u64 time_start, time_limit;
>> > > +
>> > > + time_start = local_clock();
>> > > + dev->poll_time_limit = false;
>> > > +
>> > > + local_irq_enable();
>> >
>> > Why enable IRQs here? We don't do that in the regular cpuidle-arm
>> > driver, nor
>> > the cpuidle-psci driver, and there's no explanation here or in the
>> > commit message.
>> >
>> > How does this interact with RCU? Is that still watching or are we
>> > in an
>> > extended quiescent state? For PSCI idle states we enter an EQS, and
>> > it seems
>> > like we probably should here...
>> >
>> > > +
>> > > + if (current_set_polling_and_test())
>> > > + goto end;
>> > > +
>> > > + time_limit = cpuidle_poll_time(drv, dev);
>> > > +
>> > > + do {
>> > > + // exclusive read arms the monitor for wfe
>> > > + if (__READ_ONCE_EX(current_thread_info()->flags) &
>> > > _TIF_NEED_RESCHED)
>> > > + goto end;
>> > > +
>> > > + // may exit prematurely, see
>> > > ARM_ARCH_TIMER_EVTSTREAM
>> > > + wfe();
>> > > + } while (local_clock() - time_start < time_limit);
>> >
>> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever
>> > rather than
>> > entering a deeper idle state, which doesn't seem desirable.
>> >
>> > It's worth noting that now that we have WFET, we'll probably want
>> > to disable
>> > the EVTSTREAM by default at some point, at least in some
>> > configurations, since
>> > that'll be able to sit in a WFE state for longer while also
>> > reliably waking up
>> > when required.
>> >
>> > I suspect we want something like an smp_load_acquire_timeout() here
>> > to do the
>> > wait in arch code (allowing us to use WFET), and enabling this
>> > state will
>> > depend on either having WFET or EVTSTREAM.
>> >
>> > > +
>> > > + dev->poll_time_limit = true;
>> > > +
>> > > +end:
>> > > + current_clr_polling();
>> > > + return idx;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_idle_wfi - Places cpu in lower power state until
>> > > interrupt,
>> > > + * a fallback to polling
>> > > + */
>> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + if (current_clr_polling_and_test()) {
>> > > + local_irq_enable();
>> > > + return idx;
>> > > + }
>> >
>> > Same as above, why enable IRQs here?
>> >
>> > > + arch_cpu_idle();
>> > > + return idx;
>> >
>> > .. and if we need to enable IRQs in the other cases above, why do
>> > we *not*
>> > need to enable them here?
>> >
>> > > +}
>> > > +
>> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
>> > > + .name = "arm_poll_idle",
>> > > + .owner = THIS_MODULE,
>> > > + .states = {
>> > > + {
>> > > + .enter = arm_idle_wfe_poll,
>> > > + .exit_latency = 0,
>> > > + .target_residency = 0,
>> > > + .exit_latency_ns = 0,
>> > > + .power_usage = UINT_MAX,
>> > > + .flags =
>> > > CPUIDLE_FLAG_POLLING,
>> > > + .name = "WFE",
>> > > + .desc = "ARM WFE",
>> > > + },
>> > > + {
>> > > + .enter = arm_idle_wfi,
>> > > + .exit_latency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .target_residency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .power_usage = UINT_MAX,
>> > > + .name = "WFI",
>> > > + .desc = "ARM WFI",
>> > > + },
>> > > + },
>> > > + .state_count = 2,
>> > > +};
>> >
>> > How does this interact with the existing driver?
>> >
>> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>> >
>> > > +
>> > > +/*
>> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver
>> > > for one cpu
>> > > + */
>> > > +static int __init arm_poll_init_cpu(int cpu)
>> > > +{
>> > > + int ret;
>> > > + struct cpuidle_driver *drv;
>> > > +
>> > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
>> > > GFP_KERNEL);
>> > > + if (!drv)
>> > > + return -ENOMEM;
>> > > +
>> > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
>> > > + drv->states[1].exit_latency = poll_limit;
>> > > + drv->states[1].target_residency = poll_limit;
>> > > +
>> > > + ret = cpuidle_register(drv, NULL);
>> > > + if (ret) {
>> > > + pr_err("failed to register driver: %d, cpu %d\n",
>> > > ret, cpu);
>> > > + goto out_kfree_drv;
>> > > + }
>> > > +
>> > > + pr_info("registered driver cpu %d\n", cpu);
>> >
>> > This does not need to be printed for each CPU.
>> >
>> > Mark.
>> >
>> > > +
>> > > + cpuidle_cooling_register(drv);
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_kfree_drv:
>> > > + kfree(drv);
>> > > + return ret;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_poll_init - Initializes arm cpuidle polling driver
>> > > + */
>> > > +static int __init arm_poll_init(void)
>> > > +{
>> > > + int cpu, ret;
>> > > + struct cpuidle_driver *drv;
>> > > + struct cpuidle_device *dev;
>> > > +
>> > > + for_each_possible_cpu(cpu) {
>> > > + ret = arm_poll_init_cpu(cpu);
>> > > + if (ret)
>> > > + goto out_fail;
>> > > + }
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_fail:
>> > > + pr_info("de-register all");
>> > > + while (--cpu >= 0) {
>> > > + dev = per_cpu(cpuidle_devices, cpu);
>> > > + drv = cpuidle_get_cpu_driver(dev);
>> > > + cpuidle_unregister(drv);
>> > > + kfree(drv);
>> > > + }
>> > > +
>> > > + return ret;
>> > > +}
>> > > +
>> > > +module_param(poll_limit, uint, 0444);
>> > > +device_initcall(arm_poll_init);
>> > > --
>> > > 2.34.1
>> > >
>> > >
>>
>>
>> --
>> ankur


--
ankur