Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu

From: Daniel Bristot de Oliveira
Date: Tue Jan 17 2023 - 06:21:07 EST


On 1/15/23 10:15, Ingo Molnar wrote:
>
> * Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:
>
>> idle=poll is frequently used on ultra-low-latency systems. Examples of
>> such systems are high-performance trading and 5G NVRAM. The performance
>> gain is given by avoiding the idle driver machinery and by keeping the
>> CPU is always in an active state - avoiding (odd) hardware heuristics that
>> are out of the control of the OS.
>>
>> Currently, idle=poll is an all-or-nothing static option defined at
>> boot time. The motivation for creating this option dynamic and per-cpu
>> are two:
>>
>> 1) Reduce the power usage/heat by allowing only selected CPUs to
>> do idle polling;
>> 2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>> polling only when ultra-low-latency applications are present
>> on specific CPUs.
>>
>> Joe Mario did some experiments with this option enabled, and the results
>> were significant. For example, by using dynamic idle polling on
>> selected CPUs, cyclictest performance is optimal (like when using
>> idle=poll), but cpu power consumption drops from 381 to 233 watts.
>>
>> Also, limiting idle=poll to the set of CPUs that benefits from
>> it allows other CPUs to benefit from frequency boosts. Joe also
>> shows that the results can be in the order of 80nsec round trip
>> improvement when system-wide idle=poll was not used.
>>
>> The user can enable idle polling with this command:
>> # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> And disable it via:
>> # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> By default, all CPUs have idle polling disabled (the current behavior).
>> A static key avoids the CPU mask check overhead when no idle polling
>> is enabled.
>
> Sounds useful in general.
>
> A couple of observations:
>
> ABI: how about putting the new file into the existing
> /sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle?
> Arguably this flag is an extension of it.
>

I tried that, but then this option will depend on CONFIG_CPU_IDLE, which... is not
away set, and idle_poll does not depend on now... so I am not sure if it is
the best option... or am I missing something? suggestions?

>> extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>
>> +/*
>> + * per-cpu idle polling selector.
>> + */
>> +static struct cpumask cpu_poll_mask;
>> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
>> +
>> +/*
>> + * Protects the mask/static key relation.
>> + */
>> +DEFINE_MUTEX(cpu_poll_mutex);
>> +
>> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + int cpu = dev->id;
>> + int retval, set;
>> + bool val;
>> +
>> + retval = kstrtobool(buf, &val);
>> + if (retval)
>> + return retval;
>> +
>> + mutex_lock(&cpu_poll_mutex);
>> +
>> + if (val) {
>> + set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
>> +
>> + /*
>> + * If the CPU was already on, do not increase the static key usage.
>> + */
>> + if (!set)
>> + static_branch_inc(&cpu_poll_enabled);
>> + } else {
>> + set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
>> +
>> + /*
>> + * If the CPU was already off, do not decrease the static key usage.
>> + */
>> + if (set)
>> + static_branch_dec(&cpu_poll_enabled);
>> + }
>
> Nit: I think 'old_bit' or so is easier to read than a generic 'set'?

ack

>
>> +
>> + mutex_unlock(&cpu_poll_mutex);
>
> Also, is cpu_poll_mutex locking really necessary, given that these bitops
> methods are atomic, and CPUs observe cpu_poll_enabled without taking any
> locks?

you are right, it is not needed. I will remove it.

>> +static int is_cpu_idle_poll(int cpu)
>> +{
>> + if (static_branch_unlikely(&cpu_poll_enabled))
>> + return cpumask_test_cpu(cpu, &cpu_poll_mask);
>> +
>> + return 0;
>> +}
>
> static inline might be justified in this case I guess.

ack

>> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>>
>> static noinline int __cpuidle cpu_idle_poll(void)
>> {
>> - trace_cpu_idle(0, smp_processor_id());
>> + int cpu = smp_processor_id();
>> +
>> + trace_cpu_idle(0, cpu);
>> stop_critical_timings();
>> ct_idle_enter();
>> local_irq_enable();
>>
>> while (!tif_need_resched() &&
>> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> + (cpu_idle_force_poll || tick_check_broadcast_expired()
>> + || is_cpu_idle_poll(cpu)))
>> cpu_relax();
>>
>> ct_idle_exit();
>> start_critical_timings();
>> - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>> + trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>>
>> return 1;
>
> So I think the introduction of the 'cpu' local variable to clean up the
> flow of cpu_idle_poll() should be a separate preparatory patch, which will
> make the addition of the is_cpu_idle_poll() call a bit easier to read in
> the second patch.

Makes sense.

>> }
>> @@ -296,7 +384,8 @@ static void do_idle(void)
>> * broadcast device expired for us, we don't want to go deep
>> * idle as we know that the IPI is going to arrive right away.
>> */
>> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> + if (cpu_idle_force_poll || tick_check_broadcast_expired()
>> + || is_cpu_idle_poll(cpu)) {
>> tick_nohz_idle_restart_tick();
>> cpu_idle_poll();
>
> Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll
> check, and before the tick_check_broadcast_expired() check?

Right.

> Shouldn't matter to the outcome, but for consistency's sake.

Maybe, we can move the cpu_idle_force_poll check inside cpu_idle_force_poll()?

because...

> Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be
> implemented as 0/all setting of cpu_poll_mask, eliminating the
> cpu_idle_force_poll flag? As a third patch on top.

I started doing it, but then I noticed some points:

- the cpu_idle_force_poll can stack, as platforms can call cpu_idle_poll_ctrl(true)
on top of idle=poll. So we would still need an integer to count how many times the
cpu_idle_force_poll was called.

- call to cpu_idle_poll_ctrl(false) when cpu_idle_force_poll reaches 0 cannot
unset all bits from the cpu_poll_mask because the user setup would be lost.

So I think that cpu_idle_force_poll is being used for two purposes: 1) user setting
via idle=poll, and 2) as a kernel facility via cpu_idle_poll_ctrl(true/false) other
than idle=poll.

So, maybe we can make idle=poll to change the initial value of the bitmask to
all 1 (with the addition that the user can now undo it), and keep cpu_idle_force_poll
for internal use?

Thanks!
-- Daniel
>
> Thanks,
>
> Ingo