Re: [RFC][PATCH] x86: optimization to avoid CAL+RES IPIs

From: Andy Lutomirski
Date: Fri Jul 17 2020 - 23:14:53 EST


> On Jul 17, 2020, at 7:13 PM, Josh Don <joshdon@xxxxxxxxxx> wrote:
>
> ïFrom: Venkatesh Pallipadi <venki@xxxxxxxxxx>
>
> smp_call_function_single and smp_send_reschedule send unconditional IPI
> to target CPU. However, if the target CPU is in some form of poll based
> idle, we can do IPI-less wakeups.
>
> Doing this has certain advantages:
> * Lower overhead on Async "no wait" IPI send path.
> * Avoiding actual interrupts reduces system non-idle cycles.
>
> Note that this only helps when target CPU is idle. When it is busy we
> will still send an IPI as before.

PeterZ and I fixed a whole series of bugs a few years ago, and remote
wakeups *should* already do this. Did we miss something? Did it
regress? Even the call_function_single path ought to go through this:

void send_call_function_single_ipi(int cpu)
{
struct rq *rq = cpu_rq(cpu);

if (!set_nr_if_polling(rq->idle))
arch_send_call_function_single_ipi(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
}


>
> *** RFC NOTE ***
> This patch breaks idle time accounting (and to a lesser degree, softirq
> accounting). This is because this patch violates the assumption that
> softirq can only be run either on the tail of a hard IRQ or inline on
> a non-idle thread via local_bh_enable(), since we can now process
> softirq inline within the idle loop. These ssues can be resolved in a
> later version of this patch.
>
> Signed-off-by: Josh Don <joshdon@xxxxxxxxxx>
> ---
> arch/x86/include/asm/mwait.h | 5 +-
> arch/x86/include/asm/processor.h | 1 +
> arch/x86/include/asm/thread_info.h | 2 +
> arch/x86/kernel/apic/ipi.c | 8 +++
> arch/x86/kernel/smpboot.c | 4 ++
> drivers/cpuidle/poll_state.c | 5 +-
> include/linux/ipiless_wake.h | 93 ++++++++++++++++++++++++++++++
> kernel/sched/idle.c | 10 +++-
> 8 files changed, 124 insertions(+), 4 deletions(-)
> create mode 100644 include/linux/ipiless_wake.h
>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index e039a933aca3..aed393f38a39 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -2,6 +2,7 @@
> #ifndef _ASM_X86_MWAIT_H
> #define _ASM_X86_MWAIT_H
>
> +#include <linux/ipiless_wake.h>
> #include <linux/sched.h>
> #include <linux/sched/idle.h>
>
> @@ -109,6 +110,7 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
> static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> {
> if (static_cpu_has_bug(X86_BUG_MONITOR) || !current_set_polling_and_test()) {
> + enter_ipiless_idle();
> if (static_cpu_has_bug(X86_BUG_CLFLUSH_MONITOR)) {
> mb();
> clflush((void *)&current_thread_info()->flags);
> @@ -116,8 +118,9 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> }
>
> __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> + if (!is_ipiless_wakeup_pending())
> __mwait(eax, ecx);
> + exit_ipiless_idle();
> }
> current_clr_polling();
> }
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 03b7c4ca425a..045fc9bbd095 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -568,6 +568,7 @@ static inline void arch_thread_struct_whitelist(unsigned long *offset,
> * have to worry about atomic accesses.
> */
> #define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
> +#define TS_IPILESS_WAKEUP 0x0010 /* pending IPI-work on idle exit */

What's this for?

> +#define _TIF_IN_IPILESS_IDLE (1 << TIF_IN_IPILESS_IDLE)

We already have TIF_POLLING_NRFLAG. Why do we need this?


> #include "local.h"
> @@ -67,11 +68,18 @@ void native_smp_send_reschedule(int cpu)
> WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
> return;
> }
> +
> + if (try_ipiless_wakeup(cpu))
> + return;
> +

I think this shouldnât be called if the target is idle unless we lost
a race. What am I missing?

> apic->send_IPI(cpu, RESCHEDULE_VECTOR);
> }
>
> void native_send_call_func_single_ipi(int cpu)
> {
> + if (try_ipiless_wakeup(cpu))
> + return;
> +
> apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
> }

This function's caller already does this.

> +static inline void exit_ipiless_idle(void)
> +{
> + if (!test_and_clear_thread_flag(TIF_IN_IPILESS_IDLE)) {

This has the unfortunate property that, if you try to wake the same
polling CPU twice in rapid succession, the second try will send an
IPI. One might debate whether this is a real problem, but the
existing code does not have this issue.

I could poke at this more, but I'm wondering whether you might have
developed this against a rather old kernel and blindly forward-ported
it.

--Andy