[RFC PATCH v2 0/5] Idle Load Balance fixes and softirq enhancements

From: K Prateek Nayak
Date: Wed Sep 04 2024 - 07:15:18 EST


Hello folks,

Sorry for the delay in posting, but this is v2 of the idle load balance
fixes; the previous version can be found at [1]. This was broken out
into a separate series since it modifies some PREEMPT_RT bits that I
need some clarification on, hence the RFC. So without further ado ...

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG can
be pulled out of idle by setting TIF_NEED_RESCHED instead of sending an
actual IPI. This affects at least three scenarios that have been
described below:

1. A need_resched() check within a call function does not necessarily
   indicate a task wakeup, since a CPU intending to send an IPI to an
   idle target in TIF_POLLING_NRFLAG mode can simply queue the
   SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
   polling target out of idle. The SMP-call-function will be executed by
   flush_smp_call_function_queue() on the idle-exit path. On x86, where
   mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long idling,
   this leads to the idle load balancer bailing out early since the
   need_resched() check in nohz_csd_func() returns true in most
   instances.

2. A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
   up calling schedule() even in cases where the call function does not
   wake up a new task on the idle CPU, thus delaying the idle re-entry.

3. Julia Lawall reported a case where a softirq raised from an
   SMP-call-function on an idle CPU will wake up ksoftirqd since
   flush_smp_call_function_queue() executes in the idle thread's
   context. This can throw off the idle load balancer by making the idle
   CPU appear busy since ksoftirqd just woke up on the said CPU [2].
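
To make scenario (1.) concrete, here is a minimal userspace model of the
wakeup path, not kernel code. The struct, its fields, and the flag values
are hypothetical stand-ins chosen for illustration; only the flag and
function names mirror the ones discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's thread-info flag bits. */
#define TIF_NEED_RESCHED   (1u << 0)
#define TIF_POLLING_NRFLAG (1u << 1)

/* Hypothetical model of one target CPU; not a kernel structure. */
struct cpu_model {
	unsigned int flags;   /* thread-info flags of the idle task */
	int queued_calls;     /* SMP-call-functions waiting to be flushed */
	int ipis_received;    /* real IPIs delivered to this CPU */
	int tasks_enqueued;   /* tasks actually woken on this CPU */
};

/*
 * Queue a call function on the target. A polling idle target is pulled
 * out of idle by setting TIF_NEED_RESCHED; otherwise a real IPI is sent.
 */
static void send_call_function(struct cpu_model *target)
{
	target->queued_calls++;
	if (target->flags & TIF_POLLING_NRFLAG)
		target->flags |= TIF_NEED_RESCHED;
	else
		target->ipis_received++;
}

/* Mirrors the kind of check nohz_csd_func() used to bail out on. */
static bool need_resched(const struct cpu_model *cpu)
{
	return (cpu->flags & TIF_NEED_RESCHED) != 0;
}
```

In this model, calling send_call_function() on a CPU whose flags contain
TIF_POLLING_NRFLAG leaves need_resched() true while tasks_enqueued is
still zero, which is the ambiguity described in (1.).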

The solution to (2.) was sent independently in [3] since it does not
depend on the changes in this series, which rework some PREEMPT_RT
specific bits.

(1.) was solved by dropping the need_resched() check in nohz_csd_func()
(please refer to Patch 2/5 for the full explanation), which led to a
splat on PREEMPT_RT kernels [4].

Since flush_smp_call_function_queue() and the following
do_softirq_post_smp_call_flush() run with interrupts disabled, it is
not ideal for the IRQ handlers to raise a SOFTIRQ, as that prolongs the
IRQs-disabled section, especially on PREEMPT_RT kernels. For the time
being, the WARN_ON_ONCE() in do_softirq_post_smp_call_flush() has been
adjusted to allow raising a SCHED_SOFTIRQ from
flush_smp_call_function_queue(); however, its merit can be debated on
this RFC.
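
As a rough illustration of what "adjusting the WARN_ON_ONCE()" could look
like, the sketch below models the check in userspace. It is not the code
from Patch 1/5: flush_should_warn() and SCHED_SOFTIRQ_MASK are
hypothetical names, and the softirq numbers are assumed values used only
for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Softirq numbers assumed for this sketch only. */
enum { TIMER_SOFTIRQ = 1, SCHED_SOFTIRQ = 7 };

/* Hypothetical helper mask for the one softirq that is tolerated. */
#define SCHED_SOFTIRQ_MASK (1u << SCHED_SOFTIRQ)

/*
 * Compare the pending-softirq bitmask sampled before the flush with the
 * one seen after it. Returns true if the flush raised anything other
 * than SCHED_SOFTIRQ, i.e. the case that should still warn.
 */
static bool flush_should_warn(unsigned int was_pending,
			      unsigned int is_pending)
{
	if (was_pending == is_pending)
		return false;	/* nothing new was raised */

	/* Ignore a newly raised SCHED_SOFTIRQ, warn on everything else. */
	return was_pending != (is_pending & ~SCHED_SOFTIRQ_MASK);
}
```

With this shape, a flush that only raises SCHED_SOFTIRQ stays silent,
while a flush that raises, say, TIMER_SOFTIRQ still trips the warning.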

With the above solution, the problem discussed in (3.) is even more
prominent, with idle load balancing waking up ksoftirqd unnecessarily
(please refer to Patch 5/5 for a detailed explanation). v1 attempted to
solve this by introducing a per-cpu variable to keep track of an
impending call to do_softirq(). Peter suggested reusing the
softirq_ctrl::cnt that PREEMPT_RT uses to prevent wakeup of ksoftirqd
and unifying should_wakeup_ksoftirqd() [5]. Patches 3 and 4 prepare for
this unification, and Patch 5 adds and uses a new interface for
flush_smp_call_function_queue() to convey that a call to do_softirq() is
pending and there is no need to wake up ksoftirqd.
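
The idea can be sketched in userspace as follows. This is an assumption
about the general shape, not the implementation in Patch 5/5: the
function names come from this series, but their bodies, the
softirq_ctrl_model struct, the DO_SOFTIRQ_PENDING bit, and
raise_softirq_model() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; one instance stands in for one CPU. */
struct softirq_ctrl_model {
	unsigned int cnt;	/* non-zero: softirqs will be handled soon */
};

/* Hypothetical marker bit recording "a do_softirq() call is coming". */
#define DO_SOFTIRQ_PENDING (1u << 31)

static void set_do_softirq_pending(struct softirq_ctrl_model *c)
{
	c->cnt |= DO_SOFTIRQ_PENDING;
}

static void clr_do_softirq_pending(struct softirq_ctrl_model *c)
{
	c->cnt &= ~DO_SOFTIRQ_PENDING;
}

/* Wake ksoftirqd only if nobody has promised to run softirqs. */
static bool should_wakeup_ksoftirqd(const struct softirq_ctrl_model *c)
{
	return c->cnt == 0;
}

/*
 * Mark softirq 'nr' pending and report whether ksoftirqd would have
 * been woken for it.
 */
static bool raise_softirq_model(const struct softirq_ctrl_model *c,
				unsigned int *pending, unsigned int nr)
{
	*pending |= 1u << nr;
	return should_wakeup_ksoftirqd(c);
}
```

In this model, a caller standing in for flush_smp_call_function_queue()
would call set_do_softirq_pending() before running the queued functions
and clr_do_softirq_pending() just before processing softirqs itself, so
a softirq raised in between does not wake ksoftirqd.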

Chenyu had reported a regression when running a modified version of
ipistorm that performs a fixed set of IPIs between two CPUs on his
setup with the whole v1 applied. I've benchmarked this series on both an
AMD and an Intel system to catch any significant regression early.
Following are the numbers from a dual socket Intel Ice Lake Xeon server
(2 x 32C/64T) and a 3rd Generation AMD EPYC system (2 x 64C/128T)
running ipistorm between CPU8 and CPU16 (unless stated otherwise with *):

base: tip/master at commit 5566819aeba0 ("Merge branch into tip/master:
'x86/timers'") based on v6.11-rc6 + Patch from [1]

==================================================================
Test          : ipistorm (modified)
Units         : % improvement over base kernel
Interpretation: Higher is better
======================= Intel Ice Lake Xeon ======================
kernel:                                           [pct imp]
performance gov, boost on                           -3%
powersave gov, boost on                             -2%
performance gov, boost off                          -3%
performance gov, boost off, cross node *            -3%
==================== 3rd Generation AMD EPYC =====================
kernel:                                           [pct imp]
performance gov, boost on, !PREEMPT_RT              36%
performance gov, boost on, PREEMPT_RT               54%
==================================================================

* cross node setup used CPU 16 on Node 0 and CPU 17 on Node 1 on the
dual socket Intel Ice Lake Xeon system.

Improvements on PREEMPT_RT can perhaps be attributed to cacheline
aligning the per-cpu softirq_ctrl variable.
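
For readers unfamiliar with the effect, the sketch below shows in plain
C11 what cacheline-aligning a small per-CPU structure means. It is a
userspace illustration only: the 64-byte line size, the struct name, and
the array standing in for per-CPU storage are all assumptions, and the
kernel uses its own per-cpu annotations rather than alignas().

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cacheline size for this example. */
#define CACHELINE_BYTES 64

/*
 * Hypothetical stand-in for a small per-CPU control structure. Aligning
 * the member pads the whole struct out to one cacheline, so neighbouring
 * instances never share a line.
 */
struct softirq_ctrl_model {
	alignas(CACHELINE_BYTES) int cnt;
};

/* An array standing in for the per-CPU copies of four CPUs. */
static struct softirq_ctrl_model per_cpu_ctrl[4];

/* Distance in bytes between the copies of two adjacent CPUs. */
static ptrdiff_t ctrl_stride(void)
{
	return (char *)&per_cpu_ctrl[1] - (char *)&per_cpu_ctrl[0];
}
```

Here each instance occupies its own 64-byte line, so a write to one
CPU's cnt does not invalidate the line holding another CPU's copy.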

This series has been marked RFC since this is my first attempt at
dealing with PREEMPT_RT nuances. Any and all feedback is appreciated.

[1] https://lore.kernel.org/lkml/20240710090210.41856-1-kprateek.nayak@xxxxxxx/
[2] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@xxxxxxxx/
[3] https://lore.kernel.org/lkml/20240809092240.6921-1-kprateek.nayak@xxxxxxx/
[4] https://lore.kernel.org/lkml/225e6d74-ed43-51dd-d1aa-c75c86dd58eb@xxxxxxx/
[5] https://lore.kernel.org/lkml/20240710150557.GB27299@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
---
v1..v2:

- Broke the PREEMPT_RT unification and idle load balance fixes into a
  separate series (this one); the SM_IDLE fast-path enhancements are
  posted separately.

- Worked around the splat on PREEMPT_RT kernels caused by raising
  SCHED_SOFTIRQ from nohz_csd_func() in the context of
  flush_smp_call_function_queue(), which is undesirable on PREEMPT_RT
  kernels. (Please refer to commit 1a90bfd22020 ("smp: Make softirq
  handling RT safe in flush_smp_call_function_queue()"))

- Reused softirq_ctrl::cnt from PREEMPT_RT to prevent unnecessary
  wakeups of ksoftirqd. (Peter)
  This unifies should_wakeup_ksoftirqd() and adds an interface to
  indicate an impending call to do_softirq() (set_do_softirq_pending())
  and to clear it just before fulfilling the promise
  (clr_do_softirq_pending()).

- More benchmarking.

--
K Prateek Nayak (5):
  softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT
    kernel
  sched/core: Remove the unnecessary need_resched() check in
    nohz_csd_func()
  softirq: Mask reads of softirq_ctrl.cnt with SOFTIRQ_MASK for
    PREEMPT_RT
  softirq: Unify should_wakeup_ksoftirqd()
  softirq: Avoid unnecessary wakeup of ksoftirqd when a call to
    do_sofirq() is pending

 kernel/sched/core.c |  2 +-
 kernel/sched/smp.h  |  9 +++++
 kernel/smp.c        |  2 +
 kernel/softirq.c    | 97 +++++++++++++++++++++++++++------------------
 4 files changed, 71 insertions(+), 39 deletions(-)

--
2.34.1