On Fri, 9 Aug 2024 at 11:22, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG
mode can be pulled out of idle by setting the TIF_NEED_RESCHED flag to
service an IPI without actually sending an interrupt. Even when the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true.
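
For reference, the wake-without-IPI path introduced by that commit
looks roughly as below (based on kernel/sched/core.c at the time;
details may differ across versions):

    void send_call_function_single_ipi(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /*
             * If the idle task on @cpu polls on TIF_NEED_RESCHED,
             * setting the flag is enough to have the IPI serviced on
             * idle exit; otherwise fall back to a real interrupt.
             */
            if (!set_nr_if_polling(rq->idle))
                    arch_send_call_function_single_ipi(cpu);
            else
                    trace_sched_wake_idle_without_ipi(cpu);
    }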
Introduce and use SM_IDLE to identify calls to __schedule() from
schedule_idle(), and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
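
An illustrative sketch of the fast path, based on the description
above (not the exact diff):

    static void __sched notrace __schedule(int sched_mode)
    {
            ...
            if (sched_mode == SM_IDLE) {
                    /*
                     * The idle task re-entered __schedule() (e.g.
                     * after a TIF_NEED_RESCHED based wakeup) but
                     * nothing was actually queued: keep running the
                     * idle task and skip pick_next_task().
                     */
                    if (!rq->nr_running) {
                            next = prev;
                            goto picked;
                    }
            } else if (!preempt && prev_state) {
                    /* Usual voluntary schedule / blocking path. */
                    ...
            }
            ...
    }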
With the SM_IDLE fast path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers from
a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and a
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled),
running ipistorm between CPU8 and CPU16:
cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
======================= Intel Ice Lake Xeon ======================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.80 [20.51%]
==================== 3rd Generation AMD EPYC =====================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.90 [10.17%]
==================================================================
[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
Link: https://lore.kernel.org/lkml/20240615012814.GP8774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
Not-yet-signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Acked-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
v1..v2:
- Fixed SM_RTLOCK_WAIT being treated as a preemption for the task state
  change on PREEMPT_RT kernels. Since (sched_mode & SM_MASK_PREEMPT) was
  used in a couple of places, I decided to reuse the preempt variable.
  (Vincent, Peter) (See the sketch after this changelog for the
  pre-existing SM_MASK_PREEMPT semantics.)
- Separated this patch from the newidle_balance() fixes series since
  there are PREEMPT_RT bits that require deeper review, whereas this is
  an independent enhancement on its own.
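
For context on the SM_RTLOCK_WAIT point above, the pre-patch constants
(as in kernel/sched/core.c before this change) are sketched below: on
!PREEMPT_RT every non-zero sched_mode counts as a preemption, while
PREEMPT_RT must distinguish a preemption from blocking on a 'sleeping'
spin/rwlock, so SM_RTLOCK_WAIT must not take the preemption path for
the task state change:

    #define SM_NONE                 0x0
    #define SM_PREEMPT              0x1
    #define SM_RTLOCK_WAIT          0x2

    #ifndef CONFIG_PREEMPT_RT
    /* On !RT, any non-zero sched_mode is treated as a preemption. */
    # define SM_MASK_PREEMPT        (~0U)
    #else
    /* On RT, only SM_PREEMPT is; SM_RTLOCK_WAIT still blocks. */
    # define SM_MASK_PREEMPT        SM_PREEMPT
    #endif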
What is the status of the other part of the v1 patchset that runs the
idle load balance instead of the newly idle load balance?
[..snip..]