Re: [PATCH v2] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()

From: K Prateek Nayak
Date: Wed Sep 04 2024 - 07:22:32 EST


Hello Vincent,

On 9/4/2024 12:54 PM, Vincent Guittot wrote:
> On Fri, 9 Aug 2024 at 11:22, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:

>> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>
>> Since commit b2a02fc43a1f ("smp: Optimize
>> send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
>> can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
>> IPI without actually sending an interrupt. Even in cases where the IPI
>> handler does not queue a task on the idle CPU, do_idle() will call
>> __schedule() since need_resched() returns true in these cases.
>>
>> Introduce and use SM_IDLE to identify call to __schedule() from
>> schedule_idle() and shorten the idle re-entry time by skipping
>> pick_next_task() when nr_running is 0 and the previous task is the idle
>> task.
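As a side note for anyone skimming the thread: the fast-path described in the quoted commit message boils down to a single early check. The sketch below is an illustrative model only; the struct, field names, and SM_* values are assumptions for this sketch, not the actual kernel diff:

```c
#include <stdbool.h>

/*
 * Illustrative model of the SM_IDLE fast-path decision, NOT the real
 * kernel code. When __schedule() is entered from schedule_idle()
 * (SM_IDLE) and the runqueue is empty with the idle task as prev,
 * pick_next_task() can be skipped and the CPU simply re-enters idle.
 */

#define SM_NONE 0
#define SM_IDLE 1 /* assumed value, for illustration only */

struct rq_stub {
	unsigned int nr_running; /* tasks queued on this runqueue */
	bool prev_is_idle;       /* was the previous task the idle task? */
};

/*
 * True when the idle re-entry fast-path applies and pick_next_task()
 * can be skipped.
 */
static bool sm_idle_fast_path(int sched_mode, const struct rq_stub *rq)
{
	return sched_mode == SM_IDLE &&
	       rq->nr_running == 0 &&
	       rq->prev_is_idle;
}
```

In the real kernel all three facts come from the runqueue and the sched_mode argument, so the check costs almost nothing on the idle re-entry path.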
>>
>> With the SM_IDLE fast-path, the time taken to complete a fixed set of
>> IPIs using ipistorm improves noticeably. Following are the numbers
>> from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
>> 3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
>> running ipistorm between CPU8 and CPU16:
>>
>> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
>>
>> ==================================================================
>> Test          : ipistorm (modified)
>> Units         : Normalized runtime
>> Interpretation: Lower is better
>> Statistic     : AMean
>> ======================= Intel Ice Lake Xeon ======================
>> kernel:                        time [pct imp]
>> tip:sched/core                 1.00 [baseline]
>> tip:sched/core + SM_IDLE       0.80 [20.51%]
>> ==================== 3rd Generation AMD EPYC =====================
>> kernel:                        time [pct imp]
>> tip:sched/core                 1.00 [baseline]
>> tip:sched/core + SM_IDLE       0.90 [10.17%]
>> ==================================================================
>>
>> [ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
>>
>> Link: https://lore.kernel.org/lkml/20240615012814.GP8774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>> Not-yet-signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>

> Acked-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

Thank you for the ack.


>> ---
>> v1..v2:
>>
>> - Fixed SM_RTLOCK_WAIT being considered as preemption for task state
>>   change on PREEMPT_RT kernels. Since (sched_mode & SM_MASK_PREEMPT) was
>>   used in a couple of places, I decided to reuse the preempt variable.
>>   (Vincent, Peter)
>>
>> - Separated this patch from the newidle_balance() fixes series since
>>   there are PREEMPT_RT bits that require deeper review, whereas this is
>>   an independent enhancement on its own.

> What is the status of the other part of the v1 patchset, to run idle load
> balance instead of newly idle load balance?

I've just posted the series to LKML. You can find it here:
https://lore.kernel.org/lkml/20240904111223.1035-1-kprateek.nayak@xxxxxxx/

Sorry for the delay and thank you again for the review.

--
Thanks and Regards,
Prateek



[..snip..]