Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()

From: K Prateek Nayak
Date: Mon Aug 05 2024 - 00:04:13 EST


Hello Chenyu,

Thank you for testing the series. I'll have a second version out soon.

On 8/4/2024 9:35 AM, Chen Yu wrote:
On 2024-07-31 at 00:13:40 +0800, Chen Yu wrote:
On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG
mode can be pulled out of idle by setting the TIF_NEED_RESCHED flag to
service an IPI without actually sending an interrupt. Even in cases
where the IPI handler does not queue a task on the idle CPU, do_idle()
will call __schedule() since need_resched() returns true in these
cases.
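
(As an aside for readers, the IPI elision works roughly as sketched
below, modeled on set_nr_if_polling() in kernel/sched/core.c;
try_skip_ipi() is a hypothetical name used only for illustration:)

	/*
	 * Sketch modeled on set_nr_if_polling(): if the idle task is
	 * spinning on TIF_POLLING_NRFLAG, atomically setting
	 * TIF_NEED_RESCHED makes it break out of its polling loop, so
	 * no interrupt has to be sent.
	 */
	static bool try_skip_ipi(struct task_struct *idle_task)
	{
		struct thread_info *ti = task_thread_info(idle_task);
		typeof(ti->flags) val = READ_ONCE(ti->flags);

		for (;;) {
			if (!(val & _TIF_POLLING_NRFLAG))
				return false;	/* not polling: send the IPI */
			if (val & _TIF_NEED_RESCHED)
				return true;	/* already set: nothing to do */
			if (try_cmpxchg(&ti->flags, &val,
					val | _TIF_NEED_RESCHED))
				return true;	/* flag set: IPI elided */
		}
	}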

Introduce and use SM_IDLE to identify calls to __schedule() from
schedule_idle(), and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the
idle task.
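
(For reference, the fast path described above amounts to roughly the
following early exit in __schedule(); a sketch based on this
description, not the verbatim hunk from the patch:)

	/* Sketch: early in __schedule(), before the usual dequeue/pick path */
	if (sched_mode == SM_IDLE) {
		/*
		 * The IPI only set TIF_NEED_RESCHED; nothing was queued,
		 * so keep running the idle task and skip pick_next_task().
		 */
		if (!rq->nr_running) {
			next = prev;
			goto picked;
		}
	}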

With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves significantly. Following are the numbers
from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
C2 disabled) running ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
==================================================================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.25 [75.11%]

[ kprateek: Commit log and testing ]

Link: https://lore.kernel.org/lkml/20240615012814.GP8774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
Not-yet-signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>


With only the current patch applied on top of sched/core commit
c793a62823d1, a significant improvement in throughput and run-to-run
variance is observed on an Intel server with 240 CPUs and 2 NUMA
nodes. C-states >= C1E are disabled, the CPU frequency governor is set
to performance, and turbo boost is disabled.

Without the patch (lower is better):

158490995
113086433
737869191
302454894
731262790
677283357
729767478
830949261
399824606
743681976

(Amean): 542467098
(Std): 257011706


With the patch (lower is better):
128060992
115646768
132734621
150330954
113143538
169875051
145010400
151589193
162165800
159963320

(Amean): 142852063
(Std): 18646313

I've launched full tests for schbench/hackbench/netperf/tbench
to see if there is any difference.


Tested without CONFIG_PREEMPT_RT, so the SM_RTLOCK_WAIT issue
mentioned by Vincent should not have any impact. No obvious difference
(regression) was detected by the tests in the 0day environment.
Overall this patch looks good to me. Once you send a refreshed version
out, I'll re-launch the tests.

Since SM_RTLOCK_WAIT is only used by schedule_rtlock(), which is only
defined for PREEMPT_RT kernels, non-RT builds should have no issue. I
could spot at least one case in rtlock_slowlock_locked() where
prev->__state is set to TASK_RTLOCK_WAIT and schedule_rtlock() is
called. With this patch, that call would pass the
"sched_mode > SM_NONE" check and be counted as an involuntary context
switch, but on tip, (preempt & SM_MASK_PREEMPT) would return false and
__schedule() would eventually call deactivate_task() to dequeue the
waiting task, so this does need fixing.
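
To make that concrete, the two conditions compare roughly as follows
(surrounding code elided; a sketch, not the exact hunks):

	/*
	 * tip: on PREEMPT_RT, SM_MASK_PREEMPT == SM_PREEMPT, so
	 * SM_RTLOCK_WAIT does not count as a preemption and the rtlock
	 * waiter takes the voluntary path ...
	 */
	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
		/* ... and is eventually dequeued via deactivate_task() */
	}

	/*
	 * this patch as posted: SM_RTLOCK_WAIT (2) > SM_NONE (0), so
	 * the same call is counted as a preemption and the waiter would
	 * wrongly stay on the runqueue.
	 */
	if (!(sched_mode > SM_NONE) && prev_state) {
		/* ... */
	}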

From a brief look, all calls into __schedule() with SM_RTLOCK_WAIT
already have task->__state set to a non-zero value. I'll look into
this further after the respin and see if there is some optimization
possible there, but for the time being, I'll respin this with the
condition changed to:

...
} else if (preempt != SM_PREEMPT && prev_state) {
...

just to keep it explicit.

Thank you again for testing this version.
--
Thanks and Regards,
Prateek


Tested on a Xeon server with 128 CPUs, 4 NUMA nodes, under different
load levels (25% ~ 100%):

          baseline       with-SM_IDLE

hackbench-pipe-process.throughput
25%:       846099  -0.3%    843217
50%:       972015  +0.0%    972185
100%:     1395650  -1.3%   1376963

hackbench-pipe-threads.throughput
25%:       746629  -0.0%    746345
50%:       885165  -0.4%    881602
100%:     1227790  +1.3%   1243757

hackbench-socket-process.throughput
25%:       395784  +1.2%    400717
50%:       441312  +0.3%    442783
100%:      324283 ± 2%  +6.0%  343826

hackbench-socket-threads.throughput
25%:       379700  -0.8%    376642
50%:       425315  -0.4%    423749
100%:      311937 ± 2%  +0.9%  314892



                              baseline       with-SM_IDLE

schbench.request_latency_90%_us
1-mthread-1-worker:              4562  -0.0%       4560
1-mthread-16-workers:            4564  -0.0%       4563
12.5%-mthread-1-worker:          4565  +0.0%       4567
12.5%-mthread-16-workers:       39204  +0.1%      39248
25%-mthread-1-worker:            4574  +0.0%       4574
25%-mthread-16-workers:        161944  +0.1%     162053
50%-mthread-1-worker:            4784 ± 5%  +0.1%  4789 ± 5%
50%-mthread-16-workers:        659156  +0.4%     661679
100%-mthread-1-worker:           9328  +0.0%       9329
100%-mthread-16-workers:      2489753  -0.7%    2472140


                  baseline       with-SM_IDLE

netperf.Throughput:

25%-TCP_RR:        2449875  +0.0%   2450622   netperf.Throughput_total_tps
25%-UDP_RR:        2746806  +0.1%   2748935   netperf.Throughput_total_tps
25%-TCP_STREAM:    1352061  +0.7%   1361497   netperf.Throughput_total_Mbps
25%-UDP_STREAM:    1815205  +0.1%   1816202   netperf.Throughput_total_Mbps
50%-TCP_RR:        3981514  -0.3%   3970327   netperf.Throughput_total_tps
50%-UDP_RR:        4496584  -1.3%   4438363   netperf.Throughput_total_tps
50%-TCP_STREAM:    1478872  +0.4%   1484196   netperf.Throughput_total_Mbps
50%-UDP_STREAM:    1739540  +0.3%   1744074   netperf.Throughput_total_Mbps
75%-TCP_RR:        3696607  -0.5%   3677044   netperf.Throughput_total_tps
75%-UDP_RR:        4161206  +1.3%   4217274 ± 2%  netperf.Throughput_total_tps
75%-TCP_STREAM:     895874  +5.7%    946546 ± 5%  netperf.Throughput_total_Mbps
75%-UDP_STREAM:    4100019  -0.3%   4088367   netperf.Throughput_total_Mbps
100%-TCP_RR:       6724456  -1.7%   6610976   netperf.Throughput_total_tps
100%-UDP_RR:       7329959  -0.5%   7294653   netperf.Throughput_total_tps
100%-TCP_STREAM:    808165  +0.3%    810360   netperf.Throughput_total_Mbps
100%-UDP_STREAM:   5562651  +0.0%   5564106   netperf.Throughput_total_Mbps

thanks,
Chenyu