Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

From: Swapnil Sapkal
Date: Thu Aug 03 2023 - 01:54:09 EST


Hello Mathieu,

On 7/27/2023 12:26 AM, Mathieu Desnoyers wrote:
On 7/26/23 13:40, Shrikanth Hegde wrote:


On 7/26/23 7:37 PM, Mathieu Desnoyers wrote:
On 7/26/23 04:04, Shrikanth Hegde wrote:


On 7/26/23 1:00 AM, Mathieu Desnoyers wrote:
Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
has exited the idle loop.

This speeds up the following hackbench workload on a 192-core AMD EPYC
9654 96-Core Processor system (over 2 sockets):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 34s. (30% speedup)

My working hypothesis for why this helps is: queuing more than a single
task on the runqueue of a cpu which just exited idle, rather than
spreading work over other idle cpus, helps power efficiency on systems
with a large number of cores.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench.

Do you have SMT here? What is the system utilization when you are running
this workload?

Yes, SMT is enabled, which brings the number of logical cpus to 384.

CPU utilization (through htop):

* 6.4.4:                                           27500%
* 6.4.4 with the extend-idle+nr_running<=4 patch:  30500%



It turned out that replacing this raw spinlock with a loop of 10000x
cpu_relax() within do_idle() had similar benefits.
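For reference, that experiment amounts to roughly the following busy-wait
(a sketch only, with a made-up helper name, not the exact diff that was tested):

/*
 * Sketch only: a ~10000-iteration cpu_relax() busy-wait placed in
 * do_idle(), standing in for the raw spinlock mentioned above.
 * Assumes kernel context where cpu_relax() is available.
 */
static inline void do_idle_relax_loop(void)
{
	int i;

	for (i = 0; i < 10000; i++)
		cpu_relax();
}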

This patch achieves a similar effect without the busy-waiting by
introducing a runqueue state sampling sched_clock() when exiting
idle, which allows select_task_rq to consider "as idle" a cpu which has
recently exited idle.

This patch should be considered "food for thought", and I would be glad
to hear feedback on whether it causes regressions on _other_ workloads,
and whether it helps with the hackbench workload on large Intel systems
as well.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
---
   kernel/sched/core.c  | 4 ++++
   kernel/sched/sched.h | 3 +++
   2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d40e3a0a5ced 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
        * TASK_RUNNING state.
        */
       WARN_ON_ONCE(current->__state);
+    WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
       do {
           __schedule(SM_NONE);
       } while (need_resched());
@@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
   {
       struct rq *rq = cpu_rq(cpu);
   +    if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)


Wouldn't this hurt the latency badly? Especially on a loaded system with
a workload that does a lot of wakeups.

Good point!

Can you try your benchmark replacing the if () statement above with:

+       if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
+           READ_ONCE(rq->nr_running) <= 4)
+               return 1;


Tried with this change. I think it does help in reducing latency compared to
earlier, especially up to the 95th percentile.

For the record, I also tried with nr_running <= 2 and still had decent performance
(32s with nr_running <= 2 instead of 30s for nr_running <= 4). It did drop with
nr_running <= 1 (40s). nr_running <= 5 was similar to 4, and performance starts
degrading with nr_running <= 8 (31s).

So it might be interesting to measure the latency with nr_running <= 2 as well.
Perhaps nr_running <= 2 would be a good compromise between throughput and tail
latency.

                 6.5-rc3      6.5-rc3+RFC_Patch     6.5-rc3+RFC_Patch
                                                      + nr<4
4 Groups
50.0th:          18.00                18.50           18.50
75.0th:          21.50                26.00           23.50
90.0th:          56.00                940.50          501.00
95.0th:          678.00               1896.00         1392.00
99.0th:          2484.00              3756.00         3708.00
99.5th:          3224.00              4616.00         5088.00
99.9th:          4960.00              6824.00         8068.00
8 Groups
50.0th:          23.50                25.50           23.00
75.0th:          30.50                421.50          30.50
90.0th:          443.50               1722.00         741.00
95.0th:          1410.00              2736.00         1670.00
99.0th:          3942.00              5496.00         4032.00
99.5th:          5232.00              7016.00         5064.00
99.9th:          7996.00              8896.00         8012.00
16 Groups
50.0th:          33.50                41.50           32.50
75.0th:          49.00                752.00          47.00
90.0th:          1067.50              2332.00         994.50
95.0th:          2093.00              3468.00         2117.00
99.0th:          5048.00              6728.00         5568.00
99.5th:          6760.00              7624.00         6960.00
99.9th:          8592.00              9504.00         11104.00
32 Groups
50.0th:          60.00                79.00           53.00
75.0th:          456.50               1712.00         209.50
90.0th:          2788.00              3996.00         2752.00
95.0th:          4544.00              5768.00         5024.00
99.0th:          8444.00              9104.00         10352.00
99.5th:          9168.00              9808.00         12720.00
99.9th:          11984.00             12448.00        17624.00

[...]

   @@ -1010,6 +1012,7 @@ struct rq {
         struct task_struct __rcu    *curr;
       struct task_struct    *idle;
+    u64            idle_end_time;

There is clock_idle already in the rq. Can that be used for the same purpose?

Good point! And I'll change my use of "sched_clock()" in idle_cpu() to a
proper "sched_clock_cpu(cpu_of(rq))", which will work better on systems
without a constant TSC.

The updated patch:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..1c7d5bd2968b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
 {
     struct rq *rq = cpu_rq(cpu);

+    if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
+        sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
+        return 1;
+
     if (rq->curr != rq->idle)
         return 0;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..57a49a5524f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,6 +97,9 @@
 # define SCHED_WARN_ON(x)      ({ (void)(x), 0; })
 #endif

+#define IDLE_CPU_DELAY_NS        1000000        /* 1ms */
+#define IDLE_CPU_DELAY_MAX_RUNNING    4
+
 struct rq;
 struct cpuidle_state;
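A small nit on the constants above (readability only, and assuming the usual
NSEC_PER_MSEC definition is visible here): spelling the delay in terms of
NSEC_PER_MSEC would make the unit explicit, e.g.:

/* Same values as above, unit made explicit (suggestion only): */
#define IDLE_CPU_DELAY_NS		(1 * NSEC_PER_MSEC)	/* 1ms */
#define IDLE_CPU_DELAY_MAX_RUNNING	4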

I have run some standard micro-benchmarks to test this patch on a 2-socket Bergamo system
which has 256C/512T (2 x 128C per socket). The following are the results:

base : 6.5.0-rc4
base + extend-idle : base + the original patch, which extends the idle state by 1ms.
base + extend-idle + nr_running : base + the updated patch, which contains both the idle extension and the nr_running limit.
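(In the tables below, "pct" is the percentage difference relative to the base
kernel; for example, for hackbench 1-groups: (19.95 - 15.92) / 19.95 * 100 ~= 20.20 pct.)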

========================= hackbench =========================
Test:          6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
 1-groups:     19.95 (0.00 pct)      15.92 (20.20 pct)       15.30 (23.30 pct)
 2-groups:     21.33 (0.00 pct)      23.00 (-7.82 pct)       19.70 (7.64 pct)
 4-groups:     22.57 (0.00 pct)      30.02 (-33.00 pct)      26.74 (-18.47 pct)
 8-groups:     24.68 (0.00 pct)      32.54 (-31.84 pct)      28.27 (-14.54 pct)
16-groups:     31.20 (0.00 pct)      32.47 (-4.07 pct)       27.70 (11.21 pct)

Observation: Hackbench shows improvement with lower and higher numbers of groups,
but it shows a dip in performance for the middle range.

========================= new_schbench =========================
Metric: wakeup_lat_summary
#workers:      6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
  1:           30.00 (0.00 pct)      33.00 (-10.00 pct)      33.00 (-10.00 pct)
  2:           26.00 (0.00 pct)      32.00 (-23.07 pct)      33.00 (-26.92 pct)
  4:           26.00 (0.00 pct)      32.00 (-23.07 pct)      32.00 (-23.07 pct)
  8:           9.00 (0.00 pct)       10.00 (-11.11 pct)      9.00 (0.00 pct)
 16:           8.00 (0.00 pct)       10.00 (-25.00 pct)      9.00 (-12.50 pct)
 32:           8.00 (0.00 pct)       9.00 (-12.50 pct)       10.00 (-25.00 pct)
 64:           8.00 (0.00 pct)       10.00 (-25.00 pct)      10.00 (-25.00 pct)
128:           8.00 (0.00 pct)       11.00 (-37.50 pct)      12.00 (-50.00 pct)
256:           102.00 (0.00 pct)     48.00 (52.94 pct)       50.00 (50.98 pct)
512:           20704.00 (0.00 pct)   23200.00 (-12.05 pct)   23200.00 (-12.05 pct)

Metric: request_lat_summary
#workers:      6.5.0-rc4 (base)      base + extend-idle       base + extend-idle + nr_running<=4
  1:           6712.00 (0.00 pct)    6744.00 (-0.47 pct)      6760.00 (-0.71 pct)
  2:           6792.00 (0.00 pct)    6840.00 (-0.70 pct)      6872.00 (-1.17 pct)
  4:           6792.00 (0.00 pct)    6840.00 (-0.70 pct)      6856.00 (-0.94 pct)
  8:           6776.00 (0.00 pct)    6824.00 (-0.70 pct)      6872.00 (-1.41 pct)
 16:           6760.00 (0.00 pct)    6792.00 (-0.47 pct)      6872.00 (-1.65 pct)
 32:           6808.00 (0.00 pct)    6776.00 (0.47 pct)       6872.00 (-0.94 pct)
 64:           6808.00 (0.00 pct)    6776.00 (0.47 pct)       6872.00 (-0.94 pct)
128:           12208.00 (0.00 pct)   11856.00 (2.88 pct)      12784.00 (-4.71 pct)
256:           13264.00 (0.00 pct)   13296.00 (-0.24 pct)     13680.00 (-3.13 pct)
512:           84096.00 (0.00 pct)   100992.00 (-20.09 pct)   110208.00 (-31.05 pct)

Metric: rps_summary
#workers:      6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
  1:           305.00 (0.00 pct)     302.00 (0.98 pct)       296.00 (2.95 pct)
  2:           607.00 (0.00 pct)     601.00 (0.98 pct)       593.00 (2.30 pct)
  4:           1214.00 (0.00 pct)    1202.00 (0.98 pct)      1190.00 (1.97 pct)
  8:           2436.00 (0.00 pct)    2420.00 (0.65 pct)      2372.00 (2.62 pct)
 16:           4888.00 (0.00 pct)    4872.00 (0.32 pct)      4808.00 (1.63 pct)
 32:           9776.00 (0.00 pct)    9744.00 (0.32 pct)      9680.00 (0.98 pct)
 64:           19616.00 (0.00 pct)   19488.00 (0.65 pct)     19360.00 (1.30 pct)
128:           38592.00 (0.00 pct)   38720.00 (-0.33 pct)    38080.00 (1.32 pct)
256:           41024.00 (0.00 pct)   40896.00 (0.31 pct)     38976.00 (4.99 pct)
512:           36800.00 (0.00 pct)   35776.00 (2.78 pct)     33728.00 (8.34 pct)

Observation: new_schbench shows a regression in wakeup latencies, while request
latencies and RPS show no change except for the highly loaded case in request
latency.

========================= schbench =========================
#workers:      6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
  1:           185.00 (0.00 pct)     181.00 (2.16 pct)       181.00 (2.16 pct)
  2:           191.00 (0.00 pct)     192.00 (-0.52 pct)      189.00 (1.04 pct)
  4:           191.00 (0.00 pct)     192.00 (-0.52 pct)      194.00 (-1.57 pct)
  8:           218.00 (0.00 pct)     198.00 (9.17 pct)       200.00 (8.25 pct)
 16:           226.00 (0.00 pct)     225.00 (0.44 pct)       224.00 (0.88 pct)
 32:           537.00 (0.00 pct)     537.00 (0.00 pct)       539.00 (-0.37 pct)
 64:           605.00 (0.00 pct)     621.00 (-2.64 pct)      619.00 (-2.31 pct)
128:           765.00 (0.00 pct)     795.00 (-3.92 pct)      781.00 (-2.09 pct)
256:           1122.00 (0.00 pct)    1150.00 (-2.49 pct)     1150.00 (-2.49 pct)
512:           2276.00 (0.00 pct)    2476.00 (-8.78 pct)     2404.00 (-5.62 pct)

========================= tbench =========================
Clients:       6.5.0-rc4 (base)      base + extend-idle       base + extend-idle + nr_running<=4
    1          386.07 (0.00 pct)     390.13 (1.05 pct)        463.49 (20.05 pct)
    2          718.40 (0.00 pct)     856.37 (19.20 pct)       799.31 (11.26 pct)
    4          1399.34 (0.00 pct)    1514.03 (8.19 pct)       1482.01 (5.90 pct)
    8          2716.56 (0.00 pct)    3000.02 (10.43 pct)      2823.23 (3.92 pct)
   16          5275.97 (0.00 pct)    5468.17 (3.64 pct)       5430.77 (2.93 pct)
   32          10534.37 (0.00 pct)   10442.13 (-0.87 pct)     10386.92 (-1.39 pct)
   64          22079.03 (0.00 pct)   19161.30 (-13.21 pct)    16773.73 (-24.02 pct)
  128          41051.13 (0.00 pct)   29923.57 (-27.10 pct)    22510.13 (-45.16 pct)
  256          55603.43 (0.00 pct)   43203.67 (-22.30 pct)    40343.17 (-27.44 pct)
  512          130673.33 (0.00 pct)  76581.40 (-41.39 pct)    69152.47 (-47.07 pct)
 1024          133323.67 (0.00 pct)  114910.67 (-13.81 pct)   107728.67 (-19.19 pct)
 2048          143674.33 (0.00 pct)  123842.67 (-13.80 pct)   107202.00 (-25.38 pct)

Observation: tbench shows a dip in throughput from 64 clients onwards.

====================== stream 10 RUNS ======================
Test:          6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
Copy:          354190.51 (0.00 pct)  356650.82 (0.69 pct)    353287.34 (-0.25 pct)
Scale:         355427.44 (0.00 pct)  356686.34 (0.35 pct)    354406.79 (-0.28 pct)
Add:           373800.46 (0.00 pct)  376610.56 (0.75 pct)    374609.00 (0.21 pct)
Triad:         374697.25 (0.00 pct)  377635.98 (0.78 pct)    375343.44 (0.17 pct)

====================== stream 100 RUNS ======================
Test:          6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
Copy:          357922.89 (0.00 pct)  356560.50 (-0.38 pct)   356507.22 (-0.39 pct)
Scale:         358118.38 (0.00 pct)  357435.86 (-0.19 pct)   358033.29 (-0.02 pct)
Add:           375307.34 (0.00 pct)  376046.70 (0.19 pct)    375586.33 (0.07 pct)
Triad:         375656.40 (0.00 pct)  376674.43 (0.27 pct)    376581.95 (0.24 pct)

========================== netperf ==========================
Clients:       6.5.0-rc4 (base)      base + extend-idle      base + extend-idle + nr_running<=4
  1-clients:   114299.07 (0.00 pct)  110695.40 (-3.15 pct)   111533.30 (-2.41 pct)
  2-clients:   114130.01 (0.00 pct)  110192.51 (-3.45 pct)   111682.01 (-2.14 pct)
  4-clients:   109126.45 (0.00 pct)  107275.60 (-1.69 pct)   109574.77 (0.41 pct)
  8-clients:   111209.21 (0.00 pct)  104360.40 (-6.15 pct)   106518.41 (-4.21 pct)
 16-clients:   102955.20 (0.00 pct)  100968.38 (-1.92 pct)   101897.25 (-1.02 pct)
 32-clients:   98537.18 (0.00 pct)   103018.09 (4.54 pct)    103917.70 (5.46 pct)
 64-clients:   103619.68 (0.00 pct)  100376.37 (-3.13 pct)   102651.24 (-0.93 pct)
128-clients:   98536.55 (0.00 pct)   77845.69 (-20.99 pct)   98566.87 (0.03 pct)
256-clients:   51934.45 (0.00 pct)   52844.10 (1.75 pct)     53562.99 (3.13 pct)

I have also tried the same experiment on a one-socket Genoa system with 96C/192T. On that
system I am seeing similar behavior as well.

Can you share your build config, just in case I am missing something?


And using it now brings the hackbench wall time down to 28s :)

Thanks,

Mathieu


       struct task_struct    *stop;
       unsigned long        next_balance;
       struct mm_struct    *prev_mm;


--
Thanks and regards,
Swapnil