Re: [RFC PATCH v4 0/2] sched/fair: Choose the CPU where short task is running during wake up

From: K Prateek Nayak
Date: Thu Dec 29 2022 - 02:17:38 EST


Hello Chenyu,

Including the detailed results from testing below.

tl;dr

o There seems to be 3 noticeable regressions:
- tbench for lower number of clients. The schedstat data shows
an increase in wait time.
- SpecJBB MultiJVM performance drops as the workload prefers
an idle CPU over a busy one.
- Unixbench-pipe benchmark performance drops.

o Most benchmark numbers remain same.

o Small gains seen for ycsb-mongodb and unixbench-syscall.

On 12/16/2022 11:38 AM, Chen Yu wrote:
> The main purpose of this change is to avoid too many cross CPU
> wake up when it is unnecessary. The frequent cross CPU wake up
> brings significant damage to some workloads, especially on high
> core count systems.
>
> This patch set inhibits the cross CPU wake-up by placing the wakee
> on waking CPU or previous CPU, if both the waker and wakee are
> short-duration tasks.
>
> The first patch is to introduce the definition of a short-duration
> task. The second patch leverages the first patch to choose a local
> or previous CPU for wakee.
>
> Changes since v3:
> 1. Honglei and Josh have concern that the threshold of short
> task duration could be too long. Decreased the threshold from
> sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> and the '8' comes from get_update_sysctl_factor().
> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> 3. Move the calculation of average duration from put_prev_task_fair()
> to dequeue_task_fair(). Because there is an issue in v3 that,
> put_prev_task_fair() will not be invoked by pick_next_task_fair()
> in fast path, thus the dur_avg could not be updated timely.
> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> 5. Move the scan for CPU with short duration task from select_idle_cpu()
> to select_idle_siblings(), because there is no CPU scan involved, per> Yicong.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255

Benchmark Results:

Kernel versions:
- tip: 6.1.0-rc2 tip sched/core
- sis_short: 6.1.0-rc2 tip sched/core + this series

When the testing started, the tip was at:
commit d6962c4fe8f9 "sched: Clear ttwu_pending after enqueue_task()"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test: tip sis_short
1-groups: 4.25 (0.00 pct) 4.26 (-0.23 pct)
2-groups: 4.95 (0.00 pct) 4.87 (1.61 pct)
4-groups: 5.19 (0.00 pct) 5.09 (1.92 pct)
8-groups: 5.45 (0.00 pct) 5.37 (1.46 pct)
16-groups: 7.33 (0.00 pct) 7.65 (-4.36 pct)

NPS2

Test: tip sis_short
1-groups: 4.09 (0.00 pct) 4.20 (-2.68 pct)
2-groups: 4.68 (0.00 pct) 4.75 (-1.49 pct)
4-groups: 5.05 (0.00 pct) 4.94 (2.17 pct)
8-groups: 5.37 (0.00 pct) 5.29 (1.48 pct)
16-groups: 6.69 (0.00 pct) 6.78 (-1.34 pct)

NPS4

Test: tip sis_short
1-groups: 4.28 (0.00 pct) 4.37 (-2.10 pct)
2-groups: 4.78 (0.00 pct) 4.82 (-0.83 pct)
4-groups: 5.11 (0.00 pct) 5.06 (0.97 pct)
8-groups: 5.48 (0.00 pct) 5.38 (1.82 pct)
16-groups: 7.07 (0.00 pct) 6.93 (1.98 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers: tip sis_short
1: 31.00 (0.00 pct) 32.00 (-3.22 pct)
2: 33.00 (0.00 pct) 32.00 (3.03 pct)
4: 39.00 (0.00 pct) 34.00 (12.82 pct)
8: 45.00 (0.00 pct) 43.00 (4.44 pct)
16: 61.00 (0.00 pct) 66.00 (-8.19 pct)
32: 108.00 (0.00 pct) 107.00 (0.92 pct)
64: 212.00 (0.00 pct) 223.00 (-5.18 pct)
128: 475.00 (0.00 pct) 519.00 (-9.26 pct) *
128: 434.00 (0.00 pct) 438.00 (-0.92 pct) [Verification Run]
256: 44736.00 (0.00 pct) 42048.00 (6.00 pct)
512: 77184.00 (0.00 pct) 76672.00 (0.66 pct)

NPS2

#workers: tip sis_short
1: 28.00 (0.00 pct) 32.00 (-14.28 pct)
2: 34.00 (0.00 pct) 37.00 (-8.82 pct)
4: 36.00 (0.00 pct) 37.00 (-2.77 pct)
8: 51.00 (0.00 pct) 47.00 (7.84 pct)
16: 68.00 (0.00 pct) 67.00 (1.47 pct)
32: 113.00 (0.00 pct) 117.00 (-3.53 pct)
64: 221.00 (0.00 pct) 221.00 (0.00 pct)
128: 553.00 (0.00 pct) 567.00 (-2.53 pct)
256: 43840.00 (0.00 pct) 47040.00 (-7.29 pct) *
256: 46016.00 (0.00 pct) 46272.00 (-0.55 pct) [Verification Run]
512: 76672.00 (0.00 pct) 78976.00 (-3.00 pct)

NPS4

#workers: tip sis_short
1: 33.00 (0.00 pct) 32.00 (3.03 pct)
2: 29.00 (0.00 pct) 42.00 (-44.82 pct)
4: 39.00 (0.00 pct) 43.00 (-10.25 pct)
8: 58.00 (0.00 pct) 56.00 (3.44 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 112.00 (0.00 pct) 113.00 (-0.89 pct)
64: 215.00 (0.00 pct) 214.00 (0.46 pct)
128: 689.00 (0.00 pct) 823.00 (-19.44 pct) *
128: 424.00 (0.00 pct) 449.00 (-5.89 pct) [Verification Run]
256: 45120.00 (0.00 pct) 44608.00 (1.13 pct)
512: 77440.00 (0.00 pct) 79488.00 (-2.64 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients: tip sis_short
1 581.75 (0.00 pct) 545.15 (-6.29 pct) *
1 588.80 (0.00 pct) 547.71 (-6.97 pct) [Verification Run]
2 1145.75 (0.00 pct) 1074.91 (-6.18 pct) *
2 1153.55 (0.00 pct) 1068.06 (-7.41 pct) [Verification Run]
4 2127.94 (0.00 pct) 1958.71 (-7.95 pct) *
4 2133.95 (0.00 pct) 1965.03 (-7.91 pct) [Verification Run]
8 3838.27 (0.00 pct) 3701.75 (-3.55 pct)
16 6272.71 (0.00 pct) 6045.56 (-3.62 pct)
32 11400.12 (0.00 pct) 11688.67 (2.53 pct)
64 21605.96 (0.00 pct) 22070.92 (2.15 pct)
128 30715.43 (0.00 pct) 30772.11 (0.18 pct)
256 55580.78 (0.00 pct) 54998.38 (-1.04 pct)
512 56528.79 (0.00 pct) 55239.00 (-2.28 pct)
1024 56520.40 (0.00 pct) 55153.71 (-2.41 pct)

NPS2

Clients: tip sis_short
1 584.13 (0.00 pct) 555.55 (-4.89 pct)
2 1153.63 (0.00 pct) 1094.62 (-5.11 pct) *
2 1147.26 (0.00 pct) 1063.35 (-7.31 pct) [Verification Run]
4 2212.89 (0.00 pct) 2049.24 (-7.39 pct) *
4 2133.57 (0.00 pct) 2004.66 (-6.04 pct) [Verification Run]
8 3871.35 (0.00 pct) 3793.66 (-2.00 pct)
16 6216.72 (0.00 pct) 5895.87 (-5.16 pct) *
16 6342.15 (0.00 pct) 5962.58 (-5.98 pct) [Verification Run]
32 11766.98 (0.00 pct) 11013.97 (-6.39 pct) *
32 11009.53 (0.00 pct) 10215.06 (-7.21 pct) [Verification Run]
64 22000.93 (0.00 pct) 20563.47 (-6.53 pct) *
64 20315.34 (0.00 pct) 20097.13 (-1.07 pct) [Verification Run]
128 31520.53 (0.00 pct) 27489.63 (-12.78 pct) *
128 29224.27 (0.00 pct) 27872.81 (-4.62 pct) [Verification Run]
256 51420.11 (0.00 pct) 51968.82 (1.06 pct)
512 53935.90 (0.00 pct) 54053.59 (0.21 pct)
1024 55239.73 (0.00 pct) 54505.89 (-1.32 pct)

NPS4

Clients: tip sis_short
1 585.83 (0.00 pct) 547.36 (-6.56 pct) *
1 590.36 (0.00 pct) 546.57 (-7.41 pct) [Verification Run]
2 1141.59 (0.00 pct) 1061.99 (-6.97 pct) *
2 1167.56 (0.00 pct) 1059.16 (-9.28 pct) [Verification Run]
4 2174.79 (0.00 pct) 2002.42 (-7.92 pct) *
4 2169.72 (0.00 pct) 1973.22 (-9.05 pct) [Verification Run]
8 3887.56 (0.00 pct) 3543.62 (-8.84 pct) *
8 3747.36 (0.00 pct) 3398.57 (-9.30 pct) [Verification Run]
16 6441.59 (0.00 pct) 5896.92 (-8.45 pct) *
16 6165.54 (0.00 pct) 6019.96 (-2.36 pct) [Verification Run]
32 12133.60 (0.00 pct) 11074.16 (-8.73 pct) *
32 11182.63 (0.00 pct) 11241.82 (0.52 pct) [Verification Run]
64 21769.15 (0.00 pct) 19485.94 (-10.48 pct) *
64 20168.52 (0.00 pct) 20333.98 (0.82 pct) [Verification Run]
128 31396.31 (0.00 pct) 29029.29 (-7.53 pct) *
128 29437.98 (0.00 pct) 27438.60 (-6.79 pct) [Verification Run]
256 52792.39 (0.00 pct) 51560.92 (-2.33 pct)
512 55315.44 (0.00 pct) 52914.21 (-4.34 pct)
1024 52150.27 (0.00 pct) 53286.08 (2.17 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test: tip sis_short
Copy: 307827.79 (0.00 pct) 320671.93 (4.17 pct)
Scale: 208872.28 (0.00 pct) 213725.16 (2.32 pct)
Add: 239404.64 (0.00 pct) 240872.59 (0.61 pct)
Triad: 247258.30 (0.00 pct) 249764.53 (1.01 pct)

100 Runs:

Test: tip sis_short
Copy: 317217.55 (0.00 pct) 318031.46 (0.25 pct)
Scale: 208740.82 (0.00 pct) 213773.01 (2.41 pct)
Add: 240550.63 (0.00 pct) 239618.79 (-0.38 pct)
Triad: 249594.21 (0.00 pct) 245722.98 (-1.55 pct)

NPS2

10 Runs:

Test: tip sis_short
Copy: 340877.18 (0.00 pct) 336204.91 (-1.37 pct)
Scale: 217318.16 (0.00 pct) 218200.84 (0.40 pct)
Add: 259078.93 (0.00 pct) 258847.83 (-0.08 pct)
Triad: 274500.78 (0.00 pct) 267234.01 (-2.64 pct)

100 Runs:

Test: tip sis_short
Copy: 341860.73 (0.00 pct) 335116.25 (-1.97 pct)
Scale: 218043.00 (0.00 pct) 219265.81 (0.56 pct)
Add: 253698.22 (0.00 pct) 260792.08 (2.79 pct)
Triad: 265011.84 (0.00 pct) 269655.66 (1.75 pct)

NPS4

10 Runs:

Test: tip sis_short
Copy: 323775.81 (0.00 pct) 363923.22 (12.39 pct)
Scale: 237719.83 (0.00 pct) 238074.93 (0.14 pct)
Add: 251896.35 (0.00 pct) 269129.04 (6.84 pct)
Triad: 264124.72 (0.00 pct) 284456.89 (7.69 pct)

100 Runs:

Test: tip sis_short
Copy: 360777.62 (0.00 pct) 375325.94 (4.03 pct)
Scale: 239333.27 (0.00 pct) 238341.16 (-0.41 pct)
Add: 271142.94 (0.00 pct) 269924.55 (-0.44 pct)
Triad: 285408.97 (0.00 pct) 290345.65 (1.72 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip: 131696.33 (var: 2.03%)
sis_short: 129031.33 (var: 1.71%) (-2.02%)

o NPS2:

tip: 129895.33 (var: 2.34%)
sis_short: 133652.33 (var: 1.11%) (+2.89%)

o NPS4:

tip: 131165.00 (var: 1.06%)
sis_short: 135975.00 (var: 0.34%) (+3.66%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb MultiJVM - NPS1 ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Kernel tip sis_short
Max-jOPS 100% 96%
Critical-jOPS 100% 97%

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48929419.48 ( 0.00%) 48992339.28 ( 0.13%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6266355505.80 ( 0.00%) 6251441423.60 ( -0.24%)
unixbench-syscall Amean unixbench-syscall-1 2994319.73 ( 0.00%) 2665595.13 * 10.98%*
unixbench-syscall Amean unixbench-syscall-512 7349715.87 ( 0.00%) 7645690.70 * -4.03%*
unixbench-pipe Hmean unixbench-pipe-1 2830206.03 ( 0.00%) 2508957.89 * -11.35%* *
unixbench-pipe Hmean unixbench-pipe-512 326207828.01 ( 0.00%) 306588592.66 * -6.01%* *
unixbench-spawn Hmean unixbench-spawn-1 6394.21 ( 0.00%) 6153.47 ( -3.76%)
unixbench-spawn Hmean unixbench-spawn-512 72700.64 ( 0.00%) 87873.75 * 20.87%*
unixbench-execl Hmean unixbench-execl-1 4723.61 ( 0.00%) 4678.46 ( -0.96%)
unixbench-execl Hmean unixbench-execl-512 11212.05 ( 0.00%) 11067.41 * -1.29%*

o NPS2

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49271512.85 ( 0.00%) 48842191.93 ( -0.87%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6267992483.03 ( 0.00%) 6261579339.17 ( -0.10%)
unixbench-syscall Amean unixbench-syscall-1 2995885.93 ( 0.00%) 2668794.40 * 10.92%*
unixbench-syscall Amean unixbench-syscall-512 7388865.77 ( 0.00%) 7229531.57 * 2.16%*
unixbench-pipe Hmean unixbench-pipe-1 2828971.95 ( 0.00%) 2509560.35 * -11.29%* *
unixbench-pipe Hmean unixbench-pipe-512 326225385.37 ( 0.00%) 306941838.69 * -5.91%* *
unixbench-spawn Hmean unixbench-spawn-1 6958.71 ( 0.00%) 6613.24 ( -4.96%)
unixbench-spawn Hmean unixbench-spawn-512 85443.56 ( 0.00%) 89130.15 * 4.31%*
unixbench-execl Hmean unixbench-execl-1 4767.99 ( 0.00%) 4684.44 * -1.75%*
unixbench-execl Hmean unixbench-execl-512 11250.72 ( 0.00%) 11197.80 ( -0.47%)

o NPS4

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49041932.68 ( 0.00%) 48942640.63 ( -0.20%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6277864986.70 ( 0.00%) 6263890106.10 ( -0.22%)
unixbench-syscall Amean unixbench-syscall-1 2992405.60 ( 0.00%) 2663959.00 * 10.98%*
unixbench-syscall Amean unixbench-syscall-512 7971789.70 ( 0.00%) 7834566.33 * 1.72%*
unixbench-pipe Hmean unixbench-pipe-1 2822892.54 ( 0.00%) 2505995.43 * -11.23%* *
unixbench-pipe Hmean unixbench-pipe-512 326408309.83 ( 0.00%) 306887263.78 * -5.98%* *
unixbench-spawn Hmean unixbench-spawn-1 7685.31 ( 0.00%) 7675.29 ( -0.13%)
unixbench-spawn Hmean unixbench-spawn-512 72245.56 ( 0.00%) 74190.34 * 2.69%*
unixbench-execl Hmean unixbench-execl-1 4761.42 ( 0.00%) 4683.56 * -1.64%*
unixbench-execl Hmean unixbench-execl-512 11533.53 ( 0.00%) 11298.59 ( -2.04%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Analyzing tbench (2 client) for NPS1 configuration ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tip: 1126.33 MB/sec
sis_short: 1031.26 MB/sec (-8.44%)

-> Legend for per CPU stats:

rq->yld_count: sched_yield count
rq->sched_count: schedule called
rq->sched_goidle: schedule left the processor idle
rq->ttwu_count: try_to_wake_up was called
rq->ttwu_local: try_to_wake_up was called to wake up the local cpu
rq->rq_cpu_time: total runtime by tasks on this processor (in jiffies)
rq->rq_sched_info.run_delay: total waittime by tasks on this processor (in jiffies)
rq->rq_sched_info.pcount: total timeslices run on this cpu

-> System wide stats

-------------------------------------------------------------------------------------------------------------------------
cpu: all_cpus (avg) vs cpu: all_cpus (avg)
-------------------------------------------------------------------------------------------------------------------------
sched_yield count : 0, 0
Legacy counter can be ignored : 0, 0
schedule called : 233543, 213786 | -8.46|
schedule left the processor idle : 116745, 106863 | -8.46|
try_to_wake_up was called : 116772, 106893 | -8.46|
try_to_wake_up was called to wake up the local cpu : 74, 84 | 13.51|
total runtime by tasks on this processor (in jiffies) : 579771198, 585440128
total waittime by tasks on this processor (in jiffies) : 176365, 258702 | 46.69| * Wait time is much longer
total timeslices run on this cpu : 116797, 106922 | -8.45|
-------------------------------------------------------------------------------------------------------------------------
< ----------------------------------------------------------------- Wakeup info: ------------------------------------ >
Wakeups on same SMT cpus = all_cpus (avg) : 0, 0
Wakeups on same MC cpus = all_cpus (avg) : 116689, 106797 | -8.48|
Wakeups on same DIE cpus = all_cpus (avg) : 2, 4
Wakeups on same NUMA cpus = all_cpus (avg) : 5, 7
Affine wakeups on same SMT cpus = all_cpus (avg) : 0, 0
Affine wakeups on same MC cpus = all_cpus (avg) : 116667, 106781 | -8.47|
Affine wakeups on same DIE cpus = all_cpus (avg) : 2, 4
Affine wakeups on same NUMA cpus = all_cpus (avg) : 5, 6
--------------------------------------------------------------------------------------------------------------------------

The rq->rq_sched_info.pcount and rq->sched_count seems to
have reduced proportionally.

>
> Changes since v2:
>
> 1. Peter suggested comparing the duration of waker and the cost to
> scan for an idle CPU: If the cost is higher than the task duration,
> do not waste time finding an idle CPU, choose the local or previous
> CPU directly. A prototype was created based on this suggestion.
> However, according to the test result, this prototype does not inhibit
> the cross CPU wakeup and did not bring improvement. Because the cost
> to find an idle CPU is small in the problematic scenario. The root
> cause of the problem is a race condition between scanning for an idle
> CPU and task enqueue(please refer to the commit log in PATCH 2/2).
> So v3 does not change the core logic of v2, with some refinement based
> on Peter's suggestion.
>
> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
>
> This change brings overall improvement on some microbenchmarks, both on
> Intel and AMD platforms.
>
> v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@xxxxxxxxx/
> v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@xxxxxxxxx/
> v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@xxxxxxxxx/
>
> Chen Yu (2):
> sched/fair: Introduce short duration task check
> sched/fair: Choose the CPU where short task is running during wake up
>
> include/linux/sched.h | 3 +++
> kernel/sched/core.c | 2 ++
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 5 files changed, 39 insertions(+)
>

All numbers are with turbo and C2 enabled. I wonder if the
check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads
like tbench and SpecJBB. I'll queue some runs with the condition
added back and separate run with turbo and C2 disabled to see
if they helps. I'll update the thread once the results are in.

--
Thanks and Regards,
Prateek