Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
From: K Prateek Nayak
Date: Fri Aug 18 2023 - 00:11:23 EST
Hello Tejun,
On 8/8/2023 8:28 AM, K Prateek Nayak wrote:
> Hello Tejun,
>
> On 8/8/2023 6:52 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>>> Unbound workqueues used to spray work items inside each NUMA node, which
>>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>>> mechanisms to improve and configure execution locality.
>>
>> The patchset shows minor perf improvements for some but more importantly
>> gives users more control over worker placement which helps working around
>> some of the recently reported performance regressions. Prateek reported
>> concerning regressions with tbench but I couldn't reproduce it and can't see
>> how tbench would be affected at all given the benchmark doesn't involve
>> workqueue operations in any noticeable way.
>>
>> Assuming that the tbench difference was a testing artifact, I'm applying the
>> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
>> really appreciate if you could repeat the test and see whether the
>> difference persists.
>
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.
Sorry, I'm lagging on the test queue, but following are the results of the
standard benchmarks run on a dual-socket 3rd Generation EPYC system
(2 x 64C/128T).
tl;dr
- No noticeable difference in performance.
- The netperf and tbench regressions are gone now, and the base numbers
  are also much higher than before (sorry for the false alarm!)
Following are the results:
base: affinity-scopes-v2 branch at commit 18c8ae813156 ("workqueue:
      Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us
      is 0")

affinity-scope: affinity-scopes-v2 branch at commit a4da9f618d3e
      ("workqueue: Add "Affinity Scopes and Performance" section to
      documentation")

Note: numbers in the tables below are normalized to base; "[pct imp]" is
the percentage improvement over base and "(CV)" is the coefficient of
variation across runs.
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) affinity-scope[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.76) 0.99 [ 0.56]( 3.02)
2-groups 1.00 [ -0.00]( 1.52) 1.01 [ -0.94]( 2.36)
4-groups 1.00 [ -0.00]( 1.49) 1.02 [ -2.20]( 1.91)
8-groups 1.00 [ -0.00]( 1.12) 1.00 [ -0.00]( 0.93)
16-groups 1.00 [ -0.00]( 3.64) 1.01 [ -0.87]( 2.66)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) affinity-scope[pct imp](CV)
1 1.00 [ 0.00]( 0.47) 1.00 [ -0.21]( 1.03)
2 1.00 [ 0.00]( 0.10) 1.00 [ 0.00]( 0.45)
4 1.00 [ 0.00]( 1.60) 1.00 [ -0.18]( 0.83)
8 1.00 [ 0.00]( 0.13) 1.00 [ -0.26]( 0.59)
16 1.00 [ 0.00]( 1.69) 1.02 [ 2.05]( 1.08)
32 1.00 [ 0.00]( 0.35) 1.00 [ -0.36]( 2.47)
64 1.00 [ 0.00]( 0.43) 1.00 [ 0.45]( 2.54)
128 1.00 [ 0.00]( 0.31) 0.99 [ -0.82]( 0.58)
256 1.00 [ 0.00]( 1.81) 0.98 [ -1.84]( 1.80)
512 1.00 [ 0.00]( 0.54) 1.00 [ 0.04]( 0.06)
1024 1.00 [ 0.00]( 0.13) 1.01 [ 1.01]( 0.42)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) affinity-scope[pct imp](CV)
Copy 1.00 [ 0.00]( 6.45) 1.03 [ 2.50]( 5.75)
Scale 1.00 [ 0.00]( 6.21) 1.03 [ 3.36]( 0.75)
Add 1.00 [ 0.00]( 6.10) 1.04 [ 4.23]( 1.81)
Triad 1.00 [ 0.00]( 7.24) 1.03 [ 3.49]( 3.41)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) affinity-scope[pct imp](CV)
Copy 1.00 [ 0.00]( 1.98) 1.00 [ 0.40]( 2.57)
Scale 1.00 [ 0.00]( 4.88) 1.00 [ -0.07]( 5.11)
Add 1.00 [ 0.00]( 4.60) 1.00 [ 0.23]( 5.21)
Triad 1.00 [ 0.00]( 6.21) 1.03 [ 2.85]( 2.55)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) affinity-scope[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.84) 1.01 [ 0.99]( 0.72)
2-clients 1.00 [ 0.00]( 0.64) 1.01 [ 0.53]( 0.77)
4-clients 1.00 [ 0.00]( 0.75) 1.01 [ 0.54]( 0.96)
8-clients 1.00 [ 0.00]( 0.83) 1.00 [ -0.21]( 1.03)
16-clients 1.00 [ 0.00]( 0.75) 1.00 [ 0.31]( 0.81)
32-clients 1.00 [ 0.00]( 0.82) 1.00 [ 0.12]( 1.57)
64-clients 1.00 [ 0.00]( 2.30) 1.00 [ -0.28]( 2.39)
128-clients 1.00 [ 0.00]( 2.54) 0.99 [ -1.01]( 2.61)
256-clients 1.00 [ 0.00]( 4.37) 1.01 [ 1.23]( 2.69)
512-clients 1.00 [ 0.00](48.73) 1.01 [ 0.99](46.07)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) affinity-scope[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 1.00 [ -0.00]( 2.28)
2 1.00 [ -0.00]( 8.55) 0.96 [ 4.00]( 4.17)
4 1.00 [ -0.00]( 3.81) 0.94 [ 6.45]( 8.78)
8 1.00 [ -0.00]( 2.78) 0.97 [ 2.78]( 4.81)
16 1.00 [ -0.00]( 1.22) 0.96 [ 4.26]( 1.27)
32 1.00 [ -0.00]( 2.02) 0.97 [ 2.63]( 3.99)
64 1.00 [ -0.00]( 5.65) 0.99 [ 0.62]( 1.65)
128 1.00 [ -0.00]( 5.17) 0.98 [ 1.91]( 8.12)
256 1.00 [ -0.00](10.79) 1.07 [ -6.82]( 7.18)
512 1.00 [ -0.00]( 1.24) 0.99 [ 0.54]( 1.37)
==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, HMean (specified per row)
==================================================================
base affinity-scope
Hmean unixbench-dhry2reg-1 40947261.77 ( 0.00%) 41078213.81 ( 0.32%)
Hmean unixbench-dhry2reg-512 6243140251.68 ( 0.00%) 6240938691.75 ( -0.04%)
Amean unixbench-syscall-1 2932806.37 ( 0.00%) 2871035.50 * 2.11%*
Amean unixbench-syscall-512 7689448.00 ( 0.00%) 8406697.27 * 9.33%*
Hmean unixbench-pipe-1 2577667.42 ( 0.00%) 2497979.59 * -3.09%*
Hmean unixbench-pipe-512 363366036.45 ( 0.00%) 356991588.20 * -1.75%*
Hmean unixbench-spawn-1 4446.97 ( 0.00%) 4760.91 * 7.06%*
Hmean unixbench-spawn-512 68983.49 ( 0.00%) 68464.78 * -0.75%*
Hmean unixbench-execl-1 3894.20 ( 0.00%) 3857.78 ( -0.94%)
Hmean unixbench-execl-512 12716.76 ( 0.00%) 13067.63 ( 2.76%)
==================================================================
Test : tbench (Various Affinity Scopes)
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) cpu[pct imp](CV) smt[pct imp](CV) cache[pct imp](CV) numa[pct imp](CV) system[pct imp](CV)
1 1.00 [ 0.00]( 0.47) 1.00 [ 0.11]( 0.95) 1.00 [ 0.23]( 1.97) 1.01 [ 1.01]( 0.29) 1.00 [ 0.07]( 0.57) 1.01 [ 1.36]( 0.36)
2 1.00 [ 0.00]( 0.10) 1.01 [ 1.14]( 0.27) 0.99 [ -0.84]( 0.51) 1.01 [ 1.05]( 0.50) 1.00 [ 0.24]( 0.75) 1.00 [ -0.29]( 1.22)
4 1.00 [ 0.00]( 1.60) 1.02 [ 2.07]( 1.42) 1.02 [ 1.65]( 0.46) 1.02 [ 2.45]( 0.83) 1.00 [ 0.36]( 1.33) 1.02 [ 2.37]( 0.57)
8 1.00 [ 0.00]( 0.13) 1.00 [ -0.02]( 0.61) 1.00 [ 0.14]( 0.57) 1.01 [ 0.88]( 0.33) 1.00 [ -0.26]( 0.30) 1.01 [ 0.90]( 1.48)
16 1.00 [ 0.00]( 1.69) 1.03 [ 3.10]( 0.69) 1.04 [ 3.66]( 1.36) 1.02 [ 2.36]( 0.62) 1.02 [ 1.61]( 1.63) 1.04 [ 3.77]( 1.00)
32 1.00 [ 0.00]( 0.35) 0.97 [ -3.49]( 0.62) 0.97 [ -3.21]( 0.77) 1.00 [ -0.24]( 3.77) 0.96 [ -4.08]( 4.43) 0.97 [ -2.81]( 3.50)
64 1.00 [ 0.00]( 0.43) 1.00 [ 0.20]( 1.66) 0.99 [ -0.61]( 0.81) 1.03 [ 2.87]( 0.55) 1.02 [ 2.16]( 2.31) 0.98 [ -2.32]( 3.63)
128 1.00 [ 0.00]( 0.31) 1.01 [ 1.44]( 1.33) 1.01 [ 0.72]( 0.46) 1.01 [ 1.33]( 0.67) 1.00 [ 0.38]( 0.58) 1.01 [ 1.44]( 1.35)
256 1.00 [ 0.00]( 1.81) 0.98 [ -2.10]( 1.05) 0.97 [ -2.50]( 0.42) 0.97 [ -3.46]( 0.91) 0.99 [ -0.79]( 0.85) 0.96 [ -3.83]( 0.29)
512 1.00 [ 0.00]( 0.54) 1.00 [ 0.37]( 1.12) 0.99 [ -1.33]( 0.44) 1.00 [ -0.19]( 0.94) 1.01 [ 0.87]( 1.05) 0.99 [ -1.08]( 0.12)
1024 1.00 [ 0.00]( 0.13) 1.01 [ 1.10]( 0.49) 1.00 [ 0.47]( 0.28) 1.00 [ 0.33]( 0.73) 1.00 [ 0.48]( 0.69) 1.00 [ 0.01]( 0.47)
==================================================================
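For the per-scope runs above, the scope can be switched without
rebuilding the kernel. A minimal sketch of the knobs, going by my
reading of the documentation added in a4da9f618d3e (please correct me
if I have the interface wrong):

  # Check / change the default scope for unbound workqueues at runtime
  # (valid scopes: cpu, smt, cache, numa, system)
  cat /sys/module/workqueue/parameters/default_affinity_scope
  echo cache > /sys/module/workqueue/parameters/default_affinity_scope

  # The same can be set at boot via the kernel command line:
  #   workqueue.default_affinity_scope=cache

  # Per-workqueue override for WQ_SYSFS unbound workqueues
  # (writing "default" should make it follow the global setting again):
  echo numa > /sys/devices/virtual/workqueue/<wq_name>/affinity_scope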
ycsb-mongodb and DeathStarBench do not show any difference in
performance either. I'll go ahead and test more NPS modes / more machines.
Meanwhile, please feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
--
Thanks and Regards,
Prateek