Re: [PATCH 0/2] sched: Optionally skip uclamp logic in fast path
From: Lukasz Luba
Date: Mon Jun 22 2020 - 05:06:39 EST
Hi Qais,
On 6/18/20 8:55 PM, Qais Yousef wrote:
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
>
> The first patch is a fix for how struct uclamp_rq is initialized, which is
> required by the 2nd patch, which contains the real 'fix'.
>
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to certain code/data layout issues, leading to
> worse I/D $ performance.
>
> Different systems exhibited different behaviors and the regression did
> disappear in certain kernel versions while attempting to reproduce.
>
> More info can be found here:
>
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
>
> Having the static key seemed the best thing to do to ensure the effect of
> uclamp is minimized for kernels that compile it in but don't have a userspace
> that uses it, which will allow distros to distribute uclamp capable kernels by
> default without having to compromise on performance for some systems that could
> be affected.
>
> Thanks
>
> --
> Qais Yousef
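For context, the "static key" mentioned above is the kernel's jump-label
mechanism: the branch is patched to a NOP at runtime until the key is
enabled, so kernels with CONFIG_UCLAMP_TASK=y but no uclamp user pay
(almost) nothing in the fast path. A minimal kernel-context sketch of the
gating pattern (illustrative only, not the patch itself; the key name and
__uclamp_rq_inc()/uclamp_mark_used() are made-up stand-ins for the existing
accounting code and the enable site):

#include <linux/jump_label.h>

/* Defaults to false: the fast-path test below starts out as a NOP. */
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
	/* Patched-out branch until some task actually sets a clamp. */
	if (!static_branch_unlikely(&sched_uclamp_used))
		return;

	__uclamp_rq_inc(rq, p);	/* existing per-rq clamp accounting */
}

/* Flipped once, the first time userspace sets a clamp value
 * (e.g. via sched_setattr() or the cpu.uclamp.* cgroup files). */
static void uclamp_mark_used(void)
{
	static_branch_enable(&sched_uclamp_used);
}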
I have given it a try on my machine
(HP server, 2 sockets, 24 CPUs, x86_64, 4 NUMA nodes, AMD Opteron 6174,
L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).
Kernel v5.7-rc7 with the openSUSE 15.1 distro config.
NUMA control for pinning tasks has not been used.
The results are better than the last time I checked this uclamp
issue [1]. Here are the results for a kernel built from the openSUSE 15.1
distro config + uclamp tasks + task groups (similar to the 3rd kernel from
the [1] tests).
They are really good in terms of netperf-udp performance and in how they
handle statistical noise due to context switches and/or tasks jumping
around CPUs.
The netperf-udp results (100 runs for each UDP send size):
./v5.7-rc7-base ./v5.7-rc7-uclamp-tsk-grp-fix
Hmean send-64 62.36 ( 0.00%) 66.27 * 6.27%*
Hmean send-128 124.24 ( 0.00%) 132.03 * 6.27%*
Hmean send-256 244.81 ( 0.00%) 261.21 * 6.70%*
Hmean send-1024 922.18 ( 0.00%) 985.84 * 6.90%*
Hmean send-2048 1716.61 ( 0.00%) 1811.30 * 5.52%*
Hmean send-3312 2564.73 ( 0.00%) 2690.62 * 4.91%*
Hmean send-4096 2967.01 ( 0.00%) 3128.71 * 5.45%*
Hmean send-8192 4834.31 ( 0.00%) 5028.15 * 4.01%*
Hmean send-16384 7569.17 ( 0.00%) 7734.05 * 2.18%*
Hmean recv-64 62.35 ( 0.00%) 66.27 * 6.28%*
Hmean recv-128 124.24 ( 0.00%) 132.03 * 6.27%*
Hmean recv-256 244.79 ( 0.00%) 261.20 * 6.70%*
Hmean recv-1024 922.10 ( 0.00%) 985.82 * 6.91%*
Hmean recv-2048 1716.61 ( 0.00%) 1811.29 * 5.52%*
Hmean recv-3312 2564.46 ( 0.00%) 2690.60 * 4.92%*
Hmean recv-4096 2967.00 ( 0.00%) 3128.71 * 5.45%*
Hmean recv-8192 4834.06 ( 0.00%) 5028.05 * 4.01%*
Hmean recv-16384 7568.70 ( 0.00%) 7733.69 * 2.18%*
Below are statistics showing performance when there is context-switch
noise and/or tasks are jumping around CPUs. This is from the netperf-udp
benchmark, running only the 64B test case once with tracing enabled,
repeated 100 times in a bash loop. The two profiled functions,
activate_task() and deactivate_task(), are the enqueue/dequeue entry
points where the uclamp accounting runs; see the sketch below.
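A simplified sketch of the call path being profiled (illustrative; in the
actual kernel/sched/core.c, uclamp_rq_inc()/uclamp_rq_dec() are called
from enqueue_task()/dequeue_task(), which these two functions wrap, and
deactivate_task() handles migration differently):

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
	uclamp_rq_inc(rq, p);		/* uclamp cost on every enqueue */
	p->sched_class->enqueue_task(rq, p, flags);
	p->on_rq = TASK_ON_RQ_QUEUED;
}

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
	p->on_rq = 0;
	uclamp_rq_dec(rq, p);		/* and on every dequeue */
	p->sched_class->dequeue_task(rq, p, flags);
}

Since every enqueue/dequeue goes through this path, these two functions
are a good probe point for the uclamp overhead.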
The traced-function performance is combined and presented in statistical
form (pandas DataFrame describe() output).
It can be compared with the base kernel results, or with the results for
a similar kernel (but w/o this fix) available at [1].
For completeness I also put them below.
Kernel with uclamp tasks + task groups + this fix
activate_task
Hit Time_us Avg_us s^2_us
count 101.00 101.00 101.00 101.00
mean 26,269.33 19,397.98 1.15 0.51
std 123,464.10 90,121.64 0.36 0.19
min 101.00 161.49 0.37 0.03
50% 408.00 479.45 1.26 0.50
75% 720.00 704.05 1.40 0.60
90% 1,795.00 1,071.86 1.57 0.72
95% 3,688.00 1,776.87 1.61 0.79
99% 733,737.00 518,448.60 1.73 1.03
max 756,631.00 537,865.40 1.76 1.06
deactivate_task
Hit Time_us Avg_us s^2_us
count 101.00 101.00 101.00 101.00
mean 111,714.44 55,791.32 0.80 0.27
std 307,358.56 153,230.31 0.26 0.14
min 88.00 91.73 0.31 0.00
50% 464.00 381.30 0.90 0.29
75% 1,118.00 622.70 1.01 0.36
90% 517,991.00 255,669.50 1.10 0.44
95% 997,663.00 484,013.20 1.12 0.47
99% 1,189,980.00 578,987.30 1.14 0.51
max 1,422,640.00 686,828.60 1.16 0.60
Base kernel traced-function performance
(no uclamp, no fixes, just built from the distro config)
activate_task
Hit Time_us Avg_us s^2_us
count 138.00 138.00 138.00 138.00
mean 20,387.44 14,587.33 1.15 0.53
std 114,980.19 81,427.51 0.42 0.23
min 110.00 181.68 0.32 0.00
50% 411.00 461.55 1.32 0.54
75% 881.75 760.08 1.47 0.66
90% 2,885.60 1,302.03 1.61 0.80
95% 55,318.05 41,273.41 1.66 0.92
99% 501,660.04 358,939.04 1.77 1.09
max 1,131,457.00 798,097.30 1.80 1.42
deactivate_task
Hit Time_us Avg_us s^2_us
count 138.00 138.00 138.00 138.00
mean 81,828.83 39,991.61 0.81 0.28
std 260,130.01 126,386.89 0.28 0.14
min 97.00 92.35 0.26 0.00
50% 424.00 340.35 0.94 0.30
75% 1,062.25 684.98 1.05 0.37
90% 330,657.50 168,320.94 1.11 0.46
95% 748,920.70 359,498.23 1.15 0.51
99% 1,094,614.76 528,459.50 1.21 0.56
max 1,630,473.00 789,476.50 1.25 0.60
Old kernel (w/o the fix, uclamp tasks + task groups) which had this uclamp issue
activate_task
Hit Time_us Avg_us s^2_us
count 273.00 273.00 273.00 273.00
mean 15,958.34 16,471.84 1.58 0.67
std 105,096.88 108,322.03 0.43 0.32
min 3.00 4.96 0.41 0.00
50% 245.00 400.23 1.70 0.64
75% 384.00 565.53 1.85 0.78
90% 1,602.00 1,069.08 1.95 0.95
95% 3,403.00 1,573.74 2.01 1.13
99% 589,484.56 604,992.57 2.11 1.75
max 1,035,866.00 1,096,975.00 2.40 3.08
deactivate_task
Hit Time_us Avg_us s^2_us
count 273.00 273.00 273.00 273.00
mean 94,607.02 63,433.12 1.02 0.34
std 325,130.91 216,844.92 0.28 0.16
min 2.00 2.79 0.29 0.00
50% 244.00 291.49 1.11 0.36
75% 496.00 448.72 1.19 0.43
90% 120,304.60 82,964.94 1.25 0.55
95% 945,480.60 626,793.58 1.33 0.60
99% 1,485,959.96 1,010,615.72 1.40 0.68
max 2,120,682.00 1,403,280.00 1.80 1.11
Based on these statistics, the kernel with this fix shows a better
distribution at almost all of the marked percentile points, and better
netperf-udp performance.
I can also run tests today for a kernel with just uclamp tasks, if
needed.
Regards,
Lukasz
[1]
https://lore.kernel.org/lkml/d9c951da-87eb-ab20-9434-f15b34096d66@xxxxxxx/