Re: [LKP] [SUNRPC] 0472e47660: fsmark.app_overhead 16.0% regression

From: Xing Zhengjun
Date: Wed Jul 24 2019 - 01:18:01 EST




On 7/12/2019 2:42 PM, Xing Zhengjun wrote:
Hi Trond,

   I have attached the perf-profile entries with the largest changes; I hope they are useful for analyzing the issue.

Ping...



In testcase: fsmark
on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 384G memory
with following parameters:

        iterations: 20x
        nr_threads: 64t
        disk: 1BRD_48G
        fs: xfs
        fs2: nfsv4
        filesize: 4M
        test_size: 80G
        sync_method: fsyncBeforeClose
        cpufreq_governor: performance

test-description: fsmark is a file system benchmark for testing synchronous write workloads, for example a mail server workload.
test-url: https://sourceforge.net/projects/fsmark/

commit:
 e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()")
 0472e47660 ("SUNRPC: Convert socket page send code to use iov_iter()")
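
For context, below is a schematic sketch of the shape of the change in 0472e47660, reconstructed only from the commit subjects and the call chains in the profile further down. It is not the actual diff, it is kernel-internal code that will not compile stand-alone, and the two helper functions are purely illustrative names; the point is that the page portion of the RPC send buffer moves from the socket's sendpage operation to a bio_vec-backed iov_iter handed to sock_sendmsg().

#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* Before (e791f8e938 and earlier, schematic): page data is handed to the
 * socket's sendpage operation, so TCP can reference the page directly
 * without copying it on the send path. */
static int send_page_old(struct socket *sock, struct page *page,
                         int offset, size_t len, int flags)
{
        return sock->ops->sendpage(sock, page, offset, len, flags);
}

/* After (0472e47660, schematic): page data is wrapped in a bio_vec
 * iterator and pushed through sock_sendmsg(); tcp_sendmsg_locked() then
 * pulls the bytes out of the iterator, which matches the
 * _copy_from_iter_full -> memcpy_from_page -> memcpy_erms chain that
 * grows in the profile below. */
static int send_page_new(struct socket *sock, struct page *page,
                         int offset, size_t len, int flags)
{
        struct bio_vec bvec = {
                .bv_page   = page,
                .bv_offset = offset,
                .bv_len    = len,
        };
        struct msghdr msg = { .msg_flags = flags };

        iov_iter_bvec(&msg.msg_iter, WRITE, &bvec, 1, len);
        return sock_sendmsg(sock, &msg);
}

If that reconstruction is roughly right, the extra memcpy_erms cycles in the profile would be the page payload being copied once more on the TCP send path rather than referenced directly.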

e791f8e9380d945e 0472e476604998c127f3c80d291
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    527.29           -22.6%     407.96        fsmark.files_per_sec
      1.97 ± 11%      +0.9        2.88 ±  4%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry
      0.00            +0.9        0.93 ±  4%  perf-profile.calltrace.cycles-pp.tcp_write_xmit.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages
      2.11 ± 10%      +0.9        3.05 ±  4%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
      5.29 ±  2%      +1.2        6.46 ±  7%  perf-profile.calltrace.cycles-pp.svc_recv.nfsd.kthread.ret_from_fork
      9.61 ±  5%      +3.1       12.70 ±  2%  perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
      9.27 ±  5%      +3.1       12.40 ±  2%  perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.calltrace.cycles-pp.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
      0.00            +3.4        3.41 ±  4%  perf-profile.calltrace.cycles-pp.memcpy_erms.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg
      0.00            +3.4        3.44 ±  4%  perf-profile.calltrace.cycles-pp.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg
      0.00            +3.5        3.54 ±  4%  perf-profile.calltrace.cycles-pp._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.calltrace.cycles-pp.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread.kthread
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.calltrace.cycles-pp.rpc_async_schedule.process_one_work.worker_thread.kthread.ret_from_fork
      1.81 ±  4%      +3.8        5.59 ±  4%  perf-profile.calltrace.cycles-pp.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread
      1.80 ±  3%      +3.8        5.59 ±  3%  perf-profile.calltrace.cycles-pp.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work
      1.73 ±  4%      +3.8        5.54 ±  4%  perf-profile.calltrace.cycles-pp.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule
      1.72 ±  4%      +3.8        5.54 ±  4%  perf-profile.calltrace.cycles-pp.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute
      0.00            +5.4        5.42 ±  4%  perf-profile.calltrace.cycles-pp.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request
      0.00            +5.5        5.52 ±  4%  perf-profile.calltrace.cycles-pp.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit
      0.00            +5.5        5.53 ±  4%  perf-profile.calltrace.cycles-pp.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit
      9.61 ±  5%      +3.1       12.70 ±  2%  perf-profile.children.cycles-pp.worker_thread
      9.27 ±  5%      +3.1       12.40 ±  2%  perf-profile.children.cycles-pp.process_one_work
      6.19            +3.2        9.40 ±  4%  perf-profile.children.cycles-pp.memcpy_erms
     34.53 ±  4%      +3.3       37.78 ±  2%  perf-profile.children.cycles-pp.ret_from_fork
     34.52 ±  4%      +3.3       37.78 ±  2%  perf-profile.children.cycles-pp.kthread
      0.00            +3.5        3.46 ±  4%  perf-profile.children.cycles-pp.memcpy_from_page
      0.00            +3.6        3.56 ±  4%  perf-profile.children.cycles-pp._copy_from_iter_full
      2.47 ±  4%      +3.7        6.18 ±  3%  perf-profile.children.cycles-pp.__rpc_execute
      2.30 ±  5%      +3.7        6.02 ±  3%  perf-profile.children.cycles-pp.rpc_async_schedule
      1.90 ±  4%      +3.8        5.67 ±  3%  perf-profile.children.cycles-pp.call_transmit
      1.89 ±  3%      +3.8        5.66 ±  3%  perf-profile.children.cycles-pp.xprt_transmit
      1.82 ±  4%      +3.8        5.62 ±  3%  perf-profile.children.cycles-pp.xs_tcp_send_request
      1.81 ±  4%      +3.8        5.62 ±  3%  perf-profile.children.cycles-pp.xs_sendpages
      0.21 ± 17%      +5.3        5.48 ±  4%  perf-profile.children.cycles-pp.tcp_sendmsg_locked
      0.25 ± 18%      +5.3        5.59 ±  3%  perf-profile.children.cycles-pp.tcp_sendmsg
      0.26 ± 16%      +5.3        5.60 ±  3%  perf-profile.children.cycles-pp.sock_sendmsg
      1.19 ±  5%      +0.5        1.68 ±  3%  perf-profile.self.cycles-pp.get_page_from_freelist
      6.10            +3.2        9.27 ±  4%  perf-profile.self.cycles-pp.memcpy_erms


On 7/9/2019 10:39 AM, Xing Zhengjun wrote:
Hi Trond,

On 7/8/2019 7:44 PM, Trond Myklebust wrote:
I've asked several times now about how to interpret your results. As far as I can tell from your numbers, the overhead appears to be entirely contained in the NUMA section of your results.
IOW: it would appear to be a scheduling overhead due to NUMA. I've been asking whether or not that is a correct interpretation of the numbers you published.
Thanks for your feedback. I used the same hardware and the same test parameters to test the two commits:
   e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()")
   0472e47660 ("SUNRPC: Convert socket page send code to use iov_iter()")

If it were caused by NUMA, why does throughput decrease only with commit 0472e47660? The file system we test is NFS, and commit 0472e47660 is related to the network; could you help check whether there are any other clues to the regression? Thanks.



--
Zhengjun Xing