x86 copy performance regression

From: Eric Dumazet
Date: Fri May 26 2023 - 11:00:41 EST


Hi Linus

While testing unrelated patches using upstream net-next kernels,
I found a big regression in sendmsg()/recvmsg() caused by a series of yours.

Tested platforms:

Intel(R) Xeon(R) Gold 6268L CPU @ 2.80GHz

Kernel profiles show rep_movs_alternative() using more cycles than the
previous variant (copy_user_enhanced_fast_string, which simply used
"rep movsb"), and we can no longer reach line rate (as we could before
the series).


Patch series:

commit a5624566431de76b17862383d9ae254d9606cba9
Merge: 487c20b016dc48230367a7be017f40313e53e3bd 034ff37d34071ff3f48755f728cd229e42a4f15d
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Mon Apr 24 10:39:27 2023 -0700

Merge branch 'x86-rep-insns': x86 user copy clarifications

Merge my x86 user copy updates branch.

IMO this patch seems to assume that tcp sendmsg() only copies small areas.
(tcp_sendmsg() usually copies 32KB at a time, if order-3 page
allocations are possible.)

commit adfcf4231b8cbc2d9c1e7bfaa965b907e60639eb
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sat Apr 15 13:14:59 2023 -0700

x86: don't use REP_GOOD or ERMS for user memory copies

The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>



The issue is that (some of) our platforms do have ERMS but not FSRM.
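
For anyone who wants to check whether a given machine falls into the same
"ERMS but not FSRM" bucket, here is a minimal sketch. It assumes an x86
Linux host where /proc/cpuinfo lists the "erms" and "fsrm" flags; the
helper name copy_features() is made up for this example and is not part
of any kernel tooling.

```python
def copy_features(cpuinfo_text):
    """Report whether the first CPU entry advertises 'erms' and 'fsrm'.

    cpuinfo_text is the content of /proc/cpuinfo (assumption: x86 layout,
    with a 'flags' line holding space-separated feature names).
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            return {"erms": "erms" in flags, "fsrm": "fsrm" in flags}
    # No 'flags' line found (for example on a non-x86 machine).
    return {"erms": False, "fsrm": False}
```

Calling copy_features(open("/proc/cpuinfo").read()) on a host like the one
described above would be expected to report erms as True and fsrm as False.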

Test run on 6.3 (single TCP flow, sending 32 MB of payload to a
zerocopy receiver to make sure the receiver is not the bottleneck).
100Gbit link speed.

# perf stat taskset 02 tcp_mmap -H 2002:a05:6608:295::

Performance counter stats for 'taskset 02 ./tcp_mmap -H 2002:a05:6608:295::':

          2,815.79 msec task-clock          #    0.936 CPUs utilized
             2,370      context-switches    #  841.682 /sec
                 1      cpu-migrations      #    0.355 /sec
               127      page-faults         #   45.103 /sec
    10,106,383,352      cycles              #    3.589 GHz
     6,936,487,168      instructions        #    0.69  insn per cycle
     1,206,325,691      branches            #  428.414 M/sec
        10,327,112      branch-misses       #    0.86% of all branches

3.007810265 seconds time elapsed

0.005158000 seconds user
2.406125000 seconds sys


Same test on linux-6.4-rc1:

# perf stat taskset 02 tcp_mmap -H 2002:a05:6608:295::

Performance counter stats for 'taskset 02 ./tcp_mmap -H 2002:a05:6608:295::':

          4,039.73 msec task-clock          #    1.000 CPUs utilized
                12      context-switches    #    2.970 /sec
                 1      cpu-migrations      #    0.248 /sec
               130      page-faults         #   32.180 /sec
    14,639,828,754      cycles              #    3.624 GHz
    19,443,379,653      instructions        #    1.33  insn per cycle
     1,931,003,961      branches            #  478.003 M/sec
        12,349,476      branch-misses       #    0.64% of all branches

4.040825111 seconds time elapsed

0.012496000 seconds user
3.560336000 seconds sys
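
To put the two runs side by side, the following sketch computes the
6.4-rc1 / 6.3 ratio for a few of the counters. The numbers are copied from
the perf stat output in this mail; the dictionary layout and the ratios()
helper are only illustrative.

```python
# Counters copied from the two perf stat runs (6.3 vs 6.4-rc1).
RUNS = {
    "6.3": {
        "task_clock_ms": 2815.79,
        "cycles": 10_106_383_352,
        "instructions": 6_936_487_168,
        "elapsed_s": 3.007810265,
        "sys_s": 2.406125,
    },
    "6.4-rc1": {
        "task_clock_ms": 4039.73,
        "cycles": 14_639_828_754,
        "instructions": 19_443_379_653,
        "elapsed_s": 4.040825111,
        "sys_s": 3.560336,
    },
}


def ratios(before, after):
    """Return after/before for every counter present in 'before'."""
    return {name: after[name] / before[name] for name in before}


if __name__ == "__main__":
    for name, value in ratios(RUNS["6.3"], RUNS["6.4-rc1"]).items():
        print(f"{name:>14}: x{value:.2f}")
```

With these inputs, cycles grow by roughly 1.45x and instructions by roughly
2.8x, while elapsed time grows by about 1.34x.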

Thanks.