Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic()
From: Herton Krzesinski
Date: Tue Mar 18 2025 - 18:51:55 EST
On Tue, Mar 18, 2025 at 6:59 PM David Laight
<david.laight.linux@xxxxxxxxx> wrote:
>
> On Sun, 16 Mar 2025 12:09:47 +0100
> Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> > > > It does look good in my testing here. I built the same kernel I was
> > > > using for testing the original patch (based on 6.14-rc6); this is
> > > > one of the results I got in one of the runs, testing on the same
> > > > machine:
> > > >
> > > > CPU RATE SYS TIME sender-receiver
> > > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2%
> > > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7%
> > > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7%
> > > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1%
> > > >
> > > > There are still some variations between runs, which is expected; the
> > > > same happened when I tested my patch and in the non-aligned case. But
> > > > the results are consistently better/higher than in the non-aligned
> > > > case. It really looks like it's sufficient to align only for copies
> > > > of 64 bytes or more.
> > >
> > > Mind sending a v2 patch with a changelog and these benchmark numbers
> > > added in, and perhaps a Co-developed-by tag with Linus or so?
> >
> > BTW., if you have a test system available, it would be nice to test a
> > server CPU in the Intel spectrum as well. (For completeness mostly, I'd
> > not expect there to be as much alignment sensitivity.)
> >
> > The CPU you tested, the AMD Epyc 7742, was launched ~6 years ago, so it's
> > still within the window of microarchitectures we care about. An Intel
> > test from a similar timeframe would be nice as well. Older is probably
> > better in this case, but not too old. :-)
>
> Is that loop doing aligned 'rep movsq'?
>
> Pretty much all the Intel (non-atom) CPUs have some variant of FSRM.
> With FSRM you get double the throughput if the destination is 32-byte aligned.
> No other alignment makes any difference.
> The cycle cost is per 16/32-byte block, and different families have
> different costs for the first few blocks; after that you get 1 block/clock.
> That goes all the way back to Sandy Bridge and Ivy Bridge.
> I don't think anyone has tried doing that alignment.
The code under copy_user_generic() has mainly two paths. The first is the
FSRM case, where rep movsb is used right away. Otherwise
rep_movs_alternative is used: for copies of less than 8 bytes a
byte-by-byte copy is done, between 8 and 64 bytes an 8-byte word copy is
used, and for 64 bytes or more either rep movsb is used on CPUs with ERMS,
or rep movsq on CPUs without it. The last case is what Linus' patch
touches, and it also falls back to a byte-by-byte copy for any remaining
tail bytes.
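Roughly, that dispatch looks like this in C (just a sketch for discussion,
not the actual assembly in arch/x86/lib/copy_user_64.S; the function name
is made up and has_erms stands in for the alternatives patching):

/*
 * C sketch of the rep_movs_alternative size dispatch described above;
 * a simplification for illustration, not the kernel implementation.
 */
#include <stddef.h>
#include <string.h>

static void copy_dispatch_sketch(unsigned char *dst, const unsigned char *src,
				 size_t len, int has_erms)
{
	if (len < 8) {				/* byte-by-byte tail copy */
		while (len--)
			*dst++ = *src++;
		return;
	}
	if (len < 64) {				/* 8..63 bytes: 8-byte words */
		while (len >= 8) {
			memcpy(dst, src, 8);	/* stands in for one movq */
			dst += 8; src += 8; len -= 8;
		}
		while (len--)			/* remaining tail bytes */
			*dst++ = *src++;
		return;
	}
	if (has_erms) {				/* large copy with ERMS */
		memcpy(dst, src, len);		/* stands in for rep movsb */
		return;
	}
	/* large copy without ERMS: rep movsq plus byte-by-byte tail */
	size_t words = len >> 3, tail = len & 7;
	memcpy(dst, src, words << 3);		/* stands in for rep movsq */
	dst += words << 3;
	src += words << 3;
	while (tail--)
		*dst++ = *src++;
}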
For Intel, I was looking and it seems that CPUs after Sandy Bridge
most/almost all have ERMS, while FSRM is something only newer ones have.
So what goes all the way back to Ivy Bridge is ERMS, not FSRM.
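(A quick way to check a given box is to grep for the erms/fsrm flags in
/proc/cpuinfo. Alternatively, a small userspace program can query CPUID
leaf 7 directly, where ERMS is EBX bit 9 and FSRM is EDX bit 4:)

/* Report ERMS/FSRM support via CPUID leaf 7, subleaf 0. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
	printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
	return 0;
}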
The patch, as expected, does not affect newer/last-decade Intel CPUs,
since they have ERMS and will use rep movsb for large copies. The only
Intel CPU I found that runs the code in the alignment case (the last rep
movsq case) is a Sandy Bridge based system here, which has neither ERMS
nor FSRM. I tested on one today and saw no difference, at least in the
iperf3 benchmark I'm using. So far only the AMD system I tested benefits
in practice from the alignment in this case; the others I tested don't
regress/get worse, but show no benefit either.
I also tried 32-byte write alignment on the same Sandy Bridge based CPU;
I hope I got the code right:
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index fc9fb5d06174..75bf7e9e9318 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -74,6 +74,36 @@ SYM_FUNC_START(rep_movs_alternative)
_ASM_EXTABLE_UA( 0b, 1b)
.Llarge_movsq:
+ /* Copy the first possibly unaligned 32 bytes (four words) */
+0: movq (%rsi),%rax
+1: movq %rax,(%rdi)
+2: movq 8(%rsi),%rax
+3: movq %rax,8(%rdi)
+4: movq 16(%rsi),%rax
+5: movq %rax,16(%rdi)
+6: movq 24(%rsi),%rax
+7: movq %rax,24(%rdi)
+
+ _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 6b, .Lcopy_user_tail)
+ _ASM_EXTABLE_UA( 7b, .Lcopy_user_tail)
+
+ /* What would be the offset to the aligned destination? */
+ leaq 32(%rdi),%rax
+ andq $-32,%rax
+ subq %rdi,%rax
+
+ /* .. and update pointers and count to match */
+ addq %rax,%rdi
+ addq %rax,%rsi
+ subq %rax,%rcx
+
+ /* make %rcx contain the number of words, %rax the remainder */
movq %rcx,%rax
shrq $3,%rcx
andl $7,%eax
But the code above doesn't perform better than the 8-byte-only
alignment on this Intel CPU (a Xeon E5-2667).
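(For reference, the leaq/andq/subq sequence in the hunk above computes the
distance from %rdi up to the next 32-byte boundary strictly above %rdi, so
the result is always in the 1..32 range and never skips past the 32 bytes
already copied. A small C sketch of the same arithmetic:)

/*
 * Illustration of the alignment arithmetic in the patch above
 * (leaq 32(%rdi),%rax; andq $-32,%rax; subq %rdi,%rax), in C.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	for (uintptr_t dst = 96; dst <= 128; dst += 8) {
		/* distance up to the next 32-byte boundary strictly
		 * above dst, so the result is always in 1..32 */
		uintptr_t off = ((dst + 32) & ~(uintptr_t)31) - dst;

		printf("dst %% 32 = %2lu -> skip %2lu bytes\n",
		       (unsigned long)(dst % 32), (unsigned long)off);
	}
	return 0;
}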
I also tested Linus' patch on another Intel CPU with ERMS (a Xeon(R)
Platinum 8280), and there is no difference with/without the patch, as
expected. However, I'll try a test with alignment before the rep movsb
(ERMS) in the large-copy case to see if there is any difference, though I
think the result will be the same as on Sandy Bridge.
It takes some time to do all the testing/setup, so I'm almost settled on
the 8-byte-only alignment that Linus sent and that I already validated,
since so far I've been unable to get better results with other
approaches/trials.
If you have any patch you want to be tested let me know and I can apply and
run against my iperf3 test case.
>
> I'm sure I've measured misaligned 64-bit writes and got no significant cost.
> It might be one extra clock for writes that cross cache-line boundaries.
> Misaligned reads are pretty much 'cost free' - just about measurable
> on the ip-checksum code loop (and IIRC even when running a
> three-reads-every-two-clocks algorithm).
>
> I don't have access to a similar range of amd chips.
>
> David
>
> >
> > ( Note that the Intel test is not required to apply the fix IMO - we
> > did change alignment patterns ~2 years ago in a5624566431d which
> > regressed. )
> >
> > Thanks,
> >
> > Ingo
> >
>