Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic()
From: David Laight
Date: Tue Mar 18 2025 - 17:59:37 EST
On Sun, 16 Mar 2025 12:09:47 +0100
Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > > It does look good in my testing here, I built same kernel I was
> > > using for testing the original patch (based on 6.14-rc6), this is
> > > one of the results I got in one of the runs testing on the same
> > > machine:
> > >
> > > CPU RATE SYS TIME sender-receiver
> > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2%
> > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7%
> > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7%
> > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1%
> > >
> > > There are still some variations between runs, which is expected as
> > > was the same when I tested my patch or in the not aligned case, but
> > > it's consistently better/higher than the no align case. Looks
> > > really it's sufficient to align for the higher than or equal 64
> > > bytes copy case.
> >
> > Mind sending a v2 patch with a changelog and these benchmark numbers
> > added in, and perhaps a Co-developed-by tag with Linus or so?
>
> BTW., if you have a test system available, it would be nice to test a
> server CPU in the Intel spectrum as well. (For completeness mostly, I'd
> not expect there to be as much alignment sensitivity.)
>
> The CPU you tested, AMD Epyc 7742 was launched ~6 years ago so it's
> still within the window of microarchitectures we care about. An Intel
> test would be nice from a similar timeframe as well. Older is probably
> better in this case, but not too old. :-)

Is that loop doing aligned 'rep movsq'?

Pretty much all the Intel (non-Atom) CPUs have some variant of FSRM.
For FSRM you get double the throughput if the destination is 32-byte aligned.
No other alignment makes any difference.
The cycle cost is per 16/32-byte block and different families have
different costs for the first few blocks; after that you get 1 block/clock.
That goes all the way back to Sandy Bridge and Ivy Bridge.
I don't think anyone has tried doing that alignment.
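
For illustration only, a minimal user-space sketch of what aligning the
destination to 32 bytes ahead of the bulk copy could look like. The names
head_bytes() and copy_dst_aligned32() are hypothetical, memcpy() merely
stands in for the 'rep movs' body, and the 64-byte threshold is an
assumption borrowed from the test results quoted above; this is not the
kernel's copy_user_generic().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: number of bytes needed to advance 'dst' to the
 * next 'align'-byte boundary. 'align' must be a power of two.
 */
static size_t head_bytes(const void *dst, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size_t)(-(uintptr_t)dst & (align - 1));
}

/*
 * Hypothetical user-space stand-in: copy a short unaligned head first so
 * that the bulk copy starts at a 32-byte aligned destination. Short
 * copies (assumed threshold: under 64 bytes) skip the alignment step.
 */
static void *copy_dst_aligned32(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head = head_bytes(d, 32);

	if (len >= 64 && head) {
		memcpy(d, s, head);	/* at most 31 bytes */
		d += head;
		s += head;
		len -= head;
	}
	memcpy(d, s, len);		/* stands in for the 'rep movs' bulk copy */
	return dst;
}
```

Whether the extra head copy pays for itself would have to be measured on
each CPU family; the sketch only shows the address arithmetic.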

I'm sure I've measured misaligned 64-bit writes and got no significant cost.
It might be one extra clock for writes that cross cache line boundaries.
Misaligned reads are pretty much 'cost free' - just about measurable
on the ip-checksum code loop (and, IIRC, even when running a 'three reads
every two clocks' algorithm).

I don't have access to a similar range of AMD chips.

David
>
> ( Note that the Intel test is not required to apply the fix IMO - we
> did change alignment patterns ~2 years ago in a5624566431d which
> regressed. )
>
> Thanks,
>
> Ingo
>