Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic()

From: David Laight
Date: Wed Mar 19 2025 - 09:07:53 EST


On Tue, 18 Mar 2025 19:50:41 -0300
Herton Krzesinski <hkrzesin@xxxxxxxxxx> wrote:

> On Tue, Mar 18, 2025 at 6:59 PM David Laight
> <david.laight.linux@xxxxxxxxx> wrote:
...
> For Intel, I was looking and looks like after Sandy Bridge based CPUs
> most/almost all have ERMS, and FSRM is something only newer ones have.
> So the way back to Ivy Bridge is ERMS and not FSRM.

ERMS behaves much the same as FSRM.
The cost of the first transfer is a few clocks higher (maybe 30 not 24),
and (IIRC) the overhead for the next couple of blocks is a bit bigger.
Reading Agner's tables (again) Haswell will do 32 bytes/clock
(for an aligned destination) whereas Sandy/Ivy Bridge 'only' do 16.
I doubt it is enough to treat them differently.
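
A back-of-envelope sketch of that comparison, assuming the simple model
clocks ~= setup + len / bytes-per-clock. The helper name est_clocks() is
made up for illustration, and the 30 clock setup is just the 'maybe 30'
guess from above applied to both cpu families; none of this is measured.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical cost model for an aligned 'rep movs' copy:
 * a fixed setup cost plus one clock per 'bytes_per_clock' bytes
 * (rounded up). Real hardware is not this linear.
 */
static unsigned long est_clocks(unsigned long setup, size_t len,
				unsigned int bytes_per_clock)
{
	assert(bytes_per_clock != 0);
	return setup + (len + bytes_per_clock - 1) / bytes_per_clock;
}
```

With these assumed numbers a 64 byte copy comes out at 32 clocks
(32 bytes/clock) against 34 (16 bytes/clock), and a 256 byte copy at
38 against 46, so the setup cost dominates for short copies.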

The real issue with using (aligned) 'rep movsq' was the 140 clock
setup cost on P4 netburst (and no one cares about that any more).
I don't think anything else really needs an open coded loop.
There is no hint in the tables of the AMD cpus (going way back)
having long setup times.

The differing cost of different ways of aligning the copy will show
up most on short copies.
You also need to benchmark differing sizes/alignments - otherwise the branch
predictor will get it right every time - which it doesn't in 'real code'.
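
An untested userspace sketch of what that could look like: build a table
of pseudo-random lengths and source/destination offsets up front, then
run the copy over the table so consecutive calls do not share a size or
alignment. All the names here (copy_case, fill_cases, run_cases) and the
256 byte / 8 byte limits are made up, memcpy() is only a stand-in for
the routine under test, and the timing itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEN   256u	/* assumed upper bound for 'short' copies */
#define MAX_ALIGN 8u	/* offsets 0..7 within an 8 byte word */

struct copy_case {
	uint16_t len;
	uint8_t src_off;
	uint8_t dst_off;
};

/* xorshift32: small deterministic PRNG so that runs are repeatable. */
static uint32_t next_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/* Fill 'cases' with mixed lengths and misalignments. */
static void fill_cases(struct copy_case *cases, size_t n, uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift must not start at 0 */
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t r = next_rand(&state);

		cases[i].len = r % (MAX_LEN + 1);
		cases[i].src_off = (r >> 16) % MAX_ALIGN;
		cases[i].dst_off = (r >> 24) % MAX_ALIGN;
	}
}

/*
 * Run every case once; 'dst' and 'src' must each be at least
 * MAX_LEN + MAX_ALIGN bytes and must not overlap.
 * Returns the total number of bytes copied.
 */
static size_t run_cases(const struct copy_case *cases, size_t n,
			unsigned char *dst, const unsigned char *src)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(cases[i].len <= MAX_LEN);
		memcpy(dst + cases[i].dst_off, src + cases[i].src_off,
		       cases[i].len);
		total += cases[i].len;
	}
	return total;
}
```

Timing run_cases() over a table of a few thousand entries, and comparing
it against the same table sorted by length, should give some idea of how
much of the cost is branch mispredicts.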

David