Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
From: David Laight
Date: Wed Mar 26 2025 - 18:45:39 EST
On Tue, 25 Mar 2025 19:42:09 -0300
Herton Krzesinski <hkrzesin@xxxxxxxxxx> wrote:
...
> I have been trying to also measure the impact of changes like the above,
> however, it seems I don't get an improvement, or it's limited due to the
> impact of profiling. I tried to uninline/move copy_user_generic() like this:
If you use the PERF_COUNT_HW_CPU_CYCLES counter bracketed by 'mfence',
you can get reasonably consistent cycle counts for short sequences.
The problem here is that you need the specific cpu that is causing the
issues, probably zen2 or zen3.
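The measurement itself is just something along the lines of the sketch
below (user-space, illustrative only; the event setup, the static
buffers and the 128-byte length are placeholders, not the harness I
actually run):

/* Minimal sketch: count CPU cycles around a short sequence, with
 * 'mfence' either side so neighbouring instructions can't overlap the
 * timed region.  Run it many times and take the minimum.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_cycle_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;

	/* perf_event_open() has no glibc wrapper. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	static char dst[4096], src[4096];
	uint64_t start, end;
	void *d = dst;
	const void *s = src;
	unsigned long len = 128;	/* placeholder length */
	int fd = open_cycle_counter();

	if (fd < 0)
		return 1;

	read(fd, &start, sizeof(start));
	asm volatile("mfence" ::: "memory");

	/* The sequence under test - here a 'rep movsb'. */
	asm volatile("rep movsb"
		     : "+D" (d), "+S" (s), "+c" (len)
		     : : "memory");

	asm volatile("mfence" ::: "memory");
	read(fd, &end, sizeof(end));

	printf("%lu clocks (including test overhead)\n",
	       (unsigned long)(end - start));
	return 0;
}

Replace the 'rep movsb' with a plain 'nop' to get the test overhead
figure quoted below.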
Benchmarking 'rep movsb' on a zen5 can be summarised:
Test overhead: 195 clocks (the 'rep movsb' asm replaced with a 'nop'),
subtracted from the other values.
length (hex)  clocks
0             7
1..3f         5
40            4
41..7f        5
80..1ff       39   (except 16c, which is 4 clocks faster!)
200           38
201..23f      40
240           38
241..27f      41
280           39
The pattern then continues much the same, increasing by 1 clock every
64 bytes, with exact multiples of 64 being a bit cheaper.
With a 'following wind' a copy loop should do 8 bytes/clock.
(Faster if the cpu supports more than one write per clock.)
So a copy loop might be faster for lengths between 128 and ~256 bytes.
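For reference, the sort of copy loop that can hit that rate is just a
word-at-a-time loop (illustrative sketch, alignment handling omitted):

/* Illustrative only: copy 8 bytes per iteration.  With the load,
 * store and loop overhead all pipelined this can approach one
 * 8-byte store per clock on a cpu with a single store port.
 */
static void copy_words(void *dst, const void *src, unsigned long len)
{
	unsigned long *d = dst;
	const unsigned long *s = src;

	while (len >= 8) {
		*d++ = *s++;
		len -= 8;
	}
	/* Tail (or an overlapping 8-byte store) handled separately. */
}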
Misaligning the addresses doesn't usually make any difference.
(There is a small penalty for destinations in the last cache line of a page.)
But there is a strange oddity.
If (dest - src) % 4096 is between 1 and 63 then short copies take 55
clocks, jumping to 75 at 128 bytes and then increasing slowly.
(I think that matches what I've seen.)
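For completeness, that slow case is just (illustrative sketch):

/* Illustrative only: true when (dest - src) % 4096 lands in 1..63,
 * the range that shows the extra ~50 clocks described above.
 */
static int slow_alias_case(const void *dest, const void *src)
{
	unsigned long delta = ((unsigned long)dest - (unsigned long)src) % 4096;

	return delta >= 1 && delta <= 63;
}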
David