RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS

From: David Laight
Date: Tue Sep 12 2023 - 15:41:27 EST


From: Linus Torvalds
> Sent: 12 September 2023 19:49
>
> On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Interestingly, exactly the same test changed its mind for no reason!
While I got repeatable clock counts (+/-1 in the low hundreds) when
looping the test (ie the 'hot cache' case), I couldn't pin down the
exact overhead well enough to get an accurate bytes/clock figure.
OTOH it was between 30 and 35 - so pretty much certain to be 32.
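
The way I estimate that fixed cost is to time two lengths and
extrapolate back to zero bytes. A minimal sketch of the idea below;
the rdpmc fixed-counter read is an assumption (it needs the counter
enabled and CR4.PCE set, eg via perf) and my real harness has more
plumbing than this:

#include <stdint.h>
#include <stdio.h>

/* Read the Intel fixed-function core-cycle counter via rdpmc.
 * Assumes the counter is enabled and CR4.PCE is set. */
static inline uint64_t cycles(void)
{
        uint32_t lo, hi;

        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" ((1u << 30) | 1));
        return (uint64_t)hi << 32 | lo;
}

static char src[8192] __attribute__((aligned(4096)));
static char dst[8192] __attribute__((aligned(4096)));

static uint64_t time_movsq(size_t len)
{
        void *d = dst;
        const void *s = src;
        size_t n = len / 8;
        uint64_t t0, t1;

        t0 = cycles();
        asm volatile("rep movsq" : "+D" (d), "+S" (s), "+c" (n) :: "memory");
        t1 = cycles();
        return t1 - t0;
}

int main(void)
{
        uint64_t t256, t512;

        time_movsq(512);                /* warm the caches */
        t256 = time_movsq(256);
        t512 = time_movsq(512);

        /* clocks = overhead + bytes/rate, so 2*t(n) - t(2n) = overhead */
        printf("overhead ~= %lld clocks\n", (long long)(2 * t256 - t512));
        return 0;
}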

> Note that the hard case for 'rep movsq' is when the stores cross a
> cacheline (or worse yet, a page) boundary.

Page faults in the middle of misaligned transfers are what make them
hard to implement.
OTOH that is a 'hardware problem' and not specifically worth
optimising for - since it is unlikely.

> That is what makes 'rep movsb' fundamentally simpler in theory. The
> natural reaction is "but movsq does things 8 bytes at a time", but
> once you start doing any kind of optimizations that are actually based
> on bigger areas, the byte counts are actually simpler. You can always
> do them as masked writes up to whatever boundary you like, and just
> restart. There are never any "what about the straddling bytes" issues.

What I found seemed to imply that 'rep movsq' uses the same internal
logic as 'rep movsb' (pretty easy to do in hardware) even though I
don't think the ERMS docs say anything about that.
It may well be cpu specific - but I'd expect later ones to be the same.
(AMD cpus will be different, and I don't have access to anything new.)
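
Both forms are easy enough to wrap for testing. Something like this
sketch (not the kernel's copy_user routines - the tail handling for
movsq is left to the caller):

#include <stddef.h>

static inline void copy_movsb(void *dst, const void *src, size_t len)
{
        asm volatile("rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (len)
                     :: "memory");
}

static inline void copy_movsq(void *dst, const void *src, size_t len)
{
        size_t qwords = len / 8;        /* tail bytes are the caller's problem */

        asm volatile("rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (qwords)
                     :: "memory");
}

If they do share the internal logic then both should show the same
alignment behaviour - which is what my numbers suggested.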

> That's one of the dangers with benchmarking. Do you benchmark the
> unaligned cases? How much do they matter in real life? Do they even
> happen?

A microbenchmark can tell you how bad they are.
Real life will almost certainly be different.

I did some buffer offset measurements (the buffers should have been
8k aligned).
For reasonable length copies (1024 bytes) the source alignment made
almost no difference.
What did matter was the destination alignment: anything other than
a multiple of 32 halved the transfer speed (regardless of the
source alignment).
So 32n+1, 32n+8 and 32n+16 were all equally bad.
Some offsets reduced it further, possibly the writes affecting the
read prefetches - who knows.

Anyway it seemed like there is a pipelined barrel shifter on
the read side (so it could read 32 bytes/clock) but the write
side had to do two writes for each misaligned block.
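
The sweep itself is trivial - reusing the cycles() and copy_movsq()
sketches above with the two page-aligned 8k buffers (the 64-byte
offset range and 1024-byte length mirror the test described, the
rest is illustrative):

        for (int off = 0; off < 64; off++) {
                uint64_t best = ~(uint64_t)0;

                for (int pass = 0; pass < 5; pass++) {
                        uint64_t t0 = cycles();
                        uint64_t t;

                        copy_movsq(dst + off, src, 1024);
                        t = cycles() - t0;
                        if (t < best)
                                best = t;
                }
                printf("dst+%2d: %llu clocks\n", off,
                       (unsigned long long)best);
        }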

> And that's entirely ignoring any "cold vs hot caches" etc issues, or
> the "what is the cost of access _after_ the memcpy/memset".

I count the clocks for 5 iterations.
The first, 'cold cache', pass is massively slower.
The rest are pretty much identical - so 5 is plenty.
For microbenchmarks you really want to assume a 'hot cache'.
The cache loads/invalidates will (usually) be needed sometime,
so there is no point destroying a benchmark by including them.
Especially when looking at short code loops.
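
In code terms the pattern is just this (same cycles()/copy_movsq()
sketches as above; printing every pass makes the cold-cache first
iteration obvious):

        for (int pass = 0; pass < 5; pass++) {
                uint64_t t0 = cycles();

                copy_movsq(dst, src, 1024);
                printf("pass %d: %llu clocks\n", pass,
                       (unsigned long long)(cycles() - t0));
        }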

I'm definitely against running 10000 iterations and measuring wall time.
It really doesn't tell you anything useful.
I can't even use rdtsc - I can't lock the cpu clock frequency.

> Or, in the case of the kernel, our issues with "function calls can now
> be surprisingly expensive, and if we can inline things it can win back
> 20 cycles from a forced mispredict".

And then glibc probably adds a pile of wrappers to convert open()
into openat(AT_FDCWD, ....) for no good reason.
Undoing all the good work.

And then there is the code I've written for an fpga soft-core
which has a limited number of clocks to do its work.
That required careful analysis of static branch prediction (having
found the hidden menu to disable the dynamic one) to ensure
that the worst case paths were fast enough - rather than making
the common paths fastest.

...
> So beware microbenchmarks. That's true in general, but it's
> _particularly_ true of memset/memcpy.

The problem benchmarks I see are the ones that are massively
unrolled and run fast(ish) in a benchmark but just destroy the
i-cache.

If I'm microbenchmarking I'm trying to see if I can get the
cpu to execute as many instructions in parallel as it should,
so that the code loop is limited by (eg) the number of reads.
If you can get a short loop to run 'as fast as possible' there
is no point doing a 1980's unroll.

If anyone had done a microbenchmark of the 'adc' chain in the ip
checksum code (eg on Intel core-2) they'd have panicked about
each one taking two clocks.
Even with more recent cpus you need to adc to alternate registers.
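
The trick looks something like this sketch (pattern only, not the
kernel's csum_partial): splitting the chain across two accumulators
means consecutive adcs are linked only by the carry flag rather
than by one register result.

#include <stdint.h>

/* Sum 32 bytes into a 64-bit ones'-complement accumulator with the
 * adc chain split across two registers.  Every carry out is folded
 * back in, so the final sum is order-independent. */
static uint64_t csum32bytes(const uint64_t *p, uint64_t sum)
{
        uint64_t sum2;

        asm("xorl %k[s2], %k[s2]\n\t"   /* zero sum2 and clear carry */
            "adcq  0(%[p]), %[s1]\n\t"
            "adcq  8(%[p]), %[s2]\n\t"
            "adcq 16(%[p]), %[s1]\n\t"
            "adcq 24(%[p]), %[s2]\n\t"
            "adcq $0, %[s1]\n\t"        /* fold the final carry */
            "addq %[s2], %[s1]\n\t"
            "adcq $0, %[s1]"            /* and the one from the merge */
            : [s1] "+r" (sum), [s2] "=&r" (sum2)
            : [p] "r" (p)
            : "cc", "memory");
        return sum;
}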

David
