Re: [PATCH] x86/cpufeatures: Add support for fast short rep mov

From: Luck, Tony
Date: Tue Jan 07 2020 - 17:36:11 EST


On Tue, Jan 07, 2020 at 07:40:03PM +0100, Borislav Petkov wrote:
> I don't mind this graph being part of the commit message - it shows
> nicely the speedup even if with some microbenchmark. Or you're not
> adding it just because it is a microbenchmark and not something more
> representative?

I'm not sure it should be archived forever in the commit message.
The benchmark was run on A-step silicon, so may not be representative
of production results.

> > .Lmemmove_begin_forward:
> > + ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
>
> So the enhancement is for string lengths up to two cachelines. Why
> are you limiting this to 32 bytes?
>
> I know, the function handles 32-bytes at a time but what I'd imagine
> here is having the fastest variant upfront which does REP; MOVSB for all
> lengths since FSRM means fast short strings and ERMS - and I'm strongly
> assuming here FSRM *implies* ERMS - means fast "longer" strings, so to
> speak, so FSRM would mean fast *all length* strings in the end, no?
>
> Also, does the copy direction influence the FSRM's REP; MOVSB variant's
> performance? If not, you can do something like this:

Yes FSRM implies ERMS

You can't use REP MOVS for overlapping src/dst strings (not even with the fancy
newer, faster, shinier FSRM version). So your suggestion will not work.

The old memmove code looked something like:

if (len < 32)
copy tail (backwards ... 8/4/2/1 bytes. works for both overlap & non-overlap case)
return
else if overlap src/dst
copy backwards 32-byte unrolled
copy tail
return
else if (ERMS)
REP MOVS;
return
else
unrolled copy 32-byte
copy tail

The new one with my changes looks something like

if (! overlap src/dst)
if (FSRM)
rep movs
return
if (len < 32)
copy tail
return
if (ERMS)
rep movs
return
unrolled copy
else
if (len < 32)
copy tail
return
copy backwards 32-byte unrolled
copy tail

-Tony