Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when without FSRM/ERMS)

From: Herton Krzesinski
Date: Thu Mar 20 2025 - 13:51:54 EST


On Thu, Mar 20, 2025 at 11:36 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@xxxxxxxxxx> wrote:
> > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> > index fc9fb5d06174..b8f74d80f35c 100644
> > --- a/arch/x86/lib/copy_user_64.S
> > +++ b/arch/x86/lib/copy_user_64.S
> > @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> > _ASM_EXTABLE_UA( 0b, 1b)
> >
> > .Llarge_movsq:
> > + /* Do the first possibly unaligned word */
> > +0: movq (%rsi),%rax
> > +1: movq %rax,(%rdi)
> > +
> > + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> > + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> > +
> > + /* What would be the offset to the aligned destination? */
> > + leaq 8(%rdi),%rax
> > + andq $-8,%rax
> > + subq %rdi,%rax
> > +
> > + /* .. and update pointers and count to match */
> > + addq %rax,%rdi
> > + addq %rax,%rsi
> > + subq %rax,%rcx
> > +
> > + /* make %rcx contain the number of words, %rax the remainder */
> > movq %rcx,%rax
> > shrq $3,%rcx
> > andl $7,%eax
>
> The patch looks fine to me, but there is more to do if you are up for it.
>
> It was quite some time since I last seriously played with the area and
> I don't remember all the details, on top of that realities of uarchs
> probably improved.
>
> That said, have you experimented with aligning the target to 16 bytes
> or more bytes?

Yes I tried to do 32-byte write aligned on an old Xeon (Sandy Bridge based)
and got no improvement at least in the specific benchmark I'm doing here.
Also after your question here I tried 16-byte/32-byte on the AMD cpu as
well and got no difference from the 8-byte alignment, same bench as well.
I tried to do 8-byte alignment for the ERMS case on Intel and got no
difference on the systems I tested. I'm not saying it may not improve in
some other case, just that in my specific testing I couldn't tell/measure
any improvement.

>
> Moreover, I have some recollection that there were uarchs with ERMS
> which also liked the target to be aligned -- as in perhaps this should
> be done regardless of FSRM?

Where I tested I didn't see improvements but may be there is some case,
but I didn't have any.

>
> And most importantly memset, memcpy and clear_user would all use a
> revamp and they are missing rep handling for bigger sizes (I verified
> they *do* show up). Not only that, but memcpy uses overlapping stores
> while memset just loops over stuff.
>
> I intended to sort it out long time ago and maybe will find some time
> now that I got reminded of it, but I would be deligthed if it got
> picked up.
>
> Hacking this up is just some screwing around, the real time consuming
> part is the benchmarking so I completely understand if you are not
> interested.

Yes, the most time you spend is on benchmarking. May be later I could
try to take a look but will not put any promises on it.

>
> --
> Mateusz Guzik <mjguzik gmail.com>
>