Re: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes

From: Alexey Dobriyan
Date: Thu Jun 18 2020 - 17:01:57 EST


On Thu, Jun 18, 2020 at 04:39:35PM +0000, David Laight wrote:
> From: Alexey Dobriyan
> > Sent: 18 June 2020 14:17
> ...
> > > > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > > > index fff28c6f73a2..b0dfac3d3df7 100644
> > > > --- a/arch/x86/lib/usercopy_64.c
> > > > +++ b/arch/x86/lib/usercopy_64.c
> > > > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > > > asm volatile(
> > > > " testq %[size8],%[size8]\n"
> > > > " jz 4f\n"
> > > > + " .align 16\n"
> > > > "0: movq $0,(%[dst])\n"
> > > > " addq $8,%[dst]\n"
> > > > " decl %%ecx ; jnz 0b\n"
> > >
> > > You can do better than that loop.
> > > Change 'dst' to point to the end of the buffer, negate the count
> > > and divide by 8 and you get:
> > > "0: movq $0,(%[dst],%%rcx,8)\n"
> > > " add $1,%%rcx\n"
> > > " jnz 0b\n"
> > > which might run at one iteration per clock, especially on CPUs that
> > > fuse the add and jnz into a single uop.
> > > (You need to use add not inc.)
> >
> > /dev/zero should probably use REP STOSB etc just like everything else.
>
> Almost certainly it shouldn't, and neither should anything else.
> Potentially it could use whatever memset() is patched to.
> That MIGHT be 'rep stos' on some cpu variants, but in general
> it is slow.

Yes, that's what I meant: alternatives choosing REP variant.
memset loops are so 21st century.
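
For reference, the negative-index loop quoted above can be sketched in
portable C. This is only an illustration under stated assumptions, not the
kernel's __clear_user(): the helper name clear_qwords_neg_index() is
hypothetical, it works on an ordinary pointer with no __user access or
fault handling, and whether a compiler emits the three-instruction loop
depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not kernel code).
 * Zero 'qwords' 8-byte words starting at dst by indexing backwards
 * from the end of the buffer with a negative counter, so the loop
 * needs only one store, one add and one conditional branch.
 */
static void clear_qwords_neg_index(uint64_t *dst, size_t qwords)
{
	uint64_t *end = dst + qwords;		/* one past the last word */
	ptrdiff_t i = -(ptrdiff_t)qwords;	/* counts up from -qwords to 0 */

	while (i != 0) {
		end[i] = 0;	/* corresponds to movq $0,(end,i,8) */
		i++;		/* the add sets ZF when i reaches 0 */
	}
}
```

The point of the arrangement is that the counter doubles as the index, so
the separate pointer increment from the original loop goes away.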