Re: [PATCH v3 3/3] riscv: optimized memset
From: Matteo Croce
Date: Tue Jun 22 2021 - 21:15:15 EST
On Tue, Jun 22, 2021 at 10:38 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Nick Kossifidis
> > Sent: 22 June 2021 02:08
> >
> > Στις 2021-06-17 18:27, Matteo Croce έγραψε:
> > > +
> > > +void *__memset(void *s, int c, size_t count)
> > > +{
> > > + union types dest = { .u8 = s };
> > > +
> > > + if (count >= MIN_THRESHOLD) {
> > > + const int bytes_long = BITS_PER_LONG / 8;
> >
> > You could make 'const int bytes_long = BITS_PER_LONG / 8;'
>
> What is wrong with sizeof (long) ?
> ...
Nothing, I guess that BITS_PER_LONG is just (sizeof(long) * 8) anyway
> > > + unsigned long cu = (unsigned long)c;
> > > +
> > > + /* Compose an ulong with 'c' repeated 4/8 times */
> > > + cu |= cu << 8;
> > > + cu |= cu << 16;
> > > +#if BITS_PER_LONG == 64
> > > + cu |= cu << 32;
> > > +#endif
> > > +
> >
> > You don't have to create cu here, you'll fill dest buffer with 'c'
> > anyway so after filling up enough 'c's to be able to grab an aligned
> > word full of them from dest, you can just grab that word and keep
> > filling up dest with it.
>
> That will be a lot slower - especially if run on something like x86.
> A write-read of the same size is optimised by the store-load forwarder.
> But the byte write, word read will have to go via the cache.
>
> You can just write:
> cu = (unsigned long)c * 0x0101010101010101ull;
> and let the compiler sort out the best way to generate the constant.
>
Interesting. I see that most compilers do an integer multiplication,
is it faster than three shift and three or?
clang on riscv generates even more instructions to create the immediate:
unsigned long repeat_shift(int c)
{
unsigned long cu = (unsigned long)c;
cu |= cu << 8;
cu |= cu << 16;
cu |= cu << 32;
return cu;
}
unsigned long repeat_mul(int c)
{
return (unsigned long)c * 0x0101010101010101ull;
}
repeat_shift:
slli a1, a0, 8
or a0, a0, a1
slli a1, a0, 16
or a0, a0, a1
slli a1, a0, 32
or a0, a0, a1
ret
repeat_mul:
lui a1, 4112
addiw a1, a1, 257
slli a1, a1, 16
addi a1, a1, 257
slli a1, a1, 16
addi a1, a1, 257
mul a0, a0, a1
ret
> >
> > > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > > + /* Fill the buffer one byte at time until the destination
> > > + * is aligned on a 32/64 bit boundary.
> > > + */
> > > + for (; count && dest.uptr % bytes_long; count--)
> >
> > You could reuse & mask here instead of % bytes_long.
> >
> > > + *dest.u8++ = c;
> > > +#endif
> >
> > I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> > memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> > and even if it did we are talking about a loop that will run just a few
> > times to reach the alignment boundary (worst case scenario it'll run 7
> > times), I don't think we gain much here, even for archs that have
> > efficient unaligned access.
>
> With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
> even checking the alignment.
> While aligning the copy will be quicker for an unaligned buffer they
> almost certainly don't happen often enough to worry about.
> In any case you'd want to do a misaligned word write to the start
> of the buffer - not separate byte writes.
> Provided the buffer is long enough you can also do a misaligned write
> to the end of the buffer before filling from the start.
>
I don't understand this one, a misaligned write here is ~30x slower
than an aligned one because it gets trapped and emulated in SBI.
How can this be convenient?
> I suspect you may need either barrier() or use a ptr to packed
> to avoid the perverted 'undefined behaviour' fubar.'
>
Which UB are you referring to?
Regards,
--
per aspera ad upstream