RE: generic strncpy - off-by-one error

From: Peter Kjellerstedt
Date: Wed Aug 20 2003 - 02:53:45 EST


[ I combined all my answers to your four mails into this one. ]

> -----Original Message-----
> From: Timothy Miller [mailto:miller@xxxxxxxxxxxxxx]
> Sent: Monday, August 18, 2003 18:07
> To: Peter Kjellerstedt
> Cc: 'Willy Tarreau'; linux-kernel mailing list
> Subject: Re: generic strncpy - off-by-one error
>
> Peter Kjellerstedt wrote:
> > For loops 2.867568 5.620561 8.128734 28.286289
> > Multi byte fill 2.868031 5.670782 6.312027 11.336015
> >
> > And here are the numbers for my P4:
> >
> > For loops 3.060262 5.927378 8.796814 30.659818
> > Multi byte fill 3.126607 5.898459 7.096685 13.135379
> >
> > So there is no doubt that the multi byte version is a clear
> > winner (which was expected, I suppose).
>
> Cool! Hey, is this just an exercise, or are we actually going to
> use this? I would be very happy to have something I contributed
> to put into the kernel. :)

I have no idea. But it sure seems like the generic strncpy() could
use some improvement (and probably strcpy() too). However, I guess
one can argue that it is better to keep the generic implementations
as simple as possible, and then have each architecture implement
their own optimized versions.

> > Here is the code that I used:
> >
> > char *strncpy(char *dest, const char *src, size_t count)
> > {
> > char *tmp = dest;
> >
> > while (count && *src) {
> > *tmp++ = *src++;
> > count--;
> > }
> >
> > if (count) {
>
> Good idea... bad to do so many checks if count is zero. On the
> other hand, if count is rarely zero, then it's a loss. Maybe
> benchmark with and without?

Actually, I think this change would be lost in the noise in
the measurements.

> > size_t count2;
> >
> > while (count & (sizeof(long) - 1)) {
> > *tmp++ = '\0';
> > count--;
> > }
> >
> > count2 = count / sizeof(long);
>
> I know that a good compiler should migrate code to help the CPU
> pipeline, but how about moving this "count2 = " line up to before
> the first fill loop. See if that helps any. Always good to
> precompute well in advance.

Cannot do that as the first loop modifies count.

> > while (count2) {
> > *((long *)tmp)++ = '\0';
> > count2--;
> > }
> >
> > count &= (sizeof(long) - 1);
>
> And move this to before the middle fill loop.

In my later implementations this is not possible any longer.

> > while (count) {
> > *tmp++ = '\0';
> > count--;
> > }
> > }
> >
> > return dest;
> > }


> Daniel Forrest wrote:
> > On Sat, Aug 16, 2003 at 10:15:14AM +0200, Peter Kjellerstedt wrote:
>
> > Shouldn't this be:
> >
> > while (tmp & (sizeof(long) - 1)) {
> >
> >
> >> *tmp++ = '\0';
> >> count--;
> >> }
>
> Oh, yeah! That's right. We need to check the address. Also need to
> cast tmp to (int) or something (doesn't matter what it's cast to,
> because we only care about the lower 2 or 3 bits).
>
> Peter, please see if this makes any speed difference. But it
> definately needs this fix.

Yes, I added it in my later versions.

> Frankly, I'm surprised it works. In fact, it might not, but
> it's hard to tell from the tests just benchmarks.

I actually added verification to my benchmarking program too,
so the later versions I mailed are verified to work the same
as the standard glibc implementation at least.

> Also, if you're doing dense addressing on Alpha, and you do byte
> accesses the addresses for char are byte addresses, but the code
> does read-modify-write to memory for byte accesses, because in
> that mode, you can only do 32-bit and 64-bit accesses. The
> performance increase could be even greater for Alpha than for x86.
>
> For Sparc, we might be able to do something with VIS instructions,
> although I don't know what the setup overhead is. Sun's memcpy and
> memset only use VIS when the size is greater than 512, because
> otherwise, it's not worth it.
>
> I don't know enough about PowerPC other than the proper use of the
> "eieio" instruction. :)

Remember that many architectures already have their own architecture
specific implementations. Also note that most of them have not been
modified to do the null-padding... The following architectures have
their own implementations: alpha, h8300, i386, m68k, m68knommu, mips,
ppc, ppc64, s390 and sh. The following use the generic implementation:
arm, arm26, cris, ia64, mips64, parisc, sparc, sparc64, um, v850 and
x86_64.


> Daniel Forrest wrote:
>
> >
> > - if (count) {
> > + if (count >= sizeof(long)) {
> > size_t count2;
> >
>
> I like this size check here, but the comparison should be to
> some number greater than sizeof(long). There is a threshold
> below which it's not worth it to do all of the extra loops.
> If you're going to fill only four bytes, it's probably best
> to do it just using the last loop.
>
> Maybe through some trial and error, we could determine what that
> threshold is. I'm betting it's something around 2* or 3* word size.

The problem is that this probably varies a lot between different
architectures, and also processor speeds. Probably best left for
the architecture specific implementations.


> Peter Kjellerstedt wrote:
> > For unaligned source or destination the "Multi copy & fill"
> > would degenerate into "Multi byte fill". However, for
> > architectures like ix86 and CRIS that can do unaligned long
> > access, it would be a win to remove the UNALIGNED() check,
> > and use long word copying all the time.
>
> In fact, it's possible to do the copy even if the source and
> dest are not aligned. It requires holding pieces of source in
> a register and doing shifts. If the CPU is much faster than
> the memory, this can be a huge win.

This is true. However, this too is probably best left for the
architecture specific implementations.

> > Then whether using memset() or your filling is a win depends
> > on the architecture and how many bytes needs to be filled.
> > For a slow processor with little function call overhead (like
> > CRIS), using memset seems to be a win almost immediately.
> > However, for a fast processor like my P4, the call to memset
> > is not a win until some 1500 bytes need to be filled.
>
> What is in memset that isn't in the fill code I suggested?

In the CRIS case it sets 12 registers to zero and uses the
movem instruction to copy them to memory at once (movem is
normally used to store/restore registers to/from the stack
on function entry/exit). Thus it can clear 48 bytes each
time through the loop. And of course the rest of the function
is hand crafted to give the best performance for any count.

//Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/