Re: [PATCH] optimize ia32 memmove

From: Andreas Dilger
Date: Tue Dec 30 2003 - 03:14:57 EST


On Dec 30, 2003 02:56 -0500, Jeff Garzik wrote:
> Andrew Morton wrote:
> > Jeff Garzik <jgarzik@xxxxxxxxx> wrote:
> >
> >>Linux Kernel Mailing List wrote:
> >>
> >>>ChangeSet 1.1496.22.32, 2003/12/29 21:45:30-08:00, akpm@xxxxxxxx
> >>>
> >>> [PATCH] optimize ia32 memmove
> >>>
> >>> From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
> >>>
> >>> The memmove implementation of i386 is not optimized: it uses movsb, which is
> >>> far slower than movsd. The optimization is trivial: if dest is less than
> >>> source, then call memcpy(). markw tried it on a 4xXeon with dbt2, it saved
> >>> around 300 million cpu ticks in cache_flusharray():
> >>
> >>[...]
> >>
> >>>diff -Nru a/include/asm-i386/string.h b/include/asm-i386/string.h
> >>>--- a/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> >>>+++ b/include/asm-i386/string.h Mon Dec 29 23:13:20 2003
> >>>@@ -299,14 +299,9 @@
> >>> static inline void * memmove(void * dest,const void * src, size_t n)
> >>> {
> >>> int d0, d1, d2;
> >>>-if (dest<src)
> >>>-__asm__ __volatile__(
> >>>- "rep\n\t"
> >>>- "movsb"
> >>>- : "=&c" (d0), "=&S" (d1), "=&D" (d2)
> >>>- :"0" (n),"1" (src),"2" (dest)
> >>>- : "memory");
> >>>-else
> >>>+if (dest<src) {
> >>>+ memcpy(dest,src,n);
> >>>+} else
> >>> __asm__ __volatile__(
> >>> "std\n\t"
> >>> "rep\n\t"
> >>
> >>Dumb question, though... what about the overlap case, when dest<src ?
> >> It seems to me this change is ignoring that.
> >>
> >
> >
> > "if dest is less that source, then call memcpy". If the move is to a
> > higher address we do it the old way.
>
>
> I'm confused... that doesn't say anything to me about overlap.
>
> They can still overlap: Consider if dest is 1 byte less than src, and
> n==128...

The non-overlapping cases are probably very common and worth optimizing for:

if (dest + n < src || src + n < dest) {
memcpy(dest, src, n);
} else {
"old way"
}

People generally call memmove() if they _think_ the areas might overlap,
but if memcpy() can be so much faster it is probably worth catching the
cases where the two areas don't actually overlap. The above assumes that
"dest + n" and "src + n" don't wrap, but that is probably a broken case
in the current code so not worth adding extra checks in for.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/