Re: Ideas for reducing memory copying and zeroing times (fwd)

Robert L Krawitz (rlk@tiac.net)
Fri, 19 Apr 1996 09:54:05 -0400


Date: Fri, 19 Apr 1996 04:16:53 +0200
From: Michael Riepe <riepe@ifwsn4.ifw.uni-hannover.de>

I guess the fastest (and shortest) way to do that on an 80x86 is:

xorl %eax,%eax
leal 4096(page_start),%esp
<repeat 1024 times>
pushl %eax
<end repeat>
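
For readers who don't think in assembler, here is a rough C model of
what the push sequence above does to memory. It is only a sketch: the
name clear_page_push_style and the PAGE_WORDS constant are made up for
illustration, and a C compiler will not turn this into pushl
instructions. It just shows the order of the stores, starting at the top
of the page and working down to page_start.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024   /* 4096-byte page / 4-byte words */

/* Model of the push-based clear: each pushl pre-decrements the stack
   pointer by 4 bytes and then stores a 32-bit zero at the new address. */
void clear_page_push_style(uint32_t *page_start)
{
    uint32_t *sp = page_start + PAGE_WORDS;  /* leal 4096(page_start),%esp */

    while (sp != page_start)
        *--sp = 0;                           /* pushl %eax, with %eax == 0 */
}
```

The loop ends with sp equal to page_start, which is why the real
sequence has to save and restore %esp around it.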

Of course you will have to turn interrupts off while doing this, and
you have to save and restore %esp - but it's twice as fast as your
move-and-increment procedure (assuming that push, mov and increment
operations each take 1 clock cycle - that's what my i486 documentation
says) and takes only 1024+ bytes of code space. Your code takes at
least(?) 4 bytes of code per word cleared, assuming 32-bit protected
mode:

The cycle count for something like this is rarely of interest (unless
the number of machine cycles exceeds a main memory cycle, which is
typically around 100 ns). Anything like copying or zeroing a large
block of memory is completely dominated by the memory accesses, not by
the machine instructions.
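
To put a number on that, a minimal sketch of the arithmetic, assuming
the 100 ns figure above. The function name is hypothetical; it simply
converts a memory cycle time into CPU clock cycles.

```c
/* How many CPU clock cycles fit into one main memory cycle.
   cpu_mhz * 1e6 cycles/s times memory_cycle_ns * 1e-9 s
   simplifies to cpu_mhz * memory_cycle_ns / 1000. */
double cpu_cycles_per_memory_cycle(double cpu_mhz, double memory_cycle_ns)
{
    return cpu_mhz * memory_cycle_ns / 1000.0;
}
```

With cpu_mhz = 90 and memory_cycle_ns = 100 this gives 9 cycles, so any
store sequence that takes fewer than about 9 cycles per memory access
spends its time waiting on memory rather than on the instructions.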

The important consideration is usually the width of the memory bus.
Almost all 486 and 386dx systems have a 32 bit main memory bus, so
using a 32 bit wide instruction will be efficient. Whether it's some
convoluted sequence as above or rep stosd probably doesn't matter too
much.
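
The bus width argument can be sketched the same way. This assumes every
uncached store costs one bus write cycle and that nothing combines
narrow stores into wider ones; both function names are hypothetical.

```c
/* Number of bus write cycles needed to cover a block, assuming one
   write cycle per store and no combining of narrow stores. */
unsigned bus_writes_needed(unsigned block_bytes, unsigned store_bits)
{
    return block_bytes * 8u / store_bits;
}

/* Fraction of the bus width that each store actually uses.
   A store wider than the bus is counted as using the whole bus. */
double bus_utilization(unsigned store_bits, unsigned bus_bits)
{
    unsigned used = store_bits < bus_bits ? store_bits : bus_bits;
    return (double)used / (double)bus_bits;
}
```

On a 32 bit bus, 32 bit stores give a utilization of 1.0 and clear a
4096 byte page in 1024 writes; 16 bit stores would give 0.5 and need
2048 writes. The same 32 bit stores on a bus twice as wide also come out
at 0.5.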

The Pentium's a different story. It has a 64 bit data bus, and most
Pentium systems (except some laptops and a few bottom end desktop
systems) use a 64 bit main memory bus. It also has a write back
cache, but it doesn't allocate a cache line on write (if a particular
address is not cached, it writes through). This offers a number of
possible strategies:

1) Use a conventional method (rep stosd or rep movsd). This only
writes 32 bits at a time, and since the destination is normally not
in cache (if we're trying to zero out a really large chunk of memory),
each write goes straight through to memory and uses only half of the
memory bus on the write cycle. Bad, especially on block zero.

2) Preload the cache (a chunk at a time). This makes writes more
efficient, since the data's written from the cache 64 bits per cycle,
but it requires an extra read. This speeds up memcpy by about 10%, at
least on my system.

3) Use 64 bit instructions. There are very few 64 bit instructions on
the x86 (cmpxchg8b and the FPU instructions). The FPU instructions
are usable for this purpose, since the FPU has integer load and store
instructions (fild and fistp) and its registers are wide enough to hold
a 64 bit integer with no loss of precision (read: corruption). They're
slow (2-6 cycles), but
that's OK even on a 90 MHz Pentium since 6 cycles is still quicker
than a main memory cycle. I've found that performance using this
technique is almost exactly double on block clear (60 vs. 30 MB/sec)
and sharply improved on memcpy (35 vs. 19 MB/sec). This is a 90 MHz
Pentium on an Intel Plato motherboard. I have a patch (see
http://www.tiac.net/users/rlk/linux.html for details) that uses this
method.
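
Strategies 2 and 3 can be sketched in C roughly as follows. This is an
illustration under stated assumptions, not the code from the patch: the
function names, the 4096 byte chunk size, and the use of GCC-style
inline assembly are all my own choices here. The 32 byte line size is
the Pentium's data cache line size. On compilers or CPUs where the x87
inline assembly isn't available, the FPU copy falls back to a plain
64 bit assignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  32    /* Pentium data cache line size */
#define CHUNK_BYTES 4096  /* assumed chunk size, small enough to stay cached */

/* Strategy 2: read one byte from each destination cache line so the
   line is allocated by the read before the writes arrive.
   Assumes dst is cache-line aligned. */
static void preload_lines(const void *dst, size_t n)
{
    const volatile unsigned char *p = dst;

    for (size_t off = 0; off < n; off += LINE_BYTES)
        (void)p[off];     /* volatile read, so it is not optimized away */
}

void copy_with_preload(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < CHUNK_BYTES ? n : CHUNK_BYTES;

        preload_lines(d, chunk);   /* the extra read */
        memcpy(d, s, chunk);       /* writes now hit the cache */
        d += chunk;
        s += chunk;
        n -= chunk;
    }
}

/* Strategy 3: move 64 bits per instruction through the FPU.
   fild loads a 64 bit integer exactly; fistp stores it back and pops.
   Assumes dst and src are 8-byte aligned and do not overlap. */
void copy_qwords_fpu(int64_t *dst, const int64_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        __asm__ volatile ("fildq %1\n\t"
                          "fistpq %0"
                          : "=m" (dst[i])
                          : "m" (src[i])
                          : "st");
#else
        dst[i] = src[i];           /* portable fallback, no FPU involved */
#endif
    }
}
```

A caller would use copy_qwords_fpu(dst, src, nbytes / 8) for a block
whose size is a multiple of 8. In a kernel the FPU state would also have
to be saved and restored around the loop, which this sketch leaves out.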

There are a few high end systems that have a 128 bit wide main memory
bus. I suspect that it would be better to preload the cache on these
systems than to use 64 bit stores, although the numbers suggest it
would be a toss up. If it is a toss up, I would use conventional
32-bit instructions with cache preloading rather than the FPU.

-- 
Robert Krawitz <rlk@tiac.net>           http://www.tiac.net/users/rlk/

Member of the League for Programming Freedom -- mail lpf@uunet.uu.net
Tall Clubs International -- tci-request@aptinc.com or 1-800-521-2512