Re: Ideas for reducing memory copying and zeroing times

Jamie Lokier (jamie@rebellion.co.uk)
Wed, 17 Apr 96 16:27 BST


>>>>> "Robert" == Robert L Krawitz <rlk@tiac.net> writes:

Robert> The best way that I've found (which is twice as fast as
Robert> anything else) is to use the FPU to zero out pages. I get
Robert> 60 MB/sec throughput that way (15000 pages/sec) vs. 30
Robert> MB/sec by any other way. The problem with memory writes
Robert> from the Pentium is that they only go 32 bits at a time
Robert> unless you're flushing a cached line. Since the Pentium
Robert> cache is not write allocate, if you write to an uncached
Robert> location, it writes through to the location.

Everything I've read indicates that the Pentium will pair two memory
accesses provided the low-order bits of their addresses differ. I
haven't actually timed any code, of course. :-) That gives you 64 bits
per cycle from two paired 32-bit instructions. The FPU memory
instructions aren't pairable, so they can't do any better than that.
Or does the pairing not apply to writes, or at least not to writes to
uncached locations?
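
To make the question concrete, here is a rough, untimed C sketch of
the store pattern I mean: two independent 32-bit stores to adjacent
addresses in each step. The names zero_page_paired and ZP_PAGE_SIZE
are made up for illustration, and the 4096-byte page size is an
assumption. A C compiler is free to schedule the stores however it
likes, so a real test would have to be written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; 4096 bytes is the assumed i386 page size. */
#define ZP_PAGE_SIZE 4096u

/*
 * Zero one page using two 32-bit stores per loop step.  The two
 * stores go to different addresses, which is the condition under
 * which the Pentium is said to be able to pair them.  "volatile"
 * only stops the compiler from turning the loop into a memset call;
 * it does not guarantee any particular instruction pairing.
 */
static void zero_page_paired(void *page)
{
    volatile uint32_t *p = page;
    size_t i;

    /* The stores below assume a 4-byte aligned buffer. */
    assert(((uintptr_t)page % sizeof(uint32_t)) == 0);

    for (i = 0; i < ZP_PAGE_SIZE / sizeof(uint32_t); i += 2) {
        p[i]     = 0;   /* first store of the pair  */
        p[i + 1] = 0;   /* second store, next dword */
    }
}
```

Whether the two stores in each step really retire together, and
whether that still holds when the target line is not in the cache, is
exactly what would need timing.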

Robert> Actually, I don't think a write allocate cache is very
Robert> effective for memory copy and block zero. The reason is
Robert> that write allocate pulls the data in from main memory,
Robert> which is a waste when copying or clearing memory. Better to
Robert> have a Pentium-type cache and use a 64-bit instruction
Robert> (fistpq or fstd or the like).

That does sound like the best idea so far. It would be preferable to
get a cache-line's worth written using a single, pipelined burst.
Unfortunately, I can't think of any way to make that happen.
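
For comparison, a minimal sketch of the 64-bit store approach,
assuming a 32-byte cache line and a 4096-byte page, and assuming that
the double value 0.0 is stored as all-zero bits (true for IEEE 754).
The names zero_page_fpu, ZF_PAGE_SIZE and ZF_LINE_SIZE are invented
here. On a 32-bit x86 target a compiler would typically emit FPU
stores for this, but that is not guaranteed, and nothing in it forces
the four stores of a line to go out as one burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; both sizes are assumptions, not measured. */
#define ZF_PAGE_SIZE 4096u   /* assumed page size            */
#define ZF_LINE_SIZE 32u     /* assumed cache line size      */

/*
 * Zero one page with 64-bit stores, one cache line (four 8-byte
 * stores) per loop step.  Storing 0.0 relies on IEEE 754, where
 * positive zero is represented by all-zero bits.
 */
static void zero_page_fpu(void *page)
{
    volatile double *p = page;
    const size_t per_line = ZF_LINE_SIZE / sizeof(double);
    size_t i;

    /* The stores below assume an 8-byte aligned buffer. */
    assert(((uintptr_t)page % sizeof(double)) == 0);

    for (i = 0; i < ZF_PAGE_SIZE / sizeof(double); i += per_line) {
        p[i]     = 0.0;
        p[i + 1] = 0.0;
        p[i + 2] = 0.0;
        p[i + 3] = 0.0;
    }
}
```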

Are there any bus-mastering DMA devices that could be persuaded to do
zero-filling without using the CPU? Is there any PCI device that could
do it?

-- Jamie Lokier