Re: Speed of memcpy, csum_partial and csum_partial_copy

Tom May (ftom@netcom.com)
Wed, 12 Jun 1996 13:29:17 -0700


>My other point is that csum_partial_copy looks like it could run 33%
>faster, when everything is in the CPU cache, by rearranging the
>instructions in the loop to pair fully -- I think that adcl can pair but
>only in the U pipe. (I could be wrong about this. If adcl can pair
>anywhere, there are still write-then-read dependencies that prevent
>pairing in that code).

Yes, adcl has to go in the U-pipe. In theory the loop could go faster
by using another register to make everything pair nicely:

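movl (%%esi), %%ebx      /* load word 0 */
movl 4(%%esi), %%ecx     /* load word 1 */
adcl %%ebx, %%eax        /* add word 0 into the checksum, with carry */
movl %%ebx, (%%edi)      /* store word 0 (movl leaves the carry flag alone) */
adcl %%ecx, %%eax        /* add word 1 into the checksum, with carry */
movl %%ecx, 4(%%edi)     /* store word 1 */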

But theory doesn't mean diddly here. Do it and time it. Ideally,
%ebp would be available as a spare register, but that assumes
-fomit-frame-pointer, and I may have run into some other problems
when I tried it.
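
For the "do it and time it" part, a minimal sketch in portable C of a
reference routine and a crude timer might look like the following.
Everything in it is an assumption for illustration: ref_csum_copy(),
time_csum_copy() and the csum_copy_fn type are hypothetical names, not
kernel interfaces, and the reference assumes the length is a multiple of
8 bytes. It mirrors the two-temporary loop above so that a rewritten
assembler loop can be checked against it before any timings are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the reference and any loop under test. */
typedef uint32_t (*csum_copy_fn)(const void *src, void *dst,
                                 size_t len, uint32_t sum);

/*
 * Reference copy-and-checksum (an illustration, not the kernel routine).
 * Copies len bytes and returns the 32-bit ones'-complement sum of the
 * copied words added to sum.  Assumes len is a multiple of 8.
 */
uint32_t ref_csum_copy(const void *src, void *dst, size_t len, uint32_t sum)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t acc = sum;

    assert(len % 8 == 0);
    for (size_t i = 0; i < len; i += 8) {
        uint32_t a, b;              /* play the roles of %ebx and %ecx */

        memcpy(&a, s + i, 4);       /* load word 0 */
        memcpy(&b, s + i + 4, 4);   /* load word 1 */
        acc += a;
        memcpy(d + i, &a, 4);       /* store word 0 */
        acc += b;
        memcpy(d + i + 4, &b, 4);   /* store word 1 */
    }
    /* Fold the carries back in (end-around carry). */
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Crude timer: seconds of CPU time for reps calls of fn on one buffer. */
double time_csum_copy(csum_copy_fn fn, const void *src, void *dst,
                      size_t len, int reps)
{
    volatile uint32_t sink = 0;     /* keeps the calls from being dropped */
    clock_t t0 = clock();

    for (int r = 0; r < reps; r++)
        sink = fn(src, dst, len, 0);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Run time_csum_copy() on the old and new loops with the same buffers,
small enough to stay in the CPU cache, and compare. clock() is coarse,
so use plenty of repetitions.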

Some history: csum_partial_copy() didn't always exist. In the
beginning (well, of 1.3 anyway) there was csum_partial_copy_fromuser(),
which used an inner loop similar to the above. But the fs: segment
overrides botched the pipelining beyond redemption (six instructions,
four of which must go in the U-pipe, leaving two to overlap and two
empty V-pipe positions), so I tweaked it to its present form, which
doesn't require push/pop of %ecx. Then csum_partial_copy() was
(apparently) derived from csum_partial_copy_fromuser() by the obvious
simplification of omitting fs:, without regard to Pentium pipelining.
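
The slot counting in the parenthesis above can be sketched with a
deliberately simplified model, assuming each cycle issues one U-pipe
instruction plus at most one V-pipe instruction, and ignoring
dependencies and instruction order. min_cycles() and empty_v_slots()
are hypothetical helper names, and the result is only a lower bound,
not a prediction.

```c
#include <assert.h>

/*
 * Simplified pairing model (an assumption for illustration only):
 * every cycle holds one U-pipe instruction and optionally one V-pipe
 * instruction, and a U-only instruction needs a cycle of its own.
 * Returns a lower bound on the cycles for one loop iteration.
 */
unsigned min_cycles(unsigned total, unsigned u_only)
{
    unsigned paired = (total + 1) / 2;  /* best case: everything pairs */

    assert(u_only <= total);
    return u_only > paired ? u_only : paired;
}

/* V-pipe positions left empty when running at that lower bound. */
unsigned empty_v_slots(unsigned total, unsigned u_only)
{
    unsigned cycles = min_cycles(total, u_only);

    /* total - cycles instructions are the ones that fit in the V-pipe */
    return cycles - (total - cycles);
}
```

With 6 instructions of which 4 are U-only, this gives 4 cycles and 2
empty V-pipe positions, matching the count above. With only the two
adcl restricted to the U-pipe, the bound drops to 3 cycles.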

So go for it!

Tom.