Re: Speed of memcpy, csum_partial and csum_partial_copy

Robert L Krawitz (rlk@tiac.net)
Sat, 8 Jun 1996 17:00:41 -0400


Date: Sat, 8 Jun 1996 20:21:15 +0300 (EET DST)
From: Linus Torvalds <torvalds@cs.helsinki.fi>

643 23.78% 00191324 csum_partial_copy_fromuser
997 36.88% 001369c8 memcpy_toiovec
2703 100.00% 00000000 total

That's very, very interesting. Somehow the checksum routine was
faster than the raw memcpy routine. There's nothing in the code to
indicate to me why that should be the case. Possibly most of this
stuff was 2-byte aligned, and the checksum routine handles it properly
while memcpy doesn't? Looks like something I should fix in my memcpy
routine. I would guess that memcpy_toiovec is taking about 60% longer
than necessary here.
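
The alignment guess above can be made concrete with a small sketch. This is not the kernel's memcpy; `copy_aligned` is a hypothetical name, and the 4-byte word size is an assumption chosen for a 32-bit x86. It copies single bytes until the destination is word-aligned, moves whole words, then finishes the tail, which is roughly what a routine that "handles it properly" would have to do for 2-byte-aligned buffers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only, not the kernel's memcpy. */
void *copy_aligned(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Body: one 32-bit word per iteration. The fixed-size memcpy calls
     * keep the code well defined even if the source is misaligned. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 4;
        s += 4;
        len -= 4;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

Whether this helps on a given CPU depends on how expensive misaligned word stores are there; the sketch only shows the shape of the fix.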

In short, the two copies that occur in TCP loopback (first from the
sender into the kernel, and then from the kernel into the receiver) alone
account for 60% of the TCP stack.

The checksum routine is crying out for an implementation in the FPU.
Unfortunately, I don't see a way to do it offhand since the checksum
routine is specified as a sum of 4-byte words rather than 8-byte
(which, for that matter, the FPU couldn't really handle either). I
haven't seen the MMX spec, but I wouldn't be the least bit surprised
if we can do something very nice on the P55C.
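
For reference, here is a minimal sketch of the copy-and-checksum idea in portable C. It is not the kernel's csum_partial_copy; the names `csum_and_copy` and `csum_fold` are hypothetical. It follows the RFC 1071 definition (a one's-complement sum of 16-bit words). That sum may be accumulated in a wider register and folded afterwards, which is why implementations can add larger words at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes from src to dst and return the
 * running sum of the data taken as big-endian 16-bit words, starting
 * from the partial sum passed in. */
uint32_t csum_and_copy(void *dst, const void *src, size_t len, uint32_t sum)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t acc = sum;

    while (len >= 2) {
        acc += ((uint32_t)s[0] << 8) | s[1];   /* add one 16-bit word */
        d[0] = s[0];                            /* copy in the same pass */
        d[1] = s[1];
        d += 2;
        s += 2;
        len -= 2;
    }
    if (len == 1) {
        acc += (uint32_t)s[0] << 8;             /* odd byte, zero padded */
        d[0] = s[0];
    }

    /* End-around carry: fold the 64-bit accumulator back into 32 bits. */
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Fold a 32-bit partial sum to 16 bits and complement it. */
uint16_t csum_fold(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7), the partial sum is 0x2ddf0 and the folded checksum is 0x220d. The point of merging the two loops is that the data is read from memory only once.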

(This is on a P166 with a reasonably good memory subsystem, and the
machine was sending 500MB of data over TCP loopback).

I presume that means EDO RAM. If so, I'm guessing that
csum_partial_copy_fromuser was running at about 45 MB/sec, and hence
your overall throughput was something like 11 MB/sec?
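
The arithmetic behind that estimate, as a small sketch. The function name is hypothetical and the 45 MB/sec figure is only the guess made above. Every byte passes through the routine once, so if the routine accounts for a fraction f of the total time, the end-to-end rate is the routine's rate multiplied by f.

```c
/* Hypothetical helper: estimate end-to-end throughput from one
 * routine's own copy rate and its share of the profile ticks. */
double overall_rate_mb_s(double routine_rate_mb_s,
                         unsigned routine_ticks, unsigned total_ticks)
{
    return routine_rate_mb_s * (double)routine_ticks / (double)total_ticks;
}
```

Calling `overall_rate_mb_s(45.0, 643, 2703)` with the tick counts from the profile gives about 10.7 MB/sec, which is where the "something like 11 MB/sec" comes from.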

-- 
Robert Krawitz <rlk@tiac.net>           http://www.tiac.net/users/rlk/

Member of the League for Programming Freedom -- mail lpf@uunet.uu.net
Tall Clubs International -- tci-request@aptinc.com or 1-800-521-2512