> - isnt it mmap that should be used to implement zero-copy
The net code folds copy and checksum so the user->kernel copy is very close
to free (it is free for most people unless there is a lot of bus activity).
(Those who don't feel like having a quick lesson in Sparc assembly
optimization can skip to the end to see why this is so relevant anyways.)
On the Sparc it is more than free, I have found, going by
instruction-level profiling information sampled at 1000 hits/sec
during a 2GB TCP transfer. In cases where the memcpy() code would
completely stall (and thus clear out the entire pipeline) the csum/copy
code is filling the stalls in with "useful" work. This is especially
true with chips which lack a store buffer or, worse, lack
write-allocate on the cache.
Basically what happens (at least on the Sparc) is that a generic fast
memcpy() usually has sequences like:

	ldd	[%src + offset + 0x00], %reg0
	ldd	[%src + offset + 0x08], %reg2
	ldd	[%src + offset + 0x10], %reg4
	ldd	[%src + offset + 0x18], %reg6
	std	%reg0, [%dest + offset + 0x00]
	std	%reg2, [%dest + offset + 0x08]
	std	%reg4, [%dest + offset + 0x10]
	std	%reg6, [%dest + offset + 0x18]

Even if you take no cache misses you stall like crazy because the chip
can't pair any of those instructions (but it is still the fastest way
to copy on the Sparc since there is nothing else to do). So the
pipe is never really going at all, it's always just one instruction
trickling into it and that's it. Add to this that it's really hard to
get passed a src and dest which can both be double word aligned
(ie. !((src ^ dest) & 7)). For the csum_partial_copy() case you also
want to avoid keeping extraneous state around to know that you have
to "oh yeah, swap the low bytes at the end". So what I do on Sparc is
only try to get them both word aligned, which is almost always easy,
and thus the unrolled loops are sequences like the following:

	ldd	[%src + offset + 0x18], %t0
	ldd	[%src + offset + 0x10], %t2
	ldd	[%src + offset + 0x08], %t4
	st	%t0, [%dest + offset + 0x18]
	addxcc	%t0, %accum, %accum
	st	%t1, [%dest + offset + 0x1c]
	addxcc	%t1, %accum, %accum
	st	%t2, [%dest + offset + 0x10]
	ldd	[%src + offset + 0x00], %t0
	addxcc	%t2, %accum, %accum
	st	%t3, [%dest + offset + 0x14]
	addxcc	%t3, %accum, %accum
	st	%t4, [%dest + offset + 0x08]
	addxcc	%t4, %accum, %accum
	st	%t5, [%dest + offset + 0x0c]
	addxcc	%t5, %accum, %accum
	st	%t0, [%dest + offset + 0x00]
	addxcc	%t0, %accum, %accum
	st	%t1, [%dest + offset + 0x04]
	addxcc	%t1, %accum, %accum

The first ldd stalls, but who cares, because we usually stall to bring
in the cache line anyways, so I don't try to pair it with anything.
The second and third ldd eat one cycle each, tops. The first st/addxcc
can stall for the cache line, but if not then it and the second
st/addxcc pair will go in one clock each. The st and ldd that come
next eat one clock apiece, then we get fully pairing st/addxcc's which
do not miss the cache, going all the way to the end. Result? I get
the csum calculation for free, as Alan said. ;-)
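
For those who read C more easily than Sparc assembly, here is a rough
sketch of the same idea: touch each piece of data once, storing it and
adding it into the checksum in the same pass. This is not the kernel's
csum_partial_copy(); the names copy_and_csum(), csum_fold() and
can_coalign() are made up for illustration, the sum is taken over
big-endian 16-bit words and is not inverted, and the loop works a
byte pair at a time for portability, so it shows the single-pass
structure but none of the instruction pairing discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true when src and dst can be brought to the same
 * alignment by skipping the same number of leading bytes.
 * mask is 3 for word (4-byte) and 7 for doubleword (8-byte) alignment. */
static int can_coalign(const void *src, const void *dst, uintptr_t mask)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & mask) == 0;
}

/* Fold the wide accumulator down to 16 bits with end-around carry. */
static uint16_t csum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: copy len bytes from src to dst and return the
 * 16-bit ones'-complement sum of the data (not inverted), in one pass. */
static uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];                            /* the "st" half */
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1]; /* the "addxcc" half */
    }
    if (i < len) {                              /* odd trailing byte */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    return csum_fold(sum);
}

/* Usage example: the copy is exact and the sum matches a hand-computed value. */
static void check_example(void)
{
    const uint8_t src[8] = {0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7};
    uint8_t dst[8] = {0};

    assert(copy_and_csum(dst, src, sizeof(src)) == 0xddf2);
    assert(memcmp(dst, src, sizeof(src)) == 0);
}
```

The two-pass alternative (checksum the buffer, then memcpy() it) reads
the source twice; the fused loop reads it once, which is the whole
point.
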
Relevance:
What can all this crap buy you, and how much? You'd be
surprised. As an example, the localhost TCP bandwidth test of lmbench
reported 4.0mb/s with SparcLinux and 7.0mb/s under Solaris2.5 before I
started hacking the csum code for the Sparc. How does Solaris get
that number?
Well, they csum/copy too, so that helped them (previously my
code did what the Alpha port does now, which is to do the csum and then
call memcpy(), which is very slow). But on top of all that, they
optimize in a ridiculous manner by cutting things off at the IP level:
if Solaris sees that it's TCP going over loopback, it just copies the
data into the receiving user's buffer. Solaris does _zero_ networking
for localhost TCP. So the bw_tcp of lmbench is _really_ measuring how
fast Solaris can memcpy(). ("ooohhh ahhh, it'll make local X clients
faster, ohhh ahh...")
Impossible to get that with Linux doing real networking? I
think not, I'm almost there. As of right this second I'm getting
6.25mb/s with only minor tweaks to the csum/copy code I have. I have
yet to do one last huge optimization to the memcpy() code and one last
minor speed improvement to the csum/copy code, so I think I should be
able to surpass Solaris2.5.
TO REITERATE:
full Linux networking overhead == Solaris2.5 memcpy()
QED
;-)
Makes you wonder why everyone bases their stuff on the Berkeley stack.
It sucks rocks from a performance standpoint, and the speed problems
are inherent in its overall design and structure (read this as: mbufs
are fundamentally flawed)... but we have some buglets left in our
stuff so maybe I should cut myself short right here ;)
Later,
David S. Miller
davem@caip.rutgers.edu