Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000

From: Todd Underwood (todd@osogrande.com)
Date: Thu Sep 12 2002 - 08:57:04 EST


jamal,

> Good work. The first time i have seen someone say Linux's way of
> reverse order is a GoodThing(tm). It was also great to see de-mything
> some of the old assumption of the world.

thanks. although i'd love to take credit, i don't think that the
reverse-order fragmentation appreciation is all that original: who
wouldn't want their data sctructure size determined up-front? :-) (not to
mention getting header-overwriting for-free as part of the single copy.

> BTW, TSO is not a intelligent as what you are suggesting.
> If i am not mistaken you are not only suggesting fragmentation and
> assembly at that level you are also suggesting retransmits at the NIC.
> This could be dangerous for practical reasons (changes in TCP congestion
> control algorithms etc). TSO as was pointed in earlier emails is just a
> dumb sender of packets. I think even fragmentation is a misnomer.
> Essentially you shove a huge buffer to the NIC and it breaks it into MTU
> sized packets for you and sends them.

the biggest problem to our approach is that itis extremely difficult to
mix two very different kinds of workloads together: the regular
server-on-the-internet workload (SOI) and the large-cluster-member
workload (LCM). in the former case, SOI, you get dropped packets,
fragments, no fragments, out of order fragments, etc. in the LCM case you
basically never get any of that stuff--you're on a closed network with
1000-10000 of your closest cluster friends and that's just what you're
doing. no fragments (unless you put them there), no out of order
fragments (unless you send them) and basically no dropped packets ever.
obviously, if you can assume conditions like that, you can do things like:
only reassmble fragments in reverse order since you know you'll only send
them that way, e.g.

> In regards to the receive side CPU utilization improvements: I think
> that NAPI does a good job at least in getting ridding of the biggest
> offender -- interupt overload. Also with NAPI also having got rid of
> intermidiate queues to the socket level, facilitating of zero copy receive
> should be relatively easy to add but there are no capable NICs in
> existence (well, ok not counting the TIGONII/acenic that you can hack
> and the fact that the tigon 2 is EOL doesnt help other than just for
> experiments). I dont think theres any NIC that can offload reassembly;
> that might not be such a bad idea.

i've done some reading about NAPI just recently (somehow i missed the
splash when it came out). the two things i like about it are the hardware
independent interrupt mitigation technique and using the DMA buffers as a
receive backlog. i'm concerned about the numbers posted by ibm folx
recently showing a slowdown under some conditions using NAPI and need to
read the rest of that discussion.

we are definitely aware of the fact that the more you want to put on the
NIC, the more the NIC will have to do (and the more expensive it will have
to be). right now the NICs, that people are developing on are the
TigonII/III and, even more closed/proprietary, the Myrinet NICs. i would
love to have a <$200 NIC with open firmware and a CPU/memory so that we
could offload some more of this functionality (where it makes sense).
>
> Are you still continuing work on this?
>

definitely! we were just talking about some of these issues yesterday
(and trying to find hardware sepc info on the web for the e1000 platform
to see what else they might be able to do). patricia gilfeather is working
on finding parts of TCP that are separable from the rest of TCP, but the
problems you raise are serious: it would have to be on an
application-specific and socket-specific basis, so that the app would
*know* that functionality (like acks for synchronization packets or
whatever) was being offloaded.

the biggest difference in our perspective, versus the common kernel
developers, is that we're still looking for ways to get the OS out of the
way of the applications. if we can do large data transfers (with
pre-posted receives and pre-posted memory allocation, obviously) directly
from the nic into application memory and have a clean, relatively simple
and standard api to do that, we avoid all of the interrupt mitigation
techniques and save hugely on context switching overhead.

this may now be off-topic for linux-kernel and i'd be happy to chat
further in private email if others are getting bored :-).

> cheers,
> jamal

t.

-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve neither liberty nor safety." - Benjamin Franklin

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Sep 15 2002 - 22:00:29 EST