Re: unusual startup messages

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 1 Nov 1996 22:32:47 +0200 (EET)


On Fri, 1 Nov 1996, Craig Milo Rogers wrote:
> >Yes, you can do it in user space, but your performance will suck unless you
> >ignore security issues or have special hardware (and the special hardware
> >would essentially have to do 90% of the stuff we do in kernel space now: I'm
> >talking _really_ special hardware).
>
> Actually, the hardware needn't be that special. It mainly
> needs to identify "normal" incoming TCP and UDP packets, and store
> them (via DMA or shared memory) directly into buffers that are mapped
> into the corresponding user processes; this may be done with a
> high-speed state machine and socket lookup hash table.

I disagree.

Yes, you can get _some_ networking going that way, but you sure as hell
aren't going to get UNIX semantics. Another favourite pastime of
microkernels: "we can't do that efficiently, so let's change the rules and
_then_ we can show how much faster we really are".

Think of this very simple program:

	/* fd 0 is a TCP socket */
	read(0, buffer, 1);
	if (buffer[0] == 'a')
		execve("/some/fine/program",..);

/some/fine/program:

/* expect to be able to read the rest of the TCP stream */
read(0, new_buffer, 100);

Now, it doesn't _matter_ if the card knows TCP/IP and can demultiplex the
packets and in general be unbelievably clever. UNIX semantics require that
the first process read exactly _one_ byte, and the new process (that didn't
even exist when the TCP packet came in) will be able to read the rest.

"Directly storing them into the user process" doesn't even come _close_ to
what the hardware has to know about.
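
To make the requirement concrete, here is a minimal sketch of the same
situation. It is an illustration under assumptions, not a real TCP test: an
AF_UNIX stream socketpair() stands in for the TCP socket, fork() stands in
for execve() (both keep the descriptor open), and the function name
rest_of_stream_survives is made up for this example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a second process sees exactly the bytes the first left unread.
 * Assumption: an AF_UNIX stream socketpair stands in for the TCP socket,
 * and fork() stands in for execve(); both keep the descriptor open. */
int rest_of_stream_survives(void)
{
    int sv[2];
    char first;
    pid_t pid;
    int status;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 0;

    /* The "network" delivers one chunk before the second process exists. */
    if (write(sv[1], "abcdef", 6) != 6)
        return 0;

    /* First process consumes exactly one byte of that chunk. */
    if (read(sv[0], &first, 1) != 1 || first != 'a')
        return 0;

    pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        /* New process: expects the rest of the stream, nothing lost. */
        char rest[100];
        ssize_t n = read(sv[0], rest, sizeof rest);
        _exit(n == 5 && memcmp(rest, "bcdef", 5) == 0 ? 0 : 1);
    }

    close(sv[0]);
    close(sv[1]);
    if (waitpid(pid, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The unread five bytes live in the kernel's socket buffer, not in either
process, which is why the second process can pick them up.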

The card needs to buffer the packets indefinitely (maximum memory
requirements: max_nr_of_tcp_connections*TCP_WINDOW_SIZE - we're talking at
least hundreds of kB, probably a few MB here, and that's just for incoming
data), and it needs to be able to store partial packets too.
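
As a rough worked version of that bound (the connection counts and window
size used below are illustrative assumptions, and rx_buffer_bytes is a
made-up helper name):

```c
#include <stddef.h>

/* Worst-case receive buffering on the card: every connection is holding
 * a full window of data that nobody has read yet.
 * Both arguments are illustrative assumptions, not measured values. */
size_t rx_buffer_bytes(size_t max_connections, size_t window_bytes)
{
    return max_connections * window_bytes;
}
```

With a 32 kB window, 16 connections already need 512 kB
(rx_buffer_bytes(16, 32768) is 524288) and 64 connections need 2 MB
(rx_buffer_bytes(64, 32768) is 2097152).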

> It is also
> desirable to have the hardware enforce certain security and sanity
> checks on outgoing packets; this can be done with a template
> mechanism. Finally, it is desirable to have the hardware calculate
> the TCP/UDP checksums, since there's no longer a kernel-level copy in
> which to hide the calculation.

The hardware also has to do retransmission and TCP sequence numbers: you
can have multiple processes writing data to the same TCP connection, and if
the only common point is the hardware, then the hardware has to do all the
sequencing.

Of course, you can argue that you can have multiple processes sharing the
same "TCP description" in shared memory, and then they can argue over the
description among themselves and only give the hardware a "fait accompli".
But then all the user processes have to be nice about it, and you can't have
disagreements (or you'll end up with incorrect sequence numbers). And the
only way you can guarantee that processes are nice about it is by protecting
this shared memory region some way. Voila - you're back at a kernel.
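
A minimal sketch of that shared "TCP description", assuming C11 atomics and
modelling only the send sequence number. The struct and function names
(tcp_desc, reserve_seq, clobber_seq) are hypothetical, invented for this
illustration:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared "TCP description"; only the send sequence
 * number is modelled here. */
struct tcp_desc {
    _Atomic uint32_t snd_nxt;   /* next sequence number to hand out */
};

/* A cooperative writer reserves [start, start + len) before sending.
 * Returns the first sequence number of its range. */
uint32_t reserve_seq(struct tcp_desc *d, uint32_t len)
{
    return atomic_fetch_add(&d->snd_nxt, len);
}

/* A writer that is not "nice" can simply overwrite the field; nothing
 * in plain shared memory prevents it. */
void clobber_seq(struct tcp_desc *d, uint32_t bogus)
{
    atomic_store(&d->snd_nxt, bogus);
}
```

As long as everybody calls reserve_seq() the ranges never overlap. One call
to clobber_seq() and the next reservation reuses sequence numbers that
another process already sent, which is exactly the disagreement that needs
protection to rule out.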

Of course, that's another thing some "research" projects do: they assume that
all user programs are nice and you don't need any kernel protection. Yeah,
sure.

> The hardware functions correspond to part of the "fast
> path" of a high-performance Internet stack. The rest of the Internet
> stack can be implemented in user space.

Nope.

You can (reasonably) trivially implement an N:1 mapping of TCP streams to
processes. That's not hard. But implementing an N:M mapping of TCP streams
that are used by multiple processes is not just a simple user space
implementation. All the user spaces have to know about _other_ user spaces
writing to the same stream (or reading from the same stream), and agree on
who uses what write sequences or who reads what packets. Otherwise you'll
have chaos.

> The Netstation and Atomic-2 projects at ISI believe it is
> possible. (Netstation is primarily directed at Internet-addressable
> peripherals; think of your processor, display adaptor, and disks
> as each having their own IP addresses. Atomic-2 has been investigating
> user-level protocol APIs.)

Do they also consider UNIX semantics? Or are they another of those
"specialized" systems that don't care about little things like that?

(Yes, I despise research projects that show good numbers, and then it turns
out they show good numbers for doing something much more limited than the
real world. That's not science, that's just bad research and doctoring your
numbers).

> I'm not really up on the status of these projects, but I
> believe that Atomic-2 has demonstrated (non-IP) user-level protocol
> stacks operating in excess of 200 Mbps on Sun SPARC-20/71s. It is
> believed (but has not, to my knowledge, been demonstrated) that the
> same performance can be obtained for TCP- and UDP-based stacks.

Did you know that it's possible to travel faster than light?

Take a (LARGE, POWERFUL) flashlight, turn around in a circle real quick and
shine the flashlight outwards while turning. Wait a year, and take a look at
where the light is. It's out there, revolving around you at a distance of one
lightyear, and it's making a complete circle in less than a second (assuming
you turned around quickly enough). Wow! The lightspot is moving much faster
than light!

Now, bring in a physicist, and he'll start crying when you tell him the
story. Yes, the lightfront is "kind of" moving faster than lightspeed, but
no _information_ is moving that fast.

Now, the same is true of some of these research projects: they "kind of" do
the same thing as a real operating system, and they can even do it faster.
But in the end it's not really the same thing at all.

(I agree, extremely bad analogy. Sorry about that. My point is just that
there is "networking" and there is "networking", and they don't necessarily
mean the same thing. I do not believe you can do UNIX networking in user
space without unrealistically clever hardware or some unrealistic definition
of "user space" ;)

Linus