Re: User-space Networking (Was: unusual startup messages)

Ingo Molnar (mingo@pc5829.hil.siemens.at)
Sat, 2 Nov 1996 12:09:01 +0100 (MET)


On Fri, 1 Nov 1996, Craig Milo Rogers wrote:

> >> >Yes, you can do it in user space, but your performance will suck unless you
> ...
> >> Actually, the hardware needn't be that special. It mainly
> >> needs to identify "normal" incoming TCP and UDP packets, and store
> ...
> >Yes, you can get _some_ networking going that way, but you sure as hell
> >aren't going to get UNIX semantics. Another favourite pastime of
> ...
>
> You have raised some very cogent points. To save space, I
> have not quoted your message in detail. Instead, I'll focus on some
> overall application and OS design issues involved (and that way I'll
> also feel a little better about using bandwidth on linux-kernel).
>
> The first question to ask is, "Why would I want a user-level
> TCP/IP implementation?" Some reasonable motivations are: "It will
> increase system robustness", "It will increase system security", and
> "It will increase system performance". Of course, whenever you select
> one of these motivations, you have to demonstrate that you actually
> meet the stated goal.

The central concept is 'control'.

We have physical devices which just don't know who uses them. So we either
have to give up control over the actual device, or we have to define some
kind of >enforced< semantics for the device.

It is very, very obvious that the concept of 'control' inevitably leads to
an object equivalent to what we commonly call the 'Linux kernel'.

Whatever you do, you only implement this control. If you make 'the
hardware smarter', you just add ways to implement part of the semantics
(i.e. the control) in the hardware. But there is no theoretical difference.
( well, the hardware is much, much harder to fix, performance will
most probably suck anyway, and development will be much slower, but hey,
at least it will cost a lot of money )

If you give up control over the physical world, you >can< achieve better
performance. But don't forget that Linux performance >is< very good. We are
in the kernel in less than 100 cycles on a Pentium, and out of the
kernel in some 50 cycles. >this< is awesome real-life performance, with full
control and full Unix semantics.

If you need a marginally faster Web server that mucks with the networking
card directly, then you first make a policy decision that you >trust< all
bits of your server. I.e., you make the server part of your kernel. That's
already possible; the API is a bit different inside the kernel, but it's
not inherently hard. Of course, non-HTML applications will suffer some
performance degradation, and you will end up putting all of your
user-space code into the kernel: you've invented DOS.

Control has its price. Under Linux you can make mistakes. The kernel is
bulletproof, and this has a price. And Linux has already hit some
physical barriers, the exception code for example :) I don't think you can
do better than zero-cost exception handling =P

[ OK, we actually pay a price for exception handling, because the CPU
already implements part of the control: it uses paging, and we happily
pay the price with occasional TLB cache misses ]

> I'll select "It will increase system performance", in a
> specific context: a dedicated Web/FTP/NFS/DNS server. In the
> hypothetical dedicated Web server process, for instance, I'll run
> special code whose purpose is to grab HTML data from storage and shove
> it out the net as efficiently as possible. In this instance,
> maintaining compatibility with the existing Unix networking API is not
> terribly important, and may be sacrificed in the interests of overall
> performance.

The kernel API itself doesn't currently really support putting applications
into the kernel directly, but allowing processes into ring 0 looks like
a good idea. This again is a system policy decision, which has to be
enabled by the system owner (and which opens up a new set of security
bugs, but you wanted a bit more performance in exchange for a lot less
control).

With processes in ring 0 you could hook your process into the beginning
of the networking stack, and do a zero-delay
'strcmp(inpacket, "<head>")' on incoming packets.
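No such hook API exists; a hypothetical sketch of the check itself (the name `html_rx_hook` and the raw-payload calling convention are invented for illustration, and memcmp() is used because packet payloads are not NUL-terminated):

```c
/* Hypothetical early-receive predicate, invented for illustration:
 * a ring-0 process would register something like this to inspect
 * each raw payload before the networking stack proper runs.
 * Here it just checks whether the payload starts with "<head>". */
#include <string.h>
#include <stddef.h>

/* Return nonzero if this is a packet we want to steal from the stack. */
static int html_rx_hook(const unsigned char *payload, size_t len)
{
    static const char tag[] = "<head>";

    /* memcmp, not strcmp: packet payloads carry no terminating NUL. */
    return len >= sizeof(tag) - 1 &&
           memcmp(payload, tag, sizeof(tag) - 1) == 0;
}
```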

Of course, if you want to stay a general and fair application, you will
end up doing something like what we already do.

> "But wait", you may reply. "A Unix system, even a dedicated
> Web server, must run dozens of complicated network-dependent
> processes, and you probably wouldn't have resources to rewrite them
> all!" This is an entirely valid objection, and I must address it.

oh, come on!

> One approach is to implement user-level TCP/IP in the
> dedicated Web/FTP/NFS/DNS server applications while retaining the
> kernel-level TCP/IP for use by more general Unix networking processes.
> Admittedly, now you have *two* (or more) independent TCP/IP stacks to
> maintain, and it doesn't satisfy the goal of the initiator of this
> discussion, which is to question whether the TCP/IP stack needs to be in
> the kernel at all. Nonetheless, this approach will probably meet my
> goal of creating a very high performance Web/FTP/NFS/DNS server.

Sure. I think we should implement unswappable ring 0 processes in 2.1,
and should create a cross user-space/kernel-space API to muck with kernel
policies in a straightforward way. Are there any pitfalls with this
approach? I can't seem to find any. [hey, does this make sense? :)]

> Another approach is to implement user-level TCP/IP in the
> dedicated server processes, and have a user-level TCP/IP process act
> as a daemon (a la kerneld) for all other network-using processes. It

Argh. No. You can't really handle interrupts well from user space
currently. And if you want to make it really fast, you will want to
access the IDE ports directly too.

Oops, wait! Haven't we ended up inventing DOS again, or reimplementing
the Linux kernel?

> is fairly easy to imagine how to implement this and preserve full Unix
> semantics for the processes that require them. The cost is fairly
> high, of course, due to the multiple user/kernel/user context switches
> required to pass data and control between processes. However, *if*
> the bulk of system processing is in the dedicated server processes,
> which are not subject to these context switches, then the increased
> overhead in the other processes may be negligible to the system as a
> whole.
>
> Another approach for supporting the "general Unix networking"
> processes is to use a modified libc that intercepts the network calls
> and uses, say, shared memory, to communicate with a separate (thus
> crash-resistant) TCP/IP daemon process. It may be possible this way
> to avoid some unwanted data copies, although there may still be an
> inter-process IPC call needed to replace many of the kernel calls
> of a "normal" Unix implementation; as the literature of microkernels
> demonstrates, it is difficult to do this efficiently.

Physical fact: kernel entries are >much< cheaper than context switches
that change the page table layout. So instead of going through a single
user-space daemon, you will get much better performance by mmap()-ing your
goddam HTML file and sending that data with write() through a TCP socket.
No copy (except the inevitable checksumming copy) ...

[ btw: has anyone ever done a 'checksum cache'? We could do per-sector or
per-block checksumming, and since the checksums are additive, we only
have to recalculate the header checksum, add it to the cached checksum,
send out the header and DMA the buffer into the networking card ... zero
copy. This makes sense for mostly read-only stuff like NFS servers or
Web servers, and with smart multi-buffer-DMA PCI networking cards ]
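The additivity the bracket relies on can be demonstrated with a simplified Internet checksum (RFC 1071 style); these `csum_partial`/`csum_fold` functions are plain-C stand-ins mimicking, not taken from, the kernel's optimized versions:

```c
/* The Internet checksum is a ones'-complement sum of 16-bit words,
 * so the partial sum of a whole buffer equals the sum of the partial
 * sums of its blocks (as long as each block starts on an even byte
 * boundary, which IP/TCP headers do).  A 'checksum cache' would store
 * csum_partial() per data block and only recompute the header part. */
#include <stdint.h>
#include <stddef.h>

/* 32-bit running ones'-complement partial sum over a buffer. */
static uint32_t csum_partial(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len)                       /* odd trailing byte, padded with 0 */
        sum += (uint32_t)buf[0] << 8;
    return sum;
}

/* Fold a 32-bit partial sum into the final 16-bit checksum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Combining a freshly computed header sum with a cached body sum is then just `csum_fold(csum_partial(hdr, hlen) + cached_body_sum)`, with no second pass over the data.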

> I have used the goal of a "high-performance Web/FTP/NFS/DNS
> server" in my discussion above. I believe I could provide similar
> arguments in favor of a user-space TCP/IP implementation for the goals
> of increasing system robustness or security. Of course, what I have
> provided as "proof" is merely hand-waving; the success of the Internet
> has been based upon performing concrete experiments rather than
> gedanken ones. Nonetheless, I hope that I have supported the point
> that the proper analysis is to ask whether the (purported) benefits of
> a user-level TCP/IP implementation do indeed outweigh the requisite
> costs in a system context, rather than to dismiss the concept as
> infeasible based upon cost alone.

The only decision you can make is how much control you want to have
over the physical wire in this case. This is the >only< limiting factor,
not Linux's design. Linux is already very close to the physical limits.
[not in all corners of the API ... thank God, I must say! :))]

-- mingo