Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)

Shane R. Stixrud (shane@souls.net)
Sun, 2 May 1999 03:52:39 -0700 (PDT)


I E-mailed Mark Russinovich and copied him on the "[OT] Comments to WinNT
Mag !!" thread and suggested he respond. He sent me his response and
requested that I forward it on to the list. Response below.

---------- Forwarded message ----------
Date: Sun, 02 May 1999 06:13:00 -0400
From: Mark Russinovich <mark@sysinternals.com>
To: "Shane R. Stixrud" <shane@souls.net>
Subject: Re: [OT] Comments to WinNT Mag !! (fwd)

Hi Shane,

Please post my response to the list.

At 01:24 PM 5/1/99 , you wrote:
>
>'asynchron IO'
>--------------
>
>first he claims Linux has only select(), and then he continues to bash
>select(). (without providing measurements or benchmark numbers) Then he
>says that Linux _does_ have asynchron IO events implemented in 2.2 but
>says that they have 'two major limitations'. Both 'limitations' he
>mentions are in fact a pure implementation matter and not a mechanism or
>API limitation. Mark also forgot to mention that Linux asynchron IO is
>superior to NT because we do not have to poll the completion port for
>events, we can have the IO event delivered _immediately_ to the target
>thread (which is preempted by a signal if it's running). This gives more
>flexibility of using asynchron events. (i have pointed out this difference
>to him in private discussions, he left this argument unanswered)
>

Completion ports in NT require no polling and no linear searching - that,
and their integration with the scheduler, is their entire reason for
existence. Also, Linux's implementation of asynchronous I/O only applies to
tty devices and to *new connections* on sockets - nothing else. Sure,
asynchronous I/O can be added to the rest of the I/O architecture (all of
the deficiencies I bring up can, and I'm sure will, be addressed). My point
is that it is currently very limited.

>'overscheduling'
>----------------
>
>here he again forgets to _prove_ that overscheduling happens in Linux.
>Measurements have been done on big busy Linux webservers (much bigger than
>the typical 'enterprise' category), and the runqueue length (number of
>threads competing for requests) was 3-4 typically. Enuff said ...
>

In high-load environments even the short run-queue lengths you refer to
are enough to degrade performance. And in the environments I'm talking
about where there are several hundred requests being served concurrently,
the run queue lengths for Linux are significantly higher with the
implementation of a one-thread-to-one-client server model.

>'kernel reentrancy'
>-------------------
>
>his example is a clear red herring. If any Linux application is
>read()/write() intensive to the page cache, it should better use mmap(). I
>can understand Mark did not mention mmap(), NT has a rather inferior
>mmap() implementation. (eg. read()/write() and mmap()-ed modifications
>done to the same file are not guaranteed to be data-coherent by NT ...)
>His threading point is correct, there is still code left to be threaded
>for SMP operation. Just as NT has one single big lock in its networking
>stack in NT4 SP4. (only SP5 has fixed this, which is not yet out of the
>beta status.)
>

First, serialization of long paths through the kernel degrades
multiprocessor scalability - this is multiprocessing 101.

You mention mmap, and I'm assuming you do so as an alternative to sendfile.
Using mmap to serve files, the following is required:

- the file is mapped with a call to mmap(). The kernel must manipulate the
page tables of the process performing the map.
- the process calls writev() to send an HTTP header in one buffer and file
data from the mapped memory. This is another system call and two copies.

There are 1-3 system calls (depending on whether the requested file has
already been mapped, or another file must be unmapped to make room for the
new mapping via mmap), 2 buffer copies, and manipulation of the process
page tables. The process must also manage its own file cache, unmapping and
mapping files as needed. The file system is also performing the same
management of the file system cache.
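
As a rough illustration of the mmap() path described above, here is a
minimal C sketch. It is not taken from any particular server: the helper
name serve_with_mmap is hypothetical, the open()/fstat()/close() calls are
extra work that a server keeping its own cache of mappings would avoid, and
a real server would loop on short writes. It only shows where the mmap(),
writev() and munmap() calls sit.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send `header` followed by the contents of `path`
 * to `out_fd` using mmap() + writev().
 * Returns the number of bytes written, or -1 on error. */
ssize_t serve_with_mmap(int out_fd, const char *header, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    /* Map the file: the kernel has to set up page-table entries
     * for this process. */
    void *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED)
        return -1;

    /* Gather the header buffer and the mapped file data into one call. */
    struct iovec iov[2];
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = data;
    iov[1].iov_len  = len;

    ssize_t n = writev(out_fd, iov, 2);

    /* Unmap to make room for other files (the third possible call). */
    munmap(data, len);
    return n;
}
```

A caller would pass a connected socket as out_fd, for example
serve_with_mmap(client_fd, "HTTP/1.0 200 OK\r\n\r\n", "index.html"),
where client_fd and the file name are placeholders.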

BTW This isn't related to read-only file serving, but Linus admits that
mmap in 2.2 has a flaw where write-backs to a modified file result in two
copies instead of 1. He says that this will probably be fixed in 2.3.x.

On the other hand, Sendfile on NT, Solaris, HP/UX and AIX is used as follows:

- one call to sendfile() is made, and the call specifies buffers that
serve as a prologue (e.g. an HTTP header) and epilogue to the file data, in
addition to a file handle. The TCP stack sends the file data directly from
the file system cache as a 0-copy send. The user buffers are also sent with
the file data, and are not copied from user space, but locked into physical
memory for the duration of the send.

This implementation has 0 buffer copies and requires 1 system call to send
an entire HTTP response. There is no manipulation of process address space,
and the server need not manage its own file cache. In addition, the call
can be made asynchronously, where waiting is done on a completion port that
is waiting on new connections and more requests on existing connections.
The asynchronous I/O model in NT extends to all I/O. NT (and Solaris,
HP/UX, AIX) also have another API that Linux doesn't have yet: AcceptEx
(the name of the NT version). This API is used to simultaneously perform an
accept, the first read(), and getpeername() in one system call. The advantages
should be obvious.

As for the Linux implementation of sendfile(), it does not support adding a
header and the Linux TCP stack does not support 0-copy sends. Thus, there
is an extra system call and buffer copy for a write() to send the header,
and an extra buffer copy for sending the file.

>'sendfile'
>----------
>
>sendfile() is a new system call. The copying problem he noticed is true,
>but it's a matter of the networking code, not some conceptual problem with
>sendfile(). If the networking code does zero-copy then sendfile() will do
>zero-copy as well. (without the user ever noticing) sendfile() will
>certainly be further optimized in 2.3.
>

Just to clarify, the Linux TCP/IP stack does not support 0-copy sending.
See tcp_do_sendmsg() in net/ipv4/tcp.c. Note the calls to
xx_copy_from_user() (the copy functions are macros defined in the
architecture-specific include file uaccess.h).

Like I said, I'm sure that over time the Linux problems will be fixed, but
my article was about the state of Linux *today*, not next year or the year
after.

>in private discussions with Mark i have pointed out most of these
>counter-arguments, which he unfortunately failed to answer. He also didn't
>answer my questions about NT's shortcomings in the above areas. (as
>always, seemingly powerful concepts can often open up ugly ratholes)
>Different OS, different approach. Let the numbers talk.
>

I try to answer all e-mail that raises technical issues. If I failed to
answer yours, Ingo, then it was simply because I was too busy.

-Mark

Mark Russinovich, Ph.D.
NT Internals Columnist, Windows NT Magazine
http://www.winntmag.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/