Re: Fw: Some very thought-provoking ideas about OS architecture.

shapj@us.ibm.com
Mon, 21 Jun 1999 13:41:22 -0400


>There seems tremendous vagueness in this. ... I guess all its
>TCP connections will suddenly show an error and die, and the software
>must reconnect to the rest of the universe. Right? What then happens to
>other stateful external interactions? Do you have some mechanism that
>will cause all external stateful activities to die, so they don't
>continue in an undefined way?...

Apologies if I was vague. Let's try for more clarity this time:

On restart, EROS ensures that all capabilities to external connections,
including devices and network connections, are rescinded before any application
is permitted to execute. Invocations on a rescinded capability generate a
well-defined return value that can be used by the application to detect that the
capability is no longer valid. Thus, there is no possibility of replay to
devices unless the application goes to considerable lengths to reopen the device
and replay.
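
As a rough illustration of the behavior described above, here is a toy model in
Python. It is a sketch under stated assumptions, not EROS code: the names
Capability, RC_OK, RC_INVALID, restart and send are all hypothetical, and
RC_INVALID merely stands in for the "well-defined return value".

```python
# Toy model only. Every name here is hypothetical; none of this is the EROS API.
RC_OK = 0
RC_INVALID = 1  # stand-in for the well-defined "capability rescinded" result


class Capability:
    def __init__(self, target, external):
        self.target = target
        self.external = external  # True for devices and network connections
        self.rescinded = False

    def invoke(self, request):
        # A rescinded capability never reaches its target.
        if self.rescinded:
            return RC_INVALID, None
        return RC_OK, f"{self.target} handled {request}"


def restart(capabilities):
    """Rescind every external capability before any application runs."""
    for cap in capabilities:
        if cap.external:
            cap.rescinded = True


def send(cap, request):
    """Application-side check: detect the rescinded capability, do not replay."""
    code, reply = cap.invoke(request)
    if code == RC_INVALID:
        return "connection lost; must reopen explicitly"
    return reply


net = Capability("tcp-connection", external=True)
store = Capability("local-object", external=False)
restart([net, store])
print(send(net, "write"))   # the external capability was rescinded
print(send(store, "read"))  # the persistent, internal capability still works
```

The point of the sketch is that replay to a device can only happen if the
application deliberately reopens the device after seeing the invalid result.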

In general, where external devices are concerned, there exists no way to tell
them to stop operating unless hardware makes a provision for an external reset
line that can be activated on restart. There is a consistency issue inherent in
the possibility of any system crash that has nothing to do with whether one
machine or the other is persistent.

> Persistence is a curse, not a benefit, when applied to these things.

We haven't found that to be true, and I'd be interested to learn why you feel
it is so. To the application, things appear no different than if the connection
to the device was severed by (e.g.) a network failure.

In any case, this sidesteps the real question. The question is: given that both
persistence and non-persistence have pros and cons, which is to be preferred as
the default mode of behavior? The answer depends heavily on the overall system
architecture; simply grafting one answer or the other onto an existing system
won't do. EROS has obviously taken the view that persistence is our preferred
default for reasons of consistency management (which I outlined in one of my
early notes). I think our benchmarks suggest that the "best" resolution is not
at all obvious, and may even lean in favor of persistence.

>This description is very write oriented. What about reads? In a real
>world situation the machine doesn't just sit there processing, changing
>its data, and needing writes to make it persistent. Most applications
>manage a large data set, and are forever hopping around that data.

I don't know that it's "most" applications, but certainly it's true for many
very important ones.

The answer to your question is mildly involved.

First, the clustered allocation strategy really does work. In practice, it
allocates tracks rather than objects. On modern drives the track to track time
is the same as the head to head time, so the benefit of sequential heads vs.
localized tracks becomes very tricky to understand, especially in the face of
drive-level caching. The old UNIX block allocation models are totally
inadequate. On this basis, I'd say that our default behavior is about as good
as one can get from an extent-based file system where extent==one track.
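
To make "allocates tracks rather than objects" concrete, here is a minimal
sketch in Python. It assumes a fixed number of sectors per track and objects
that never exceed one track; TrackAllocator and its fields are hypothetical
names, and the real EROS allocator may differ in every detail.

```python
# Toy allocator: space is reserved a whole track at a time, and objects are
# packed sequentially inside the current track. Hypothetical names throughout.
class TrackAllocator:
    def __init__(self, sectors_per_track):
        self.sectors_per_track = sectors_per_track
        self.next_track = 0
        self.current = None  # (track number, next free sector) or None

    def allocate(self, sectors):
        """Return (track, starting sector) for an object of the given size."""
        if sectors > self.sectors_per_track:
            raise ValueError("object larger than one track")
        # Open a fresh track when none is open or the object will not fit.
        if self.current is None or self.current[1] + sectors > self.sectors_per_track:
            self.current = (self.next_track, 0)
            self.next_track += 1
        track, offset = self.current
        self.current = (track, offset + sectors)
        return track, offset


alloc = TrackAllocator(sectors_per_track=8)
for size in (3, 3, 3, 5):
    print(size, alloc.allocate(size))
```

Objects allocated close together in time end up on the same track, which is
the sense in which the result resembles an extent-based layout with
extent==one track.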

A few applications, of course, need better control than this. For this purpose
we have architected (but not implemented) a storage allocator that guarantees
*physically* contiguous clusters on the disks. I submit that such applications
are quite rare, and that their performance is sufficiently critical that minor
customizations of allocation strategy are entirely appropriate.

>If the RAM is much smaller than the disk (like in anything but a
>TPC benchmark test) there will be lots of disk reading. In
>most systems this is through explicit reads. I assume in Eros
>there is a tremendous number of page breaks and VM swaps (am
>I wrong?). Doesn't this rather pull the large disk block
>operation concept apart?

The equivalent in EROS is page faults. The design permits I/O to be usefully
done in extents, and this would plug very naturally into the current design. We
have no data one way or the other about cost/benefit on this. There is also an
architected (not implemented) non-faulting page touch that can be used to
perform application read-ahead.
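
The read-ahead idea can be sketched with a small simulation in Python. This is
only a model under assumptions: PagedStore, touch, complete_io and scan are
hypothetical names, and complete_io stands in for background I/O finishing
between accesses. It is not the architected EROS interface.

```python
# Toy simulation of application read-ahead via a non-faulting page touch.
class PagedStore:
    def __init__(self, disk):
        self.disk = disk      # page number -> contents
        self.resident = {}    # pages currently in memory
        self.pending = set()  # pages requested by touch(), not yet loaded
        self.faults = 0       # number of blocking page faults taken

    def touch(self, page):
        """Non-faulting touch: request the page but never block on it."""
        if page not in self.resident:
            self.pending.add(page)

    def complete_io(self):
        """Stand-in for the background I/O for touched pages completing."""
        for page in self.pending:
            self.resident[page] = self.disk[page]
        self.pending.clear()

    def read(self, page):
        """Ordinary access: takes a (blocking) fault if the page is absent."""
        if page not in self.resident:
            self.faults += 1
            self.resident[page] = self.disk[page]
        return self.resident[page]


def scan(store, pages, lookahead):
    """Read pages in order, touching the next `lookahead` pages ahead of use."""
    out = []
    for i, page in enumerate(pages):
        for nxt in pages[i + 1 : i + 1 + lookahead]:
            store.touch(nxt)
        out.append(store.read(page))
        store.complete_io()
    return out


disk = {n: f"page{n}" for n in range(8)}
plain, ahead = PagedStore(disk), PagedStore(disk)
scan(plain, list(range(8)), lookahead=0)
scan(ahead, list(range(8)), lookahead=1)
print(plain.faults, ahead.faults)
```

In this model a linear scan with one page of lookahead blocks only on the
first page, while the same scan without touches blocks on every page. Real
behavior would depend on whether the I/O actually completes before the access.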

The large block I/O claim concerned writes rather than reads. I agree with the
argument that you are implicitly making, which is that read prediction is
exceedingly important in large apps. On the other hand, my guess is that apps
for which read I/O is a big issue either exhibit linear fetch or are already
designed to manage their own I/O explicitly.

Again, we have no hard data on this unless Charlie Landau has experience from
KeyTXF that he can share.

I think I answered your question about read/write clustering vs. modern drives
above. If I didn't answer it to your satisfaction, please do follow up.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595
