Re: Some very thought-provoking ideas about OS architecture.

Eric S. Raymond (esr@snark.thyrsus.com)
Sun, 20 Jun 1999 08:39:05 -0400


(Apologies for losing the thread ID. Alan's mail to me bounced.)

Alan Cox writes:
> > EROS is built around two fundamental and intertwined ideas. One is
> > that all data and code persistence is handled directly by the OS.
> > There is no file system. Yes, I said *no file system*. Instead,
> > everything is structures built in virtual memory and checkpointed out
> > to disk every so often (every five minutes in EROS). Want something?
> > Chase a pointer to it; EROS memory management does the rest.
>
> This is actually an old idea. The problem that has never been solved well is
> recovery from errors. You lose 1% of your object store. How do you tidy up.
> 20% of your object store dies in a disk crash, how do you run an fscobject
> tool. You can do it, but you end up back with file system complexity and
> all the other fs stuff.

Accepting your analysis, it still seems to me there's a difference,
though. In an EROS-like world, you would only pay the complexity cost of doing
fscobject-like things in the postmortem analyzer that's trying to
stitch together the remaining pieces. You wouldn't have to pay that same
cost in the kernel for each and every access to persistent stuff; no
namespace management to worry about.

So, yes. An EROS-like architecture has the same error-recovery
problem that fsck addresses. But it appears to me, at least as far as
I've taken the logic, that that problem would be better contained than
in a Unix-like system.

> Another peril is that external interfaces don't always like replay of events.

A much more serious objection, I agree.

> You still end up with a lot of your objects having checkpoint/restart aware
> methods.

Yes, I grant that's true. (The way I'd put it is that you still need
something like commit/rollback in database-land.) But this is a solvable
problem. Butler Lampson showed years ago how to do provably correct
serialization of access to shared critical regions with timestamps
even in the absence of reliable locks. So as long as your
hypothetical user can't futz with the system clock...

> Moving just some objects between systems is fun too. You then get into
> cluster checkpointing, which is a field that requires you wear a pointy hat,
> have a beard and work for SGI or Digital.

Not something I have opinions about -- or am qualified to. :-)

> Their numbers are for a microkernelish core. They are still very good, but
> that includes basically no drivers, no network stack, no graphics and apparentlyno real checkpoint/restart in the face of corruption. I may be wrong on the
> last item.

You're probably right; I'm told all EROS actually does at this point
is run its own debugging and benchmarking tools. Still, the fact that
the test kernel can be that small is IMO an argument that the design
is sound.

> That nature of I/O is no different. If you always do large sequential
> block writes tell me how it will outperform a conventional OS if only
> a small number of changes in a small number of objects occur.

No seeks to read inodes, because the map from EROS's virtual-space
blocks to disk blocks is almost trivial (essentially the disks get
treated like a honkin' big series of swap volumes). So the disk
access pattern would be quite different, I think.

> Object stores are great models for some applications, thats why libraries
> for doing persistent object stores in application space exist (eg texas)
>
> Another way to look at this
>
> File System Object Store
>
> Index Inode Number Object ID
> Update Look in directory Look in an object
> Find item Find item location
> Write(maybe COW) Write(maybe COW)
> Page In Look in directory Look in an object
> Find item Find item location
> Write(maybe COW) Write(maybe COW)
> Granularity User controlled Enforced by OS
>
>
> So if I promise to call my inodes object ids, call the directory structure
> "objects" and I have a checkpointing scheme - what is the great new concept.

That, under most circumstances, you don't have to manage persistence
yourself (or to put it more concretely, no explicit disk I/O in most
applications). That's clearly a huge win, even if you end up having to do
more conventional-looking things in applications that require
commit/rollback.

And it's not clear to me that you do end up there; with one single
added atomic-flush primitive, I think you could use Lampson's
timestamp trickery to do reliable journalling without having to go all
the way to fs-like namespace management.

> o I don't think the object model is the good stuff

Even if you're right...

> o The security model is very very interesting indeed.

...this is still very true.

> o They are making it hard to help them however.

This is indeed true. However, I may have some leverage on a win-win
solution. But that's a topic for another day.

What I'm thinking is this: remember RT-Linux? Suppose the kernel were
a process running over an EROS-like layer...
---
<a href="http://www.tuxedo.org/~esr">Eric S. Raymond</a>

To make inexpensive guns impossible to get is to say that you're
putting a money test on getting a gun. It's racism in its worst form.
-- Roy Innis, president of the Congress of Racial Equality (CORE), 1988

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/