Re: Questions about EROS

shapj@us.ibm.com
Thu, 24 Jun 1999 12:09:51 -0400


This is the response to the allocation policy question, which can be found
below.

First, I want to emphasize that when we talk about "objects" in connection with
EROS, we are talking about pages, not about programming language units.

*** Handling of short-lived objects:

In EROS, what it means to be a short lived object is that the object is
allocated from a space bank (the EROS storage allocator), used a bit, and
returned to the space bank. On return, the space bank zeroes the object and
efficiently rescinds (using a kernel mechanism) all access to that object.

There are several layers of relevant optimizations.

First, if the object existed only within the current checkpoint interval, it
will never go to the disk.

Second, if the object is subsequently *reallocated* within the current
checkpoint interval, the old one will neither go to the checkpoint log nor to
the "home location."

Third, if an object in the checkpoint area has already been re-modified at the
time the migrator gets to it, it doesn't get moved to the home location out of
the log. The correctness argument here is non-obvious, but this is safe.

Finally, newly zeroed objects are recognized as an important case by the
implementation -- in particular zeroed pages. A page full of zeros is never
actually written to either the checkpoint log (the write-ahead log) or the disk.
In the log case, there is a distinguished value in the log directory entry
saying "this object is zero." In the "home location" objects are divided into
clusters of roughly 1000 pages. Each cluster has a side page that contains
per-object metadata, including a flags field that indicates (among other things)
whether the object is zero.

In effect, then, there are four layers of "hotness": stuff that only lives in
memory, stuff that never makes it out of the checkpoint area, stuff that never
gets copied by the migrator, and stuff that is copied using the efficient zero
case. Bear in mind that the checkpoint area covers a small number of colocated
tracks, and that the disk head is within this range with high frequency; the
usual seek cost models don't entirely apply.

** Memory system vs. file system

EROS is a single level store. For every object in memory we know exactly where
on the disk it came from. The location on the disk is determined at the time
the object is allocated rather than at the time the object is written. The key
to having good read performance later is to do your layout planning when you
allocate rather than when you write. I've hinted in previous notes at how the
space bank accomplished this. In conventional designs, the problem with this is
that it messes up your write locality. In EROS, the combination of the
append-only log plus the bulk-sorted migration largely recovers write locality.
Because the checkpoint area provides a transacted log, there is no need to order
writes to cover metadata, which reduces the number of writes by nearly 50%
relative to ext2fs (last time I looked -- these things definitely change!) for
small files, and consequently reduces the number of disk seeks associated with
write ordering. Eliminating the disk seeks is a big win -- you might find that
your BDD example worked okay.

Rather than cause fragmentation, the memory checkpoint actually tends to force
hot things to be accessable within the checkpoint area. Also, it permits bulk
motion later, as previously described. Much of the modified memory state that
is appended to the checkpoint area never gets written back to home locations,
because it is remodified before the migrator can move it. Your BDD example is a
great case in point. It heavily modifies a huge pile of stuff. At checkpoint
time the whole thing gets marked COW, and then gets written in sequential order
to the checkpoint log. By the time the migrator goes to migrate it, it's all
been modified again so there is no further work to be done.

The fact that the checkpoint log write is sequential matters quite a lot --
there are very few seeks associated with this I/O, and to first order all of the
I/O cost is in the seeks. There is still a rotational delay, and this shouldn't
be trivialized, but with write-back track caches on modern drives a lot of that
can be hidden.

In fact, if we take your example of 100Mb of data, and use a typical modern IDE
drive, where a reasonable average number of sectors/track is 120, or 60
Kbytes/track. We need to write 1667 tracks to get the data out to the disk.
On average, we will see 1/2 rotation per track plus one head-to-head or
track-to-track seek (the two have roughly the same latency for adjacent tracks
on a modern drive). Using the figures for the Quantum Fireball 1280AT, which
isn't the newest disk on the planet, we get 5.56ms per full rotation plus 3ms
per track-to-track seek, so we can get your data out in roughly 9.7 seconds,
which is actually only a 3.2% slowdown in your example, which is THE WORST CASE.
On this drive, the bus will keep up, so this is a reasonably good prediction.

Frankly, for the amount of simplification and overall performance that it gets
us, I'll take a 3.2% overhead on the corner cases.

All that having been said, I'm not convinced that I've been clear enough about
why these I/O's aren't noise relative to the allocator. Perhaps it will help to
emphasize that when you allocate a page from a space bank you are allocating a
real disk page in a known location (known to the space bank, at least), not a
probabilistic page for which it is dearly hoped that a location will be found
successfully later (e.g. consider that many UNIX systems oversubscribe the swap
area).

** Read locality

So finally, we get to read locality.

There are really two parts to this -- metadata and prefetching. Mostly, you
want a decently efficient way to get the metadata in, and then you want a decent
way to prefetch when things really go to the wall. The real question is can you
prefetch the metadata effectively so as to get a force multiplier on the page
prefetching.

The answer, briefly, is yes. The space manager that allocates this object
effectively allocates the object in preorder. Every metadata page fetched
effectively tells you where the next 224 data pages are going to be. If you are
feeling aggressive, you can in principle get all 224 pages loaded in a single
pass over the disk, limited by your working set bounds. The kernel doesn't
presently do any automatic prefetching, though we think we know how to do that.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595

Iain McClatchie <iainmcc@ix.netcom.com> on 06/23/99 08:10:19 PM

To: Jonathan S Shapiro/Watson/IBM@IBMUS
cc: linux-kernel@vger.rutgers.edu
Subject: Questions about EROS

Jonathan,

The discussion about EROS has been quite entertaining. I have three
questions.

1. Massive temporary computation caches
2. Software bugs and robust versus execution states
3. Allocation policy

[First two questions deleted for brevity, see previous response]

Allocation policy

A few months ago we had a wonderful debate on linux-kernel about the
goodness of reiserfs, which is a new filesystem being developed by
Hans Reiser. I took away from that debate a notion that any decently
designed filesystem can find and read data already on the disk in a
minimum of seeks. The number of seeks, and the overall performance of
the filesystem is determined by their allocation policy: how they lay
the data down on the disk.

The filesystem in a persistent object system has the same problem
presented to it as the filesystem in a UNIX system, except (a) it is
also expected to store many short-lived objects, and (b) the entire
contents of memory are regularly swapped to disk. It seems that both
(a) and, to a lesser extent, (b) would form a kind of noise would tend
to swamp the signal that your filesystem allocator is looking for. If
you think of the allocator as being something like the malloc()
algorithm in a garbage collector, the programs in a UNIX system mostly
seperate the long-lived objects from the short-lived objects before
they get to the allocator.

Does EROS have some sort of system for implicitly or explicitly
recognising long-lived versus short-lived objects?

-Iain
iainmcc@ix.netcom.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/