Re: latest linus-2.5 BK broken

From: Andreas Dilger (adilger@clusterfs.com)
Date: Thu Jun 20 2002 - 02:26:17 EST


On Jun 19, 2002 22:24 -0700, Larry McVoy wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)

Well, I can speak to this a little bit... Given Lustre's ext3
underpinnings, we have been thinking of some interesting methods
by which we could take an existing ext3 filesystem on a disk and
"clusterify" it (i.e. have distributed coherency across multiple
clients). This would be perfectly suited for application on a
CC cluster.

Given that the network communication protocols are also abstracted
out from the Lustre core, it would probably be trivial for someone
with network/VM experience to write a "no-op" networking layer
which basically did little more than passing around page addresses
and faulting the right pages into each OSlet. The protocol design
is already set up to handle direct DMA between client and storage
target, and a CC cluster could also do away with the actual copy
involved in the DMA. We can already do "zero copy" I/O between
user-space and a remote disk with O_DIRECT and the right network
hardware (which does direct DMA from one node to another).

> "Paul McKenney" <Paul.McKenney@us.ibm.com> writes:
> Access to Devices Owned by Some Other OSlet
>
> Larry mentioned a /rdev, but if we discussed any details
> of this, I have lost them. Presumably, one would use some
> sort of IPC or doors to make this work.

I would just make access to remote devices act like NBD or something,
and have similar "network/proxy" kernel drivers to all "remote" devices.
At boot time something like devfs would instantiate the "proxy"
drivers for all of the kernels except the one which is "in control"
of that device.

For example /dev/hda would be a real IDE disk device driver on the
controlling node, but would be NBD in all of the other OSlets. It would
have the same major/minor number across all OSlets so that it presented
a uniform interface to user-space. While in some cases (e.g. FC) you
could have shared-access directly to the device, other devices don't
have the correct locking mechanisms internally to be accessed by more
than one thread at a time.

As the "network" layer between two OSlets would run basically at memory
speeds, this would not impose much of an overhead. The proxy device
interfaces would be equally useful between OSlets as with two remote
machines (e.g. remote modem access), so I have no doubt that many of
them already exist, and the others could be written rather easily.

> Access to Filesystems Owned by Some Other OSlet.
>
> For the most part, this reduces to the mmap case. However,
> partitioning popular filesystems over the OSlets could be
> very helpful. Larry mentioned that this had been prototyped.
> Paul cannot remember if Larry promised to send papers or
> other documentation, but duly requests them after the fact.
>
> Larry suggests having a local /tmp, so that /tmp is in effect
> private to each OSlet. There would be a /gtmp that would
> be a globally visible /tmp equivalent. We went round and
> round on software compatibility, Paul suggesting a hashed
> filesystem as an alternative. Larry eventually pointed out
> that one could just issue different mount commands to get
> a global filesystem in /tmp, and create a per-OSlet /ltmp.
> This would allow people to determine their own level of
> risk/performance.

Nah, just use a cluster filesystem for everything ;-). As I mentioned
previously, Lustre could run from a single (optionally shared-access) disk
(with proper, relatively minor, hacks that are just in the discussion
phase now), or it can run from distributed disks that serve the data to
the remote clients. With smart allocation of resources, OSlets will
prefer to create new files on their "local" storage unless there are
resource shortages. The fast "networking" between OSlets means even
"remote" disk access is cheap.

Cheers, Andreas

--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Jun 23 2002 - 22:00:21 EST