Re: Fwd: v9fs (-mm -> 2.6.13 merge status)
From: Ronald G. Minnich
Date: Tue Jun 21 2005 - 16:19:59 EST
On Tue, 21 Jun 2005, Eric Van Hensbergen wrote:
> On 6/21/05, Andrew Morton <akpm@xxxxxxxx> wrote:
> >
> > v9fs
> >
> > I'm not sure that this has a sufficiently high
> > usefulness-to-maintenance-cost ratio.
> >
I got pointed at this discussion. Here are my $.02 on why we at LANL are
interested in v9fs.
We build clusters on the order of 2000 machines at present, with larger
systems coming along. The system which we use to run these clusters is
bproc. While bproc has proven to be very powerful to date, it does have
its limits:
- requires a homogeneous system
- the network protocols it uses, while simple, are somewhat ad-hoc
(as is common in this type of system)
- if you are on a bproc system as user x, using 25% of the system,
you still see 100% of the processes. This is a bit of a security issue.
We have a desire to build single-system-image-style clusters along the
bproc model, but at the same time to compose those clusters of, e.g.,
Opterons and G5s. This mixing is highly desirable for computations that
have phases, some of which belong on one type of machine and some on
another.
We are going to use v9fs as the glue for our next-generation cluster
software, called 'xcpu'. Xcpu has been implemented on Plan 9 and works
there. I have ported xcpu to Linux, using v9fs as the client side and
Russ Cox's plan9ports to write the servers.
xcpu presents a remote execution service as a 9p server. xcpu has been
tested across architectures and it works very well. By summer 2006, we
hope to have cut over our bproc systems to xcpu.
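To give a flavor of the model, here is a minimal client-side sketch;
everything is a file, so starting a job is just a write to a control
file served over 9p. The mount point, node name, and control command
below are made up for illustration and are not the real xcpu interface:

    /*
     * Sketch only: driving a file-based remote execution service from
     * the client side, once the 9p server is mounted.  The mount point,
     * node name, and control command are hypothetical, not the actual
     * xcpu interface.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *cmd = "exec /tmp/a.out\n";  /* hypothetical ctl command */
        int fd = open("/mnt/xcpu/node042/ctl", O_WRONLY);

        if (fd < 0) {
            perror("open ctl");
            return 1;
        }
        if (write(fd, cmd, strlen(cmd)) < 0)
            perror("write ctl");
        close(fd);
        return 0;
    }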
That's one use for v9fs. We also plan to use v9fs to provide us with
servers for global /proc, monitoring, and control systems for our
clusters.
The global /proc is interesting. bproc provides a global /proc, but it is
incomplete; entries such as exe and maps are not filled in. bproc also
caches part of /proc, but the rules about what is cached, and what the
timeouts are, live in the kernel module and are not easily changed. We are
going to have an "aggregating" user-level 9p server based on
Mirtchovski's aggrfs, which will both aggregate all the cluster nodes and
apply caching rules that make sense in clusters of 1000s of nodes (for
example, it is ok to cache /proc/x/status; there is no need to cache
/proc/x/maps, and you probably don't want to anyway).
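The kind of per-file policy we have in mind is simple enough to sketch;
the entries and timeouts below are illustrative only, not anything we
have settled on:

    /*
     * Per-file caching policy for the aggregated /proc, expressed in
     * the user-level server instead of the kernel module.  Entries and
     * timeouts are illustrative only.
     */
    #include <string.h>

    struct cache_rule {
        const char *name;   /* entry under /proc/<pid>/ on each node */
        int ttl;            /* seconds to cache; 0 means always refetch */
    };

    static const struct cache_rule rules[] = {
        { "status", 5 },    /* small, changes slowly: fine to cache */
        { "stat",   5 },
        { "maps",   0 },    /* large and volatile: don't cache */
        { "exe",    0 },
    };

    /* How long may a cached copy of this /proc entry be served? */
    static int cache_ttl(const char *name)
    {
        unsigned int i;

        for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
            if (strcmp(rules[i].name, name) == 0)
                return rules[i].ttl;
        return 0;           /* unknown entries: always refetch */
    }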
A neat capability is that if we give a user, e.g., 25% of the cluster, we
can tailor that user's name space so that they only see their procs and
the 25% of the cluster they own. This is good for security, but also good
for convenience: most users don't really care that some other user is on
75% of the cluster. Global pid spaces are neat in theory, messy in
practice at large scale. I want my global pid space to be global to *me*,
meaning I see the global space of the nodes I care about. The sysadmin,
of course, wants to see everything. All of this is possible. V9fs, along
with Linux private name spaces, will let us provide this model: a user can
see some or all of the global pid space, or be constrained to just the
part of the cluster they own, as need and site policy dictate.
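Here is a rough sketch of how those pieces could fit together on the
Linux side. The server address, mount point, and mount options are
invented for illustration, and I'm assuming a kernel with unshare(2);
clone() with CLONE_NEWNS does the same job:

    /*
     * Sketch of the namespace trick: give a login shell a private mount
     * namespace and mount only that user's slice of the cluster /proc.
     * The server address, mount point, and v9fs options are invented
     * for illustration.  Must run privileged.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
        /* Detach from the global mount namespace. */
        if (unshare(CLONE_NEWNS) < 0) {
            perror("unshare");
            return 1;
        }

        /* Mount the aggregating 9p server that exports just this user's
         * nodes (mount propagation details omitted for brevity). */
        if (mount("10.0.0.1", "/mnt/proc", "9p", 0,
                  "trans=tcp,port=564") < 0) {
            perror("mount 9p");
            return 1;
        }

        /* Children of this shell see only the user's share of the cluster. */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        return 1;
    }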
9p will also replace the Supermon protocol, allowing people to easily view
status information in a file system.
In addition to the cluster usage, there is also grid usage. The 9grid,
composed of Plan 9 systems, is connected by 9p servers. Linux systems can
join the 9grid with no problem, once Linux has v9fs.
Were v9fs just a file system, I would not really be interested in it one
way or another; we have NFS, after all. But v9fs is really the key piece
of a new model of cluster services we are building at LANL. 9p will be the
glue, and v9fs will be the needed client side for hooking 9p servers into
the file system name space.
I'm hoping we can see v9fs in the kernel someday.
thanks
ron